Once upon a time, a customer got in touch with us about a font rendering problem, where instead of seeing the characters he expected to see inside his applications, he saw question marks.
- What’s your system locale? we asked.
- I’m in Cleveland, he said, in my upstairs home office.
Some flavor of this story has been going around our group for years, and I’d call it an urban legend if it weren’t for the fact that I’ve observed several flavors of the story myself. The truth is that we have way too many ways for users to indicate their language (and region) preferences. The proliferation of settings not only confuses users but also makes it nearly impossible for developers to understand which system setting should drive which aspect of the user experience that they’re trying to create. Over at Go Global there is some documentation (as well as somewhat older stuff on the old globaldev site) that is designed to help users and developers differentiate between user locale, default user locale, system locale, and input locale, but the fact is that these don’t even constitute the full range of settings available to users on several common installations of Windows. A partial list of the settings that users and/or developers are asked to make sense of:
- User locale
- System locale
- Input locale
- Thread locale
- Default location or geoID
- System UI language
- User UI language
On top of these, users may also encounter the browser Accept Language, language settings in Office or other productivity suite software (sometimes including separate settings UI for every application in the same productivity suite), and language and region preference UI exposed to them from various web services (even multiple times across different web services provided by the same publisher). From the user’s perspective, she’s stuck entering the same information over and over again, in UIs that are different enough to be non-intuitive but similar enough in goal to make her wonder why she’s repeating the same task every time she goes to a new website or opens a new application.
And one of the worst parts about this confusing proliferation of settings is that not one of them is reliable to tell you anything about the particular language that a user cares about at any given point in time; the best they can do is give you a ballpark guess as to a user’s typical intentions or behavior. Throw a little multilingualism into the picture, where a user may regularly interact with a computer in more than one language, and things get even more convoluted.
One of the biggest reasons we introduced ELS language detection is to give developers a way to know what language their user cares about much more scenario-specifically; developers of any Windows application on Windows 7 can now find out the user’s active computing language in text input or reading scenarios simply by passing the text to ELS. This means that in many cases, developers can stop using system settings to make swags at user experience (though the user settings may still end up being used as a fallback for language detection, on which more in a future post).
Next up: Which system setting should you use as a fallback to language detection? More to come.
Okay so no, when we say transliteration, we don’t mean translation. But they do have some things in common.
Understanding the difference between language and writing system – which naturally everyone does now, having read the previous blog post – is essential to understanding the difference between translation and transliteration.
As discussed earlier, a script, or a writing system, refers to the way in which a language is written down. To moderately belabor the point:
| String | Language | Script |
|---|---|---|
| cat | English | Latin |
| Katze | German | Latin |
| кошка | Russian | Cyrillic |
| قطة | Arabic | Arabic |
| 고양이 | Korean | Hangul |
| 猫 | Chinese | Simplified Chinese |
| 貓 | Chinese | Traditional Chinese |
There is no 1:1 correlation between the languages of the world and their writing systems. In some cases, a particular writing system may be used to transcribe a single language (e.g. Thai, Malayalam), but in other cases, a writing system may be used to transcribe multiple languages (e.g. Latin, Cyrillic). In a few cases labels confuse people; for instance, the Arabic writing system is used to represent the Arabic language, along with several others (e.g. Urdu, Dari, Persian).
When people talk about translation, they typically refer to the mapping of semantic content from one language into another. Each string in the first column above represents a translation of the English word cat into some other language. This is famously difficult to do well – not just for computers, but even sometimes for bilingual humans. It’s all well and good when you limit yourself to translating the word cat, but it gets awfully hard when you move to anything with syntactic complexity, and unbelievably hard when you move to anything of significant length. Anyone who has ever studied a foreign language probably has a pretty good idea of how hard this is. Even fluently bilingual people tend to have a hard time doing this to their own satisfaction.
Transliteration is another kind of mapping, only instead of mapping from one language to another, transliteration maps from one writing system to another. If you think of a writing system as a purely notational form for designating the spoken language, it’s easy to see where this kind of mapping might come in handy. By transposing strings from an unfamiliar writing system to a familiar one, the reader knows how to pronounce – and in some cases therefore semantically interpret (but more on that below) – the content. Japanese children, for instance, encounter unfamiliar Kanji all the time as they’re learning to read. Oftentimes they may know the word that the Kanji is representing in the spoken language, but they have no way to connect the word they’re used to hearing with the Kanji that they see in front of them. In fact, Japanese has a phonetic notational system – Yomigana – expressly designed to help Japanese speakers (adults as well as children) read unfamiliar Kanji terms, and there is a standard transliteration of Japanese into Latin script that is frequently used by non-native speakers of Japanese to help them pronounce new Japanese words that they encounter.
In many cases, transliteration really refers to some kind of phonetic transcription, helpful either for non-native speaker scenarios as above or in cases where a particular language may be standardly written in multiple writing systems (e.g. Serbian, which may be written in Latin or Cyrillic scripts). However, in one of the most common use cases for transliteration – the mapping between Traditional and Simplified Chinese – sound doesn’t enter into it, as both writing systems are ideographic and not alphabetic or syllabic. This mapping counts as transliteration rather than translation because it is a transposition between two writing systems that represent the same underlying spoken language, rather than a transposition between two distinct spoken languages.
ELS in Windows 7 exposes programmatic support for several transliteration modules that you can use to help shape the user experience for your international customers, including:
· Traditional Chinese <> Simplified Chinese: Use this to reduce localization costs for your application, or to extend the reach of your content to new customers
· Cyrillic > Latin normalization, with a focus on Serbian and Russian: This can also be used to reduce localization costs, or in some cases to enable second language learning scenarios
· Indic > Latin normalization (Devanagari, Bengali, Malayalam): These are phonetic transcriptions that are primarily useful in second language learning or communication scenarios
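To make the translation/transliteration distinction concrete, here is a minimal Python sketch of a script-to-script mapping. It is a toy and not the ELS service or its data: the table is the standard one-to-one correspondence between the Serbian Cyrillic and Serbian Latin alphabets, and the names `SERBIAN_CYRILLIC_TO_LATIN` and `transliterate` are made up for illustration. Note that the language never changes; only the writing system does.

```python
# Toy Serbian Cyrillic -> Latin table (illustrative only; not ELS data).
SERBIAN_CYRILLIC_TO_LATIN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}


def transliterate(text: str) -> str:
    """Map Serbian Cyrillic letters to Latin; pass everything else through."""
    out = []
    for ch in text:
        latin = SERBIAN_CYRILLIC_TO_LATIN.get(ch.lower())
        if latin is None:
            out.append(ch)                  # not a Serbian Cyrillic letter
        elif ch.isupper():
            out.append(latin.capitalize())  # "Љ" becomes "Lj", not "LJ"
        else:
            out.append(latin)
    return "".join(out)


print(transliterate("Београд"))  # Beograd
```

A Traditional/Simplified Chinese converter has the same shape (a mapping between two writing systems for one language), but the table is far larger and not always one-to-one, which is part of why it is nice to have the OS do it for you.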
Two of the first services out of the gate for ELS are language detection and script detection, both of which are available in the PDC build of Windows 7. Both of these are already being used by different components in different ways and I thought it might be useful to do a general overview of what they are, how they differ, and how you can use them.
First, let’s start with defining what we’re detecting. A language for ELS refers to the semantic content of a string, and corresponds pretty much to your intuitive notion. English is a language. French is a language. Chinese is a language. A script in this context refers to the writing system that the string is encoded in. The Chinese language, for instance, is written in two scripts: Simplified, which happens to be used in PRC, and Traditional, which happens to be used in Hong Kong and Taiwan. Cyrillic is a script. Devanagari is a script. Arabic is a language, but it’s also the name of a script (which is used to write the Arabic language as well as some other languages, such as Urdu and Pashto). ELS allows you to retrieve both the language and the script of whatever string you pass in. If you pass this paragraph, you’ll discover that it’s the English language, written in the Latin script.
When it comes to language detection, we do better and better the more text you pass us. If you pass this string:
This is an English sentence.
We’ll return you a guess list of languages in order of our confidence. In this case, the sentence is English, so English (represented with its appropriate ISO code, en) will be returned first on the list. If we make the string a bit shorter:
This is an
We have less data to work with, but we can guess that this is an English fragment, and so our guess list will still return en at the top of the list. As we get smaller and smaller:
This is
This
Thi
Th
T
We’ll return fewer and fewer guesses. We can still handle fragments that are recognizably English, such as the first item on this list, but by the time we work down to the single character or word fragments at the bottom of the list, we’ll start returning an empty string; we’d rather tell you we don’t know than make a bad guess. One of our general goals is to do whatever humans can do in terms of language detection. It’s a good bet that if a human can’t tell what language the single T belongs to, then our language detection can’t either.
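If it helps to see the "ranked guess list, or nothing at all" behavior in miniature, here is a toy Python sketch. It is emphatically not how ELS works: it just counts hits against tiny hand-picked function-word lists, and `FUNCTION_WORDS`, `guess_languages`, and `min_hits` are all invented for this example. The point is only the shape of the result: more text gives more evidence, and too little evidence gives an empty answer rather than a bad guess.

```python
import re
from collections import Counter

# Tiny, hand-picked function-word lists (an assumption for this sketch).
FUNCTION_WORDS = {
    "en": {"this", "is", "an", "the", "of", "and", "to"},
    "de": {"das", "ist", "ein", "und", "der", "die", "nicht"},
    "fr": {"le", "la", "est", "un", "une", "et", "les"},
}


def guess_languages(text, min_hits=1):
    """Return language codes ordered by evidence; empty list if unsure."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    hits = Counter()
    for lang, words in FUNCTION_WORDS.items():
        hits[lang] = sum(1 for t in tokens if t in words)
    # Keep only languages with enough evidence, strongest first.
    return [lang for lang, n in hits.most_common() if n >= min_hits]


print(guess_languages("This is an English sentence."))  # ['en']
print(guess_languages("Thi"))                           # []
```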
However, while a single T doesn’t clearly belong to any particular one language, ELS (and a human) can tell with confidence what script it belongs to, since script identification is a property at the character level. So if you pass a mixed script string to ELS and ask for script detection:
I used to know 望遠鏡で日本人男性.
We’ll identify each script range that we find, telling you which characters are Latin, which are Katakana or Hiragana, and which are Kanji, along with their position in the string. If you pass the same string into language detection, we’ll tell you that we find English and we’ll tell you that we find Japanese, but we don’t break the results into associated ranges. This is arguably something that we should do, but it isn’t implemented this way today (and it’s a non-trivial problem). For this reason, a number of ELS callers are using script and language detection in combination, first passing strings into script detection to range-break, and then passing the resulting subranges back into language detection to up the range accuracy.
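Here is a rough Python sketch of what range-breaking by script looks like. It is an approximation under stated assumptions, not the ELS implementation: it infers a script label from the prefix of each character's Unicode name, knows only a handful of scripts, labels Kanji as "Han", and lets spaces and punctuation ride along with the surrounding run. The function names `script_of` and `script_runs` are made up for this example.

```python
import unicodedata


def script_of(ch):
    """Rough script label taken from the Unicode character name."""
    name = unicodedata.name(ch, "")
    if name.startswith("LATIN"):
        return "Latin"
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "Han"  # Kanji, in a Japanese context
    if name.startswith("HIRAGANA"):
        return "Hiragana"
    if name.startswith("KATAKANA"):
        return "Katakana"
    if name.startswith("CYRILLIC"):
        return "Cyrillic"
    return None  # space, punctuation, digit, or a script this sketch omits


def script_runs(text):
    """Return (script, start, end) runs, with end exclusive."""
    runs = []
    for i, ch in enumerate(text):
        script = script_of(ch)
        if script is None:
            continue  # neutral characters never start or break a run
        if runs and runs[-1][0] == script:
            runs[-1][2] = i + 1  # extend the current run
        else:
            runs.append([script, i, i + 1])
    return [tuple(r) for r in runs]


for run in script_runs("I used to know 望遠鏡で日本人男性."):
    print(run)
# ('Latin', 0, 14)
# ('Han', 15, 18)
# ('Hiragana', 18, 19)
# ('Han', 19, 24)
```

With runs like these in hand, a caller can slice the original string by position and hand each slice to language detection separately, which is the combination described above.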
One question that people are asking very frequently is how long strings need to be in order to rely on the accuracy of ELS's language detection support. The truth is that it depends somewhat on the string itself. Languages that by virtue of their writing system can be uniquely identified (e.g. Thai) will return accurate results from single-character strings, since those strings cannot be any other language. Where a writing system can be used to represent multiple languages (ELS supports dozens of Latin-script languages, for instance), syntactic fragments (not necessarily complete sentences) are often required for perfect accuracy. ELS provides script detection for every writing system in Unicode 5.1 and language detection for roughly 100 languages. We often end up back at the generalization above; if a human can't tell what language it is, it's a good bet that ELS can't either. And if we're not sure, we'd rather tell you we're not sure than offer you a false positive result.
More usage tips to come!
Can't we all just get along?
One of the struggles that we go through every release is in figuring out how to handle updates to the behavior of linguistic sorting. People rely on NLS APIs to sort strings appropriately for display to users. That means that strings need to be in the right order, where the right order is defined as the order that's going to allow a user to locate some particular item in a sorted list. Most of the time the expected order for a user who speaks some particular language doesn't change a whole lot. A lot of the time the expected order reflects something that people learned in school, or see consistently in dictionaries and encyclopedias and phone books. Oftentimes the order of encoding in Unicode is pretty close to what users expect. And sometimes where the above do not apply, we find highly intuitive and aware native speakers who not only have an innate awareness of what to expect in terms of sort order, but can also articulate why.
Then there are all the other times.
And many of the languages for which technology support is more recent? Are the other times.
For at least one language that we added during Vista, there was not only no national standard available, but four published dictionaries used four different (and internally inconsistent) sort orders. The articulate and perceptive native speakers working with us? Did not agree with each other. For many other languages, literacy patterns are emergent enough that native speaker intuitions are quite unclear and often wrapped up in other issues of cultural identity. There are groups pushing to follow sort orders and literacy traditions from other languages with overlapping character sets. There are groups pushing to establish local standards. There are still other languages for which there are no groups at all.
And yet, in many of these places, people are using computers. And the moment people can read and enter text on the screens in front of them, they're gonna be enveloped in sorted data. In the file system. In Excel files and documents. In other application UI. And for all of those users in all of those languages, the lists need to be predictable and linguistically appropriate. When for whatever reason it's not predictable or appropriate, people tell us about it, and we have bugs to fix. Even for languages with long literacy traditions and relatively advanced technology support, every once in a while the group of people who own the language – and here I mean a speech community of native speakers, who may or may not be aligned to a governmental language authority – change how things work. Spelling reforms get all the media play, but dictionary order reforms happen too. And when that happens, sorting behavior that may have been appropriate at the time we started supporting it stops being appropriate for the users it was intended to support. And then there are all the new characters that get added into Unicode, as new scripts are encoded and new code points are added to existing scripts. These characters get encoded because users need them.
So every time we release a version of Windows or the .NET framework, we end up addressing some issues that reflect either changes in local expectations or bugs in things we’ve shipped before. For applications that rely on us for linguistic correctness of sorted strings, this is a good thing. It means that app developers never need to roll their own behavior for linguistic appropriateness – they need to know how to make the right CompareStringEx() call with the right locale information for their user and they’re done.
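To see why "the right order" is not the same thing as raw code point order, here is a small Python comparison. The key function is a deliberately crude stand-in for linguistic sorting, not what CompareStringEx() does: it strips accents and ignores case, which is roughly right for some languages and wrong for others. The name `toy_linguistic_key` and the word list are invented for this example.

```python
import unicodedata


def toy_linguistic_key(s):
    """Crude approximation: ignore accents and case, tie-break on the original."""
    decomposed = unicodedata.normalize("NFD", s)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    return (base.casefold(), s)


words = ["zebra", "Äpfel", "apple", "Zoo"]

print(sorted(words))                          # code point order
# ['Zoo', 'apple', 'zebra', 'Äpfel']
print(sorted(words, key=toy_linguistic_key))  # closer to what a reader expects
# ['Äpfel', 'apple', 'zebra', 'Zoo']
```

Even this toy shows the problem: in code point order, uppercase sorts before lowercase and accented letters land at the very end, which is not where a user scanning a list would look for them.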
But then there are the databases. With indexes that can become corrupt and cause huge compatibility headaches for users.
Anyone who needs to persist sorted data, like people building indexes for a database, needs to rely on the consistency of their index. When sorting behavior changes, the index can become corrupt, which means that anyone searching on it won’t find the strings that they expect to find. So all the goodness of linguistically appropriate sorting is great for applications that present strings to users for ordered display, but these updates can wreak havoc for databases that aren’t designed correctly. As a result, many people who persist sorted data avoid linguistic sorting entirely, instead using ordinal sorting to ensure consistency of behavior across a given character set. In Windows Server 2003, we introduced GetNLSVersion(), which allows developers to query the particular version of sorting behavior that is present on the OS to decide whether or not to reindex. If the version number has changed, then some sorting behavior has changed since the last version, which means that databases probably want to reindex to avoid the risk of corruption.
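The failure mode and the fix can be sketched in a few lines of Python. Everything here is hypothetical scaffolding: `PersistedIndex`, `build_index`, `contains`, and `ensure_current` are invented names, the two sort keys stand in for "old" and "new" OS sorting behavior, and the integer versions stand in for whatever a real sort-version query such as GetNLSVersion() would report. The sketch does not call any Windows API.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class PersistedIndex:
    sort_version: int                 # version of the rules used to build it
    keys: list = field(default_factory=list)


def build_index(strings, sort_key, sort_version):
    return PersistedIndex(sort_version, sorted(strings, key=sort_key))


def contains(index, s, sort_key):
    """Binary search, which is only valid if the index matches sort_key."""
    probe = [sort_key(k) for k in index.keys]
    i = bisect.bisect_left(probe, sort_key(s))
    return i < len(index.keys) and index.keys[i] == s


def ensure_current(index, strings, sort_key, current_version):
    """Rebuild when the stored sort version no longer matches the current one."""
    if index.sort_version != current_version:
        return build_index(strings, sort_key, current_version), True
    return index, False


def old_rules(s):      # hypothetical version 1: plain code point order
    return s


def new_rules(s):      # hypothetical version 2: case-insensitive order
    return s.casefold()


data = ["Zoo", "apple"]
index = build_index(data, old_rules, sort_version=1)

# Sorting behavior changed underneath a stale index: lookups silently miss.
print(contains(index, "apple", new_rules))   # False, although "apple" is stored

index, rebuilt = ensure_current(index, data, new_rules, current_version=2)
print(rebuilt, contains(index, "apple", new_rules))   # True True
```

The design choice worth noticing is that the version is stored with the index, so the staleness check is a cheap integer comparison rather than a full validation pass over the data.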
We hope that index builders are sorting ordinally, but what if they’re not? We think that many databases are using SQL or Jet or some other technology that shields them from having to think about the issue, but what if they aren’t? We believe that most databases go offline to reindex at major OS releases, but what if they don’t? We want people to check the NLS version to make informed decisions about reindexing, but what if people don’t know that they can do this? We know that many more people rely on us for linguistic sorting for display than for persisting ordered indexes, but every release we end up stuck between a rock and a hard place, where we need to make updates to serve the needs of international customers, but we need to be cautious about updates… to serve the needs of international customers.
And that about sums up my week. :)
Now that you've been able to go check out our PDC talk (online here), I'm excited to introduce you and this blog to ELS, our next big set of globalization support for developers! From our PDC Developer Guide:
Extended Linguistic Services (ELS) is a new feature in Windows 7 that allows developers to use the same small set of APIs to leverage a variety of advanced linguistic functionality. By using ELS APIs in Windows 7, developers can auto-detect the language of any piece of Unicode text and use that information to help make smarter user experience choices for customers around the world. ELS also offers built in transliteration support that converts text from one writing system to another. For instance, developers can now auto-convert text between Simplified and Traditional Chinese to help people communicate with each other across linguistic boundaries. By using ELS APIs, developers will be able to use existing ELS services as well as pick up new services in the future without learning new code.
You can also read more about it on our Go Global website here: http://msdn.microsoft.com/en-us/goglobal/dd156834.aspx
The white paper takes you through the basics of what we're building and some places where we think you might find it helpful, and also offers some code samples that those of you who are planning to play with the Windows 7 beta bits will be able to try out. Some of the first ELS functionality that's coming your way includes:
- Transliteration functionality that will let you convert automatically between Simplified and Traditional Chinese, allowing you to bring your Chinese content to an even greater number of customers without any localization costs on your part
- Language detection support that handles every language with a supported Windows locale (100+!)
- A simple API set that you will be able to learn once and use to call every service that ELS exposes, now or in the future
The first pieces developed by the ELS team are currently online for customers using the Chinese convertor over at the Windows Live Translator site, so if you want a taste of some of the support that ELS APIs will make available in Windows 7, give it a whirl.
I'm pretty excited about it! If you check it out, I'd love to know what you think. The functionality we're adding now is just the beginning of several services that we have in our roadmap. In the coming weeks, I'll be posting more specifically about our API set and services, so stay tuned.
If you're at the PDC this week, be sure to check out Yaniv Feinberg and Erik Fortune's talk, where they will introduce you to many of our new globalization features for developers in Windows 7. After their talk tomorrow (1:45pm-3pm Pacific Time), I'll be following up back here with a link to a white paper we've written to describe many of the new features, complete with code samples that you'll be able to start playing with as soon as you have Windows 7 bits.
Hello world!
It's been a while, but I'm coming out of hibernation to mention a couple of things --
We've been working super hard on a bunch of new things and we're getting excited to show you some of what we've been doing at the PDC this October. Expect a pick-up in the number of blog posts related to some of the ideas that we're currently playing with, as well as some guidance on how to use some of the features that you're probably discovering in Vista. It's been a while since I've blogged, but that means I have a bunch of topics stored up, so get ready to ask questions, give feedback or suggestions, and add me back to or remove me from your RSS feed. :)
I also wanted to mention for all aspiring PMs out there that we're hiring. If you consider yourself an engineering-oriented business thinker, or a business-oriented engineer, we want to hear from you. We have a few new projects and we're trying to find smart and passionate people to add to our team. If you're that person at the party saying "If I only worked for Microsoft, I'd fix Windows by doing [x]," this is your chance to make [x] a reality. We work on core Windows and .NET features, so empathy with software developers is a must. We're thinking about all new ways to enrich the set of developer features across a variety of programming models, and we need creative, smart engineers to help us make that a reality.
You can take a look at a couple of job listings here and here. You can also email me with any questions that you have. I'd love to hear from you.
More to follow soon!
And now, some shameless advertising!
Our team is looking for a good tester to work on our next set of projects. We have some pretty cool stuff in the works for the next generation of globalization support in Windows and .NET, including more comprehensive linguistic functionality and new language coverage as well as additional support in the traditional NLS areas of locales, sorting, and calendars. We work on the core globalization APIs in Windows, .NET, and Silverlight, so there is a lot of opportunity to help shape the developer experience across platforms. If you have developer empathy and an interest in international software, please read the full job description and let us know if you're interested!
For various reasons I've been doing statistical analyses on word frequencies in a number of languages lately, and I found a pattern that differentiates different dialects of Spanish that struck me as curious enough that I asked Gerardo, my test lead from Mexico, about it. I was working with Spanish corpora from Spain and from Mexico, and I noticed statistically significant patterns that differentiate the usage of vosotros and Ustedes across the two varieties of Spanish. The frequency of vosotros in the Spanish corpus was actually about two hundred times greater than it was in the Mexican corpus, for corpora of the same genre (i.e. corpus type difference does not account for greater/lesser presence of second person writing).
I had heard anecdotally that people don't use vosotros in Mexico, and come to think of it I noticed this when we traveled there last spring, kinda -- I mean, pretty much everyone talked to us formally anyway, so I didn't have a lot of occasion to pay attention. Still, I had no idea that the generalization of Usted/es to informal speech patterns was so far along in Mexico. So now I'm wondering whether there is anything pragmatic associated with the use of vosotros for Mexican speakers of Spanish, apart from the obvious not-from-around-these-parts marking. I don't have time to follow up on the corpus analysis on this point in detail, but I'd be interested in reading about it if someone else has.
The last several months have been packed with planning our next set of features in the international space -- talking to customers about what they need and what they want, looking at the teams we have and seeing where our skills are up to the challenges ahead and where we need improvement, and looking at the initial Vista feedback as it begins to roll in. We're seeing some good stuff and some bad stuff. A few themes:
People want to use strings: We've been pleased at the initial adoption of our name-based NLS APIs both in Windows and in .NET. Most existing components still rely on LCID-based support, but new components are increasingly starting to rely on names, and even several existing components are starting to plan for a migration to strings. This is crucial as the rate of custom locale adoption will only increase over the next few years, and in order to benefit from the full range of locales that a user has installed, it will be vital that applications ask for international support using strings rather than LCIDs. Legacy LCIDs aren't going away any time soon, but there will come a time when new locales are identified with strings and strings only.
But strings have to be reliable: As we have seen more than a few times in the last several years, sometimes there are changes in the world situation that require changes to software. A geography that is united as one jurisdiction today may vote to become distinct nations in a month or a year or ten years. We have to make it easy for developers to call us for reliable globalization support for customers even when borders shift and the customers' country of residence is updated. We have done some things well here and other things not so well, so figuring out how to handle these situations at an infrastructure level is one of our key focuses.
Locale Builder is great, so where can I get it? We've gotten a lot of great questions about Locale Builder and we're seeing a number of customers interested in custom locale technology. We're seeing questions from customers across the spectrum: from OEMs, from application developers who need to deploy custom locales with their applications, and from users whose languages and regions do not have existing locale support. The initial beta of LB expired, but you can get the current version on the Microsoft Locale Builder download page. Please let us know what you think.
The globalization support that people have today is not enough. People are using what Globalization Services provides today, but all the time we're getting requests for more. In particular, people want more core linguistic support in the OS. This includes updates to our sorting functionality -- we have to follow through on the versioning story that we've begun building -- and also requirements for linguistic functionality that developers cannot get from the operating system at all today.
We have an ambitious list of stuff that we need to work on, but it's also been exciting to hear from customers what is and is not working today. All our work for improvements to existing functionality, or the introduction of new functionality, has come directly from the feedback that we've received. We want to know how you're using our stuff and ways that the experience could be better. We're also interested in whether we have any gaps in existing support that are blocking you in some way.
Just in time for the Vista street date, the Vista International Support Portal is live at www.microsoft.com/globaldev/vista/vistahome.mspx! It contains lots of information about the locales, language packs, LIPs, fonts, and keyboards that we ship, and it makes a handy reference to figure out which languages and regions have support in Vista. You can also visit the portal to find free downloads of several of the internationalization tools that we provide, including the recently updated Microsoft Keyboard Layout Creator, the LocaleBuilder, the Microsoft Transliteration Utility, and the RC1 version of the Microsoft Phonetic Input Tool. Check it out and let us know what you think!
The moment I published my ranty LSA blog part one I was given something else to chew on. I left my laptop and went to Mark Liberman's plenary talk on the future of linguistics. Now when Mark gives a talk on the future of linguistics, chances are that it's worth paying attention. And this did not disappoint.
Following a poker analogy that I'm 99% sure came from Alan Kors, Mark argued that linguistics as a discipline started off with pretty good cards. Once upon a time, every person of even a little schooling received serious education in grammar, rhetoric, and logic. Educated people studied foreign languages, both living and extinct. Reputable scholarly types from Descartes to Jefferson understood the importance of developing analytical skills to be applied to the study of language.
Fast forward to today. He gave a number of cute illustrations of the current status of linguistics as contrasted with the prevalence of other disciplines that started off with far less collateral, psychology being his main example. Looking at indicators such as undergraduate enrollments, LSA vs. APA membership numbers, and the presence of appropriately schlocky supermarket checkout magazines, he demonstrated what everyone knows. Psychology is hot! Linguistics, not so much. Indeed, the discourse in many domains relevant to the study or use of language, including technology, education, cognitive psychology, game theory, evolutionary anthropology, law, neuroscience, and many others, is today dominated by people with little or no background in linguistics.
So then he started to talk about why.
One of his main arguments was that disciplines such as psychology, history, and political science have adopted a sort of big tent approach, where anyone working on even vaguely connected work is invited to the discussion. Contrast this with the approach taken by many linguists, where you're only welcome at the table if you come from a narrow range of subfields in a narrower still range of departments in a narrower still range of universities. Do anything else, and you're not really doing linguistics.
Hey, this is starting to sound a little like the flip side of the first LSA blog post that I made!
Mark encouraged the audience to address this by reaching out to colleagues in other disciplines. If you have someone at your university interested in language and the law, or how reading might be more effectively taught, or rhetoric and composition, invite them to join your graduate group. Make a big tent, and include people, and you're not only going to raise awareness of your discipline, but you might also find your work enriched by the perspective of someone from somewhere else. This is really the right approach in so many respects, and it's not generally implemented very well.
But I started thinking about this, and there's really a whole lot further that we could take it. The thing is that until career paths outside academia are not only tolerated but also supported by faculty in the finest departments, progress will be limited. With regard to technology and education in particular, there is much valuable work being done in government and the private sector. Granted, most of it today is not traditional academic research as such. However, one reason that this is so is that people in these branches do not typically bring a rigorous research background to their positions. And one reason that this is so is that people with rigorous research backgrounds are strongly discouraged from diverging from the prescribed path that most of their advisors know best. But the fact is, if the field of linguistics wants to make inroads into other disciplines, the first thing that needs to happen is that it needs to be okay for promising students to choose research and professional paths other than ongoing direct contribution to theoretical linguistics. This is true both for students within academia and for students who are considering leaving it for a time.
So clearly I'm at Microsoft, so I have a personal story here, but it is a story that I find echoed by many of my colleagues both in academia and outside it. I know that when I was in school, even in a place that does this remarkably better than most places, I felt strongly discouraged from pursuing interests outside theoretical and computational linguistics narrowly defined, even though I had great interest in and passion for related areas within human-computer interaction and education. I was discouraged in part because of the track I was seen to be on, heading for a career in academic research in the syntax-pragmatics boundary. If I had found it easier to pursue some of these related interests as a student, I feel that my work would have improved – and you never know, I might not have been so quick to leave academia. Even once I had decided go into industry, I got real pushback from many people. Did I really want to go into software? How could that possibly be interesting? I believe the questions were genuine, and came from real consternation. I have heard very similar stories from colleagues who displayed interest in education, anthropology, even math or technology. Even at a department that is more genuinely interdisciplinary in its approach than maybe any other.
But in order for the discipline of linguistics to really succeed, you need classically trained linguists to do many things. You need them to develop interests in and do productive work in technology; in curriculum design from elementary schools on up through college; in government language policy positions; in the related fields of computer science, psychology, math, anthropology, and so on. The list really goes on. Although there are some common areas of overlap, it is difficult to delimit the set of possible crossover opportunities, because exactly what you should want are linguists who can take their training and find new ones. Even in publishing schlock magazines!
I think that encouraging students to pursue alternative paths where appropriate has a bunch of less noble but still useful effects as well. Students can get jobs, for one thing. And graduates well placed can help other students get fellowships or internships or jobs of their own. It gives linguists a great opportunity to shape what’s going on in related fields, in governments, and in the private sector. It can also result in formalized collaborations that in turn can result in greater visibility and funding for linguistics departments. Finally, I think the first department that really gets this right -- with alumni mentoring programs, with seminars by people doing related work, with appropriate curriculum flexibility, with government/industry partnerships for internships, etc. -- is going to see a major improvement in its attrition rates, because students will see more possibilities for using their skills and engaging their interests rather than fewer. That's very important to maintaining the kind of positivity that is really required to get through the slog that is a PhD.
One of the really cool things about working in international software is that not everyone is like me. My background adds a certain kind of value here. I also have colleagues who have expertise in the more humanities-oriented side of linguistics, in classic software development, in graphic design, in marketing and business, in usability, in finance, in math, and in many, many other areas. They all end up being useful in the work that we do. For me, that's what makes this a great industry to work in, and it is the most striking change between my professional life a few years ago and my professional life today.
On that note, I return to your regularly scheduled discussions on international software. :)
Many years ago Bill Labov wrote an article called How I got into linguistics, and what I got out of it. It describes his path from college, through failed writing jobs and a stint as an inkmaker, up to his eventual discovery of the field of linguistics. Bill writes: "From what I learned about the small, new field of linguistics, it seemed to be an exciting one, consisting mostly of young people with strong opinions who spent most of their time arguing with each other."
Hey, sounds pretty fun!
He goes on to describe his surprise that most of the data linguists of the time seemed to be drawing from was more or less made up out of their heads, with linguists relying on their own intuitions about the grammaticality of the constructions they wanted to study. He explains that he thought he could do better, and that he wanted to study language as it is actually used by real people, in order not only to arrive at accurate description but also to use that description as the basis for explanatory theories about why Language works the way it does. The article is pretty modest in light of the fact that anyone who's ever taken a course or two in linguistics knows that Bill went on to do this and then some; he really created modern sociolinguistics and arguably empirical linguistics even more broadly construed.
Well, I wasn't the one who found Bill's article. My mother was. It was included in a recruiting booklet that Penn sent around to prospective undergrads. At the time I was planning on something like a double major in math and English or French, not wanting to choose between math and language. My mom read his article and thought, hey, this sounds like something Kieran might like. She passed it along to me, and I found it hugely inspirational. I spent the rest of my senior year of high school reading everything about linguistics that I was able to get my hands on, with the result that I not only went to Penn and majored in linguistics, but ten years later I found myself graduating from the same institution with a PhD.
Even now I feel very grateful for my background in linguistics, and most especially for the kind of background in linguistics that the Penn department seems pretty uniquely able to provide. Equal measures formal/theoretical and empirical, Penn has world-class programs in both sociolinguistics and computational linguistics, with the result that even students focusing in other areas can't help but graduate with a pretty solid grounding in these approaches to the field. After college I seriously considered changing disciplines and pursuing the philosophy of science. But it seemed to me that one couldn't really do meaningful work in the philosophy of science without having first done meaningful work in some science, and linguistics felt like a perfect candidate: I was interested in it, for one thing, and as a science it seemed quite immature, still working out its methodologies. Sure, there was Penn, but there was also MIT, where most scholars relied primarily on their own intuitions or those of their colleagues to gather the set of data on which theories would be based. It is a common joke among empirical linguists that you can look at the grammaticality marking of sentences in certain authors' papers over many years; invariably sentences that start out ungrammatical lose their asterisk over decades of consideration.
As history now shows, I ended up disappointed. Maybe it's because I had some background in math.
I found myself increasingly frustrated with work that conflated models of linguistic behavior with explanations of linguistic behavior. I found that most papers by very smart people in theoretical syntax explained word order by stipulating invisible features that words needed to satisfy by moving around. As a model, this is not crazy. It allows for clear taxonification of the set of possible factors that drive word order and other typological facts across languages. But most of the papers I heard or read included -- consciously or otherwise -- some statement of why, and why always boiled down to explanations that were really more like models. It further tickled my internal snark to note that the more Greek letters that a model included, the more likely it was that the author of the model would insist on its explanatory nature. So much for the philosophy of science.
So okay, halfway through graduate school I decided that pure theoretical syntax was not for me, although I still had enough common sense that I wasn't prepared to ditch it altogether. After all, theoretical syntacticians get jobs. I started focusing more intently on applying formal work to areas of discourse or pragmatics, working closely with Ellen Prince, who had made a very successful career in the same vein. I look to work like Ellen's or Bill's best, or some of the stuff Mark Liberman has done, and it's really amazing. It looks at real language used by real people, it stays away from hopelessly abstract formulations, and it starts off with problems that are scoped narrowly enough to be studyable but broadly enough to be interesting. But the more I looked into the literature, the more I talked to people at conferences, the more I discovered that work of that quality was really in the minority, even looking just at the set of work that purports to be empirical (which is itself the minority within the field of linguistics).
And so I finished my dissertation and I left. One reason I left, and one reason I ended up at Microsoft, is that I wanted to work in a context where I could not only look at real problems -- because let's face it, I could have done that in academia to the extent that I could have selected my areas of focus -- but where it would be expected that I do so. A really great dissertation is read by maybe 100 people, 200 if we're being generous. Maybe it is of use to 10 or 15% of them. And that's a really, really great one. So I wanted a context where my background in linguistics would be relevant, where I could still think about problems in that space, but where Greek letters wouldn't count as evidence of scientific theory. It turns out that in order to achieve all those things I had to come to a place that isn't really about science at all, but is instead about technology.
Which brings me up to the minute. I have been spending the last few days attending the annual Linguistic Society of America meeting. It is my first fully academic conference since I started at Microsoft, and I expected all kinds of ambivalent and confusing feelings. What I have ended up with isn't confusing at all. I've gone to a bunch of talks and walked away with the same frustration with which I left theoretical linguistics two and a half years ago. Many people here are very, very smart. But the papers on the whole are unfalsifiable (back to the ol' model masquerading as explanation) or designed without reasonable experimental controls (in the case of work that is ostensibly empirical). There are exceptions. But there aren't many. It is not different in spirit from what I was reacting to a few years ago, but my recent stint as an engineer has me feeling it even more acutely. I met someone here from Google whose background is very similar to mine, down to academic subfield and current role at Google. He seems to share my bewilderment. But beyond that, it really is just me (or us?). The people with whom I previously identified professionally, people whose intelligence and diligence I really respect, do not share my reactions in the least. Rather, they react as a bloc, with the good judgment expected of high-calibre linguists. It is I who do not fit in.
The other strange thing about being here is that this is the main arena for first-round job interviews for students looking for assistant professor positions. I interviewed here a few years ago. I have spent a lot of time here talking to people about Microsoft, especially to students who seem to share some of the same perspective that I felt when I decided to go into industry rather than academia. Some of my friends from school have recently landed in tenure-track roles at good institutions, but many more of them are here interviewing. It is odd being on the other side. I don't envy them the anxiety, the self-esteem roller coaster, that saturates the entire process.
It has been very nice to see and catch up with people. I've made some good contacts here who may eventually become good hires for Microsoft. But the best part of this experience is the reaffirmation of the choice I made a few years ago to get out of academic linguistics -- and the clarity with which I can see the ways in which my approaches to the problem space have changed.
If you work for Microsoft or maybe in the software industry more generally and you hear someone described as being "good at process," you'll understand that what this really ends up meaning most of the time is more like "good at applying procedure," or less charitably and in not very secret code, "not good at anything else." As we in Windows are going to be spending a lot of time over the next few months working on our core engineering systems, trying to become "good at process," it's important to understand what this should really mean.
This good at process thing has been on my mind for a while. Because to me, being good at process is almost completely orthogonal to being good at applying procedures. Being good at process should mean being able to look at existing processes critically to make them faster, more efficient, more effective, or otherwise better. It may mean introducing processes that bring order where there was previously mess, or it may mean removing processes that bring red tape without any real upside. Someone who is good at process can take a bird's eye view of how a system is working to identify patterns and see which mechanisms could be changed in order to make the system better.
This is often the opposite of what it means to be good at applying procedures. Someone who is good at applying procedures is most often detail-oriented, which is not only not the same as being good at process, it may even be antithetic to it. A good release PM is good at applying procedures; a great release PM is able to apply procedures but unwilling to where those procedures don't make sense on a larger scale. A great release PM is good at process. And since there's no guarantee that people who are genuinely good at process are also detail-oriented enough to follow through on established procedures, it turns out that it's pretty tough to find a really great release PM (but easy to find so-so release PMs).
The MQ exercise that Windows is currently undertaking -- a quality milestone in which we're supposed to think critically about our processes and build tools that will bolster our core engineering infrastructure to make us more efficient in the future -- really highlights the difference between being good at process Microsoft-stereotype-style and being really good at process. Understanding this distinction seems like a crucial foundation for making any real improvements to process.
What's a LOCALE_SDECIMAL among friends? In Verizon Doesn't Know Difference Between Dollars And Cents, one customer learns the hard way. If you haven't listened to this transcript yet, it's worth it.