Microsoft Web N-Gram

Bringing you web-scale language model data. Web N-Gram is a joint project between Microsoft Bing and Microsoft Research.

Tagged Content List
  • Blog Post: The dirty secret about large-vocabulary hashes

    The first step in the n-gram probability lookup process is to convert the input into tokens, as discussed in an earlier post. The second step is to convert each token into a numerical ID. The data structure used here is a kind of hash. You might be asking, why not a trie? The simple answer is size...
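
    A minimal sketch of those two steps, using an ordinary Python dict in place of the service's compact large-vocabulary hash; the vocabulary and tokenizer below are illustrative stand-ins, not the actual Bing data:

    # Illustrative only: a plain dict stands in for the specialized hash.
    vocabulary = {"the": 0, "quick": 1, "brown": 2, "fox": 3}
    UNK_ID = -1  # ID returned for out-of-vocabulary tokens

    def tokenize(text):
        # Step 1: turn the input into tokens (real tokenization is more
        # involved; see "The messy business of tokenization").
        return text.lower().split()

    def tokens_to_ids(tokens):
        # Step 2: map each token to its numerical ID via the hash.
        return [vocabulary.get(tok, UNK_ID) for tok in tokens]

    print(tokens_to_ids(tokenize("The quick red fox")))  # [0, 1, -1, 3]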
  • Blog Post: Well, do ya, P(<UNK>)?

    Today we'll do a refresher on unigrams and the role of P(<UNK>). As you recall, for unigrams, P(x) is simply the probability of encountering x irrespective of the words preceding it. A naïve (and logical) way to compute this would be to simply take the number of times x is observed and divide...
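
    As a toy illustration of the naïve count-and-divide estimate, and of why some probability mass has to be reserved for <UNK>, consider the sketch below; the scheme shown for <UNK> is purely illustrative, not the one the service uses:

    from collections import Counter

    corpus = "the cat sat on the mat the cat".split()
    counts = Counter(corpus)
    total = len(corpus)

    # Naive maximum-likelihood unigram estimate: count(x) / total.
    p = {word: c / total for word, c in counts.items()}
    print(p["the"])  # 3/8 = 0.375

    # Toy <UNK> handling: treat words seen only once as unknown, so that
    # P(<UNK>) is nonzero for words never seen in the corpus.
    unk_mass = sum(c for c in counts.values() if c == 1) / total
    print(unk_mass)  # "sat", "on", "mat" each appear once -> 3/8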
  • Blog Post: Perf tips for using the N-Gram service with WCF

    The support in Visual Studio for WCF makes writing a SOAP/XML application for the Web N-Gram service a pretty straightforward process. If you're new to this, the Quick Start guide might be helpful to you. There are a few tweaks you can make, however, to improve the performance of your application if...
  • Blog Post: The messy business of tokenization

    So what exactly is a word, in the context of our N-Gram service? The devil, it is said, is in the details. As noted in earlier blog entries, our data comes straight from Bing. All tokens are case-folded and, with a few exceptions, all punctuation is stripped. This means words like I'm or didn't are...
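
    A rough approximation of that normalization in Python; the service's actual rules (including the handling of contractions) are more nuanced than this sketch:

    import re

    def normalize(text):
        # Case-fold, then keep only runs of letters, digits and apostrophes
        # (a simplification of the service's real punctuation rules).
        return re.findall(r"[a-z0-9']+", text.lower())

    print(normalize("I'm sure he didn't say THAT!"))
    # ["i'm", 'sure', 'he', "didn't", 'say', 'that']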
  • Blog Post: Wordbreakingisacinchwithdata

    For the task of word-breaking, many different approaches exist. Today we're writing about a purely data-driven approach, and it's actually quite straightforward: all we do is consider every character boundary as a potential word boundary and compare the relative joint probabilities, with...
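
    A minimal sketch of that idea: try every possible split and keep the segmentation with the best total (joint) log-probability. The toy unigram scores below stand in for a real language-model lookup:

    from functools import lru_cache

    # Toy unigram log-probabilities standing in for the real model.
    LOGP = {'word': -2.0, 'breaking': -3.0, 'is': -1.5, 'a': -1.0,
            'cinch': -5.0, 'with': -2.0, 'data': -2.5}
    OOV = -20.0  # penalty for spans the toy model has never seen

    @lru_cache(maxsize=None)
    def best_split(text):
        # Treat every character boundary as a potential word boundary and
        # keep the split with the highest total log-probability.
        if not text:
            return 0.0, []
        best = (float('-inf'), None)
        for i in range(1, len(text) + 1):
            head, tail = text[:i], text[i:]
            tail_score, tail_words = best_split(tail)
            score = LOGP.get(head, OOV) + tail_score
            if score > best[0]:
                best = (score, [head] + tail_words)
        return best

    print(best_split('wordbreakingisacinchwithdata')[1])
    # ['word', 'breaking', 'is', 'a', 'cinch', 'with', 'data']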
  • Blog Post: The fluid language of the Web

    We prepared, as we had for the earlier dataset, the top-100K words list for the body stream for Apr10. You can download it here. We decided to take a closer look to see how the top-100K lists changed between Jun09 and Apr10. Our findings are interesting: The union of the word set...
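
    The comparison itself is simple set arithmetic; a sketch along these lines (the file names are placeholders for the downloadable lists, assumed to contain one word per line) reproduces the union and overlap figures:

    def load_words(path):
        # Assumes one word per line, possibly followed by a count.
        with open(path, encoding='utf-8') as f:
            return {line.split()[0] for line in f if line.strip()}

    jun09 = load_words('top100k-jun09.txt')  # placeholder file name
    apr10 = load_words('top100k-apr10.txt')  # placeholder file name

    print(len(jun09 | apr10))   # size of the union of the two word sets
    print(len(jun09 & apr10))   # words common to both snapshots
    print(len(apr10 - jun09))   # words new in Apr10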
  • Blog Post: Using the MicrosoftNgram Python Module

    Over the past few posts I've shown some samples of the MicrosoftNgram Python module. Writing documentation is not something engineers I know enjoy doing; in fact, the only available documentation right now is through help(MicrosoftNgram). Here's an attempt to rectify the situation. To get started...
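
    Until fuller documentation exists, the module's own docstrings are the quickest reference; a minimal way to explore it interactively:

    import MicrosoftNgram

    # The docstrings are currently the primary documentation.
    help(MicrosoftNgram)

    # Listing the public names is another quick way to see what's available.
    print([name for name in dir(MicrosoftNgram) if not name.startswith('_')])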
  • Blog Post: UPDATE: Serving New Models

    Today's post was delayed slightly, but we have good news: announcing the availability of additional language model datasets. As always, the easiest way to get a list is to simply navigate to http://web-ngram.research.microsoft.com/rest/lookup.svc. Shown below are the new items, in URN form: ...
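
    Fetching that URL programmatically gives you the current list of models; this sketch assumes the endpoint answers with a plain-text list of URNs, one per line:

    from urllib.request import urlopen

    url = 'http://web-ngram.research.microsoft.com/rest/lookup.svc'
    with urlopen(url) as response:
        body = response.read().decode('utf-8')

    for line in body.splitlines():
        if line.strip():
            print(line.strip())  # one model URN per line (assumed format)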
  • Blog Post: Generative-Mode API

    In previous posts I wrote about how the Web N-Gram service answers the question: what is the probability of word w in the context c? This is useful, but sometimes you want to know: what are some words {w} that could follow the context c? This is where the Generative-Mode APIs come into play. Examples...
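
    Conceptually, generative mode ranks candidate words by their conditional probability given the context; a toy local version, with a made-up mini model standing in for the service, looks like this:

    # Toy conditional model: P(word | context), standing in for the service.
    MODEL = {
        'new york': {'city': 0.30, 'times': 0.20, 'yankees': 0.05},
        'microsoft web': {'n-gram': 0.15, 'service': 0.10},
    }

    def generate(context, top_n=3):
        # Return the most likely words {w} that could follow the context c.
        candidates = MODEL.get(context, {})
        return sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

    print(generate('new york'))
    # [('city', 0.3), ('times', 0.2), ('yankees', 0.05)]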
  • Blog Post: Language Modeling 102

    In last week's post, we covered the basics of conditional probabilities in language modeling. Let's now have another quick math lesson on joint probabilities. A joint probability is useful when you're interested in the probability of an entire sequence of words. Here I can borrow an equation from...
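
    The standard decomposition here is the chain rule: P(w1, ..., wn) = P(w1) * P(w2 | w1) * ... * P(wn | w1 ... wn-1). In log space the product becomes a sum, which is how you would compute it in practice; the lookup function below is a stand-in for a real model:

    def conditional_logp(word, context):
        # Stand-in for a language-model lookup of log P(word | context).
        toy = {('the', ()): -1.0,
               ('quick', ('the',)): -2.0,
               ('fox', ('the', 'quick')): -3.0}
        return toy.get((word, context), -10.0)

    def joint_logp(words):
        # Chain rule: log P(w1..wn) = sum_i log P(wi | w1..wi-1).
        return sum(conditional_logp(w, tuple(words[:i])) for i, w in enumerate(words))

    print(joint_logp(['the', 'quick', 'fox']))  # -1.0 + -2.0 + -3.0 = -6.0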
  • Blog Post: What can data do for you?

    Let's think of the scale of different lexicons, in terms of order of magnitude:
      1,000 - the day-to-day vocabulary of someone in the United States
      10,000 - the number of different words in Moby Dick
      100,000 - the number of words understood by a state-of-the-art speech recognition engine
      ...