Microsoft Web N-Gram

Bringing you web-scale language model data. Web N-Gram is a joint project between Microsoft Bing and Microsoft Research.

Tagged Content List
  • Blog Post: The dirty secret about large-vocabulary hashes

    The first step in the n-gram probability lookup process is to convert the input into tokens, as discussed in an earlier post. The second step is to convert each token into a numerical ID. The data structure used here is a kind of hash. You might be asking, why not a trie? The simple answer is size... (sketch below)
  • Blog Post: Well, do ya, P(<UNK>)?

    Today we'll do a refresher on unigrams and the role of P(<UNK>). As you recall, for unigrams, P(x) is simply the probability of encountering x irrespective of the words preceding it. A naïve (and logical) way to compute this would be to simply take the number of times x is observed and divide... (sketch below)
  • Blog Post: The messy business of tokenization

    So what exactly is a word, in the context of our N-Gram service? The devil, it is said, is in the details. As noted in earlier blog entries, our data comes straight from Bing. All tokens are case-folded and, with a few exceptions, all punctuation is stripped. This means words like I'm or didn't are... (sketch below)
  • Blog Post: Wordbreakingisacinchwithdata

    For the task of word-breaking, many different approaches exist. Today we're writing about a purely data-driven approach, and it's actually quite straightforward: all we do is consider every character boundary as a potential word boundary and compare the relative joint probabilities, with... (sketch below)
  • Blog Post: Generative-Mode API

    In previous posts I wrote about how the Web N-Gram service answers the question: what is the probability of word w in the context c? This is useful, but sometimes you want to know: what are some words {w} that could follow the context c? This is where the Generative-Mode APIs come into play. Examples... (sketch below)
  • Blog Post: Language Modeling 102

    In last week's post, we covered the basics of conditional probabilities in language modeling. Let's now have another quick math lesson on joint probabilities. A joint probability is useful when you're interested in the probability of an entire sequence of words. Here I can borrow an equation from... (sketch below)
  • Blog Post: Language Modeling 101

    The Microsoft Web N-Gram service, at its core, is a data service that returns conditional probabilities of words given a context. But what exactly does that mean? Let me explain. Conditional probability is usually expressed with a vertical bar: P(w|c). In plain English you would say: what is... (sketch below)
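
To make the token-to-ID step from "The dirty secret about large-vocabulary hashes" concrete, here is a minimal sketch that uses an ordinary Python dict as the hash. The toy vocabulary, the <UNK> fallback, and the function name are all illustrative; the service's actual large-vocabulary structure is exactly what that post goes on to discuss.

    # Minimal sketch of the token -> ID lookup, using a plain Python dict as
    # the "hash". The vocabulary and IDs below are made up for illustration.
    vocab = {"<UNK>": 0, "hello": 1, "world": 2}

    def token_ids(tokens):
        """Map each token to a numerical ID, falling back to the <UNK> ID."""
        return [vocab.get(tok, vocab["<UNK>"]) for tok in tokens]

    print(token_ids(["hello", "world", "zyzzyva"]))  # -> [1, 2, 0]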
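
A sketch of the naïve unigram estimate described in "Well, do ya, P(<UNK>)?": count how often x occurs and divide by the total number of tokens, with a fixed placeholder standing in for P(<UNK>). The toy corpus and the P_UNK constant are invented; how the service actually assigns mass to unseen words is the subject of that post.

    from collections import Counter

    # Naive unigram estimate: P(x) = count(x) / total tokens.
    corpus = "the cat sat on the mat".split()
    counts = Counter(corpus)
    total = len(corpus)
    P_UNK = 1e-6  # hypothetical probability reserved for unseen words

    def p_unigram(word):
        return counts[word] / total if word in counts else P_UNK

    print(p_unigram("the"))    # 2/6
    print(p_unigram("zebra"))  # unseen, so the placeholder P(<UNK>)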
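
A rough sketch of the tokenization rules quoted in "The messy business of tokenization": case-fold everything and strip most punctuation. How the service really handles forms like I'm and didn't is precisely the detail that post elaborates on; keeping the word-internal apostrophe here is just one plausible choice for the sketch.

    import re

    def toy_tokenize(text):
        """Case-fold and strip most punctuation, keeping word-internal apostrophes."""
        text = text.lower()  # case-folding
        return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text)

    print(toy_tokenize("I'm sure she didn't say 'Hello, World!'"))
    # -> ["i'm", 'sure', 'she', "didn't", 'say', 'hello', 'world']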
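
The word-breaking post describes treating every character boundary as a potential word boundary and comparing joint probabilities. The brute-force sketch below does exactly that over a short string, approximating the joint probability of each candidate segmentation with a product of made-up unigram values rather than the service's real model.

    from itertools import product

    # Toy unigram probabilities; all values here are invented for illustration.
    p = {"word": 0.01, "breaking": 0.005, "is": 0.05, "a": 0.06,
         "cinch": 0.0001, "with": 0.04, "data": 0.01}
    P_UNK = 1e-12  # tiny probability for substrings not in the toy vocabulary

    def joint_prob(words):
        prob = 1.0
        for w in words:
            prob *= p.get(w, P_UNK)
        return prob

    def word_break(s):
        """Try every character boundary as a word boundary; keep the most
        probable segmentation."""
        best, best_prob = None, 0.0
        for cuts in product([False, True], repeat=len(s) - 1):
            words, start = [], 0
            for i, cut in enumerate(cuts, start=1):
                if cut:
                    words.append(s[start:i])
                    start = i
            words.append(s[start:])
            prob = joint_prob(words)
            if prob > best_prob:
                best, best_prob = words, prob
        return best

    print(word_break("acinchwithdata"))  # -> ['a', 'cinch', 'with', 'data']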
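
The Generative-Mode post asks for candidate words {w} that could follow a context c, rather than the probability of one particular w. The sketch below illustrates the idea with a toy bigram table built from a few words of text; none of the names here come from the actual API, which works against web-scale data and longer contexts.

    from collections import Counter, defaultdict

    # Toy bigram table: for each word, count which words follow it.
    corpus = "the cat sat on the mat the cat ate the fish".split()
    follow = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        follow[prev][nxt] += 1

    def candidates(context_word, k=3):
        """Return up to k words w ranked by P(w | context_word)."""
        counts = follow[context_word]
        total = sum(counts.values())
        return [(w, c / total) for w, c in counts.most_common(k)]

    print(candidates("the"))  # [('cat', 0.5), ('mat', 0.25), ('fish', 0.25)]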
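
For the joint probabilities in "Language Modeling 102", the standard tool is the chain rule, P(w1 ... wn) = P(w1) · P(w2 | w1) · ... · P(wn | w1 ... wn-1), which is presumably the equation the post goes on to borrow. The sketch below just performs that multiplication; the toy conditional-probability table stands in for calls to the service.

    # Chain rule: the joint probability of a sequence is the product of each
    # word's conditional probability given the words before it.
    def cond_prob(word, context):
        # Hypothetical values; real numbers would come from the N-Gram service.
        toy = {("the", ()): 0.05,
               ("quick", ("the",)): 0.01,
               ("fox", ("the", "quick")): 0.002}
        return toy.get((word, context), 1e-6)

    def joint_prob(words):
        prob = 1.0
        for i, w in enumerate(words):
            prob *= cond_prob(w, tuple(words[:i]))
        return prob

    print(joint_prob(["the", "quick", "fox"]))  # 0.05 * 0.01 * 0.002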
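
And for the P(w|c) notation in "Language Modeling 101", a simple count-based reading with a single-word context: the fraction of times c is followed by w. The toy corpus is illustrative only; the service's probabilities obviously do not come from a seven-word sample, and its contexts can be longer than one word.

    # Count-based sketch of P(w | c) for a one-word context c.
    corpus = "new york is in new york state".split()
    bigrams = list(zip(corpus, corpus[1:]))

    def p_cond(w, c):
        context_count = sum(1 for prev, _ in bigrams if prev == c)
        pair_count = sum(1 for prev, nxt in bigrams if prev == c and nxt == w)
        return pair_count / context_count if context_count else 0.0

    print(p_cond("york", "new"))    # both "new" bigrams are "new york" -> 1.0
    print(p_cond("jersey", "new"))  # never follows "new" here -> 0.0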