Microsoft Web N-Gram

Bringing you web-scale language model data. Web N-Gram is a joint project between Microsoft Bing and Microsoft Research.

Browse by Tags

  • Blog Post: Well, do ya, P(<UNK>)?

    Today we'll do a refresher on unigrams and the role of P(<UNK>). As you recall, for unigrams, P(x) is simply the probability of encountering x irrespective of the words preceding it. A naïve (and logical) way to compute this would be to simply take the number of times x is observed and divide... (A sketch of this count-and-divide estimate appears after this list.)
  • Blog Post: The fluid language of the Web

    We prepared, as we had for the earlier dataset, the top-100K words list for the body stream for Apr10. You can download it here. We decided to take a closer look at the dataset to see how the top-100K lists changed between Jun09 and Apr10. Our findings are interesting: The union of the word set...
  • Blog Post: Who doesn't like models?

    If there ever was an overloaded term in Computer Science, it's models. For instance, my colleagues in the eXtreme Computing Group have this terrific ambition to model the entire world! What we're talking about here is much simpler: it is a representation of a particular corpus. One of the key insights...
  • Blog Post: UPDATE: Serving New Models

    Today's post was delayed slightly, but we have good news: announcing the availability of additional language model datasets. As always, the easiest way to get a list is simply to navigate to http://web-ngram.research.microsoft.com/rest/lookup.svc. Shown below are the new items, in URN form: ...
  • Blog Post: Language Modeling 101

    The Microsoft Web N-Gram service, at its core, is a data service that returns conditional probabilities of words given a context. But what exactly does that mean? Let me explain. Conditional probability is usually expressed with a vertical bar: P(w|c). In plain English you would say: what is... (A worked bigram example of P(w|c) appears after this list.)
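
To make the count-and-divide idea from the P(<UNK>) post concrete, here is a minimal sketch. It is an illustration only, not the service's own pipeline: the toy corpus, the hand-picked vocabulary, and the unigram_probs helper are assumptions made up for this example, with out-of-vocabulary words folded into <UNK> before counting.

    from collections import Counter

    def unigram_probs(tokens, vocab):
        # Naive maximum-likelihood estimate: count each word and divide by the
        # total token count, folding out-of-vocabulary words into <UNK>.
        counts = Counter(t if t in vocab else "<UNK>" for t in tokens)
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    # Toy corpus: "grape" is outside the vocabulary, so its mass goes to <UNK>.
    corpus = "the cat sat on the mat the grape".split()
    vocab = {"the", "cat", "sat", "on", "mat"}
    probs = unigram_probs(corpus, vocab)
    print(probs["the"])    # 3 of 8 tokens -> 0.375
    print(probs["<UNK>"])  # 1 of 8 tokens -> 0.125

Web-scale counts obviously call for smoothing and a far larger vocabulary; the point here is only the shape of the computation.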
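The P(w|c) notation from the Language Modeling 101 post can likewise be illustrated with plain bigram relative frequencies. Again, this is a sketch under assumptions: the toy corpus and the unsmoothed count(c, w) / count(c) estimate are purely illustrative and are not how the Web N-Gram service computes its numbers.

    from collections import Counter

    def conditional_prob(tokens, word, context):
        # Estimate P(word | context) as count(context, word) / count(context),
        # i.e. the relative frequency of the bigram among all bigrams that
        # start with the context word.
        bigrams = Counter(zip(tokens, tokens[1:]))
        context_count = sum(c for (first, _), c in bigrams.items() if first == context)
        if context_count == 0:
            return 0.0
        return bigrams[(context, word)] / context_count

    corpus = "the cat sat on the mat and the cat slept".split()
    print(conditional_prob(corpus, "cat", "the"))  # 2 of the 3 "the ..." bigrams -> ~0.667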