Microsoft Web N-Gram

Bringing you web-scale language model data. Web N-Gram is a joint project between Microsoft Bing and Microsoft Research.

Browse by Tags

Tagged Content List
  • Blog Post: The dirty secret about large-vocabulary hashes

    The first step in the n-gram probability lookup process is to convert the input into tokens, as discussed in an earlier post. The second step is to convert each token into a numerical ID. The data structure used here is a kind of hash. You might be asking, why not a trie? The simple answer is size... (a token-to-ID sketch follows this list)
  • Blog Post: Well, do ya, P(<UNK>)?

    Today we'll do a refresher on unigrams and the role of P(<UNK>). As you recall, for unigrams, P(x) is simply the probability of encountering x irrespective of the words preceding it. A naïve (and logical) way to compute this would be to simply take the number of times x is observed and divide... (a P(<UNK>) sketch follows this list)
  • Blog Post: The fluid language of the Web

    We prepared, as we had for the earlier dataset, the top-100K word list for the body stream for Apr10. You can download it here. We decided to take a closer look at the dataset to see how the top-100K lists changed between Jun09 and Apr10. Our findings are interesting: the union of the word set... (a word-list comparison sketch follows this list)
  • Blog Post: What can data do for you?

    Let's think of the scale of different lexicons, in terms of order of magnitude:
      • 1,000 - the day-to-day vocabulary of someone in the United States
      • 10,000 - the number of different words in Moby Dick
      • 100,000 - the number of words understood by a state-of-the-art speech recognition engine
    ...
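
A minimal Python sketch of the token-to-ID step described in "The dirty secret about large-vocabulary hashes" above. The actual service uses a compact, special-purpose hash rather than a plain dictionary; the VOCAB, UNK_ID, and token_ids names here are hypothetical and only illustrate the idea.

    # Toy vocabulary standing in for the real large-vocabulary hash.
    # A plain dict illustrates the lookup; the production structure is
    # far more space-efficient (which is why a trie loses on size).
    VOCAB = {"the": 0, "quick": 1, "brown": 2, "fox": 3}
    UNK_ID = len(VOCAB)  # ID reserved for out-of-vocabulary tokens

    def token_ids(tokens):
        """Map each token to its numerical ID, falling back to UNK_ID."""
        return [VOCAB.get(t, UNK_ID) for t in tokens]

    print(token_ids(["the", "quick", "red", "fox"]))  # [0, 1, 4, 3]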
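A sketch of the naïve unigram estimate from "Well, do ya, P(<UNK>)?": P(x) = count(x) / N, with words below a count cutoff pooled into <UNK>, so P(<UNK>) absorbs their combined probability mass. The corpus and CUTOFF value are made-up illustration inputs.

    from collections import Counter

    corpus = "a a a b b c d".split()  # toy corpus, 7 tokens
    CUTOFF = 2  # hypothetical minimum count to keep a word

    counts = Counter(corpus)
    total = sum(counts.values())

    probs = {"<UNK>": 0.0}
    for word, count in counts.items():
        if count >= CUTOFF:
            probs[word] = count / total
        else:
            probs["<UNK>"] += count / total  # rare words pooled into <UNK>

    print(probs)  # {'<UNK>': 0.2857..., 'a': 0.4285..., 'b': 0.2857...}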
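A sketch of the kind of comparison described in "The fluid language of the Web": loading two top-100K word lists and measuring how the vocabulary shifted between snapshots. The file names are hypothetical, and each file is assumed to hold one word per line.

    def load_words(path):
        """Read a word list into a set, one word per line."""
        with open(path, encoding="utf-8") as f:
            return {line.strip() for line in f if line.strip()}

    jun09 = load_words("top100k_jun09.txt")  # hypothetical file names
    apr10 = load_words("top100k_apr10.txt")

    print("union:        ", len(jun09 | apr10))
    print("shared:       ", len(jun09 & apr10))
    print("only in Jun09:", len(jun09 - apr10))
    print("only in Apr10:", len(apr10 - jun09))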