Microsoft Web N-Gram

Bringing you web-scale language model data. Web N-Gram is a joint project between Microsoft Bing and Microsoft Research.

  • Blog Post: The dirty secret about large-vocabulary hashes

    The first step in the n-gram probability lookup process is to convert the input into tokens, as discussed in an earlier post. The second step is to convert each token into a numerical ID. The data structure used here is a kind of hash. You might be asking, why not a trie? The simple answer is size... A minimal sketch of this two-step lookup appears after this list.
  • Blog Post: The messy business of tokenization

    So what exactly is a word, in the context of our N-Gram service? The devil, it is said, is in the details. As noted in earlier blog entries, our data comes straight from Bing. All tokens are case-folded and, with a few exceptions, all punctuation is stripped. This means words like I'm or didn't are... A sketch of these tokenization rules also appears after this list.
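
The two-step lookup described in the first post is easy to illustrate. Below is a minimal Python sketch in which a plain dict stands in for the post's specialized large-vocabulary hash; the function names, the reserved unknown-token ID, and the whitespace tokenizer are illustrative assumptions, not the service's actual API.

```python
# Minimal sketch of the two-step lookup: tokenize the input, then map
# each token to a numeric ID. A plain dict stands in for the
# large-vocabulary hash discussed in the post.

UNK_ID = 0  # assumed ID reserved for out-of-vocabulary tokens


def build_vocab(tokens):
    """Assign consecutive IDs to unique tokens, reserving 0 for unknowns."""
    vocab = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab) + 1)
    return vocab


def ids_for(text, vocab):
    """Step 1: split into tokens; step 2: convert each token to its ID."""
    return [vocab.get(tok, UNK_ID) for tok in text.split()]


vocab = build_vocab(["the", "quick", "brown", "fox"])
print(ids_for("the brown dog", vocab))  # [1, 3, 0] -- 'dog' is out of vocabulary
```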
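The tokenization rules outlined in the second post can be sketched in a few lines as well: case-fold everything and strip punctuation, except that word-internal apostrophes are kept so contractions like I'm and didn't survive as single tokens. The regex below is an assumption about how that exception might work; the excerpt does not give the service's exact rules.

```python
import re

# Case-fold, then keep runs of letters/digits, allowing word-internal
# apostrophes so contractions remain single tokens. Illustrative only.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:'[a-z]+)*")


def tokenize(text):
    return TOKEN_RE.findall(text.lower())


print(tokenize("I'm sure it didn't rain, did it?"))
# ["i'm", 'sure', 'it', "didn't", 'rain', 'did', 'it']
```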