Microsoft Web N-Gram
Bringing you web-scale language model data. Web N-Gram is a joint project between Microsoft Bing and Microsoft Research.
MSDN Blogs > Microsoft Web N-Gram > All Tags > probability
Tagged Content List
Blog Post:
The dirty secret about large-vocabulary hashes
Chris Thrasher
The first step in the n-gram probability lookup process is to convert the input into tokens, as discussed in an earlier post. The second step is to convert each token into numerical IDs. The data structure used here is a kind of a hash. You might be asking, why not a trie? The simple answer is size...
on
27 Dec 2010
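The token-to-ID step described above can be sketched as follows. This is a minimal illustration using a plain Python dict; the post makes clear the actual service uses a specialized large-vocabulary hash structure chosen over a trie for its smaller size, and the function names here are hypothetical.

```python
def build_vocab(tokens):
    """Assign a stable numerical ID to each distinct token."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def token_ids(vocab, tokens, unk_id=-1):
    """Map tokens to IDs, falling back to unk_id for unseen tokens."""
    return [vocab.get(tok, unk_id) for tok in tokens]

vocab = build_vocab(["the", "quick", "brown", "fox", "the"])
print(token_ids(vocab, ["the", "fox", "jumps"]))  # [0, 3, -1]
```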
Blog Post:
Well, do ya, P(<UNK>)?
Chris Thrasher
Today we'll do a refresher on unigrams and the role of the P(<UNK>). As you recall, for unigrams, P(x) is simply the probability of encountering x irrespective of words preceding it. A naïve (and logical) way to compute this would be to simply take the number of times x is observed and divide...
on
13 Dec 2010
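The naïve unigram estimate described in the excerpt can be sketched like this: count each word, divide by the total, and pool all out-of-vocabulary occurrences into P(<UNK>). This is only the baseline computation the post starts from, not the service's actual smoothed estimate.

```python
from collections import Counter

def unigram_probs(corpus_tokens, vocab):
    """Naive unigram estimate: P(x) = count(x) / total, with all
    out-of-vocabulary mass pooled into P(<UNK>)."""
    counts = Counter(tok if tok in vocab else "<UNK>" for tok in corpus_tokens)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

probs = unigram_probs(["a", "b", "a", "zzz"], vocab={"a", "b"})
print(probs["a"])       # 0.5
print(probs["<UNK>"])   # 0.25
```

Note that the probabilities sum to 1 by construction, since every token, known or not, lands in exactly one bucket.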
Blog Post:
The messy business of tokenization
Chris Thrasher
So what exactly is a word, in the context of our N-Gram service? The devil, it is said, is in the details. As noted in earlier blog entries, our data comes straight from Bing. All tokens are case-folded and, with a few exceptions, all punctuation is stripped. This means words like I'm or didn't are...
on
29 Nov 2010
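The case-folding and punctuation-stripping described above can be sketched as follows. The excerpt is truncated before listing the actual exceptions, so treating the apostrophe as the one kept character (so contractions like "didn't" survive as single tokens) is an assumption for illustration only.

```python
import re

def tokenize(text, keep="'"):
    """Case-fold, then replace punctuation with spaces, except for
    characters in `keep` (the apostrophe exception is assumed)."""
    text = text.lower()
    text = re.sub(r"[^\w\s%s]" % re.escape(keep), " ", text)
    return text.split()

print(tokenize("Didn't I say: Hello, World?"))
# ['didn't', 'i', 'say', 'hello', 'world']
```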
Blog Post:
Wordbreakingisacinchwithdata
Chris Thrasher
For the task of word-breaking, many different approaches exist. Today we're writing about a purely data-driven approach, and it's actually quite straightforward: all we do is consider every character boundary as a potential word boundary, and compare the relative joint probabilities, with...
on
22 Nov 2010
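The boundary-comparison idea above can be sketched with dynamic programming: consider every character boundary as a candidate split and keep the segmentation whose joint probability is highest. As a simplification this sketch approximates the joint probability as a product of unigram probabilities in log space, with a tiny floor standing in for unseen words; the real service scores candidates with its full n-gram models.

```python
import math

def word_break(s, unigram_p, floor=1e-12):
    """Pick the segmentation of s with the highest joint probability,
    approximated as a product of unigram probabilities (log-space)."""
    best = {0: (0.0, [])}  # end position -> (log prob, words so far)
    for end in range(1, len(s) + 1):
        for start in range(end):
            if start not in best:
                continue
            word = s[start:end]
            lp = math.log(unigram_p.get(word, floor))
            cand = (best[start][0] + lp, best[start][1] + [word])
            if end not in best or cand[0] > best[end][0]:
                best[end] = cand
    return best[len(s)][1]

p = {"word": 0.02, "breaking": 0.01, "is": 0.05, "a": 0.06, "cinch": 0.001}
print(word_break("wordbreakingisacinch", p))
# ['word', 'breaking', 'is', 'a', 'cinch']
```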
Blog Post:
Generative-Mode API
Chris Thrasher
In previous posts I wrote how the Web N-Gram service answers the question: what is the probability of word w in the context c? This is useful, but sometimes you want to know: what are some words {w} that could follow the context c? This is where the Generative-Mode APIs come into play. Examples...
on
18 Oct 2010
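The question the generative mode answers, which words {w} could follow context c, can be illustrated with a toy local model: count bigrams, then rank continuations of a context by P(w | c). This is a sketch of the idea only; the actual service exposes this through its REST/SOAP APIs over Bing-scale data, not through functions like these.

```python
from collections import Counter, defaultdict

def train_bigrams(tokens):
    """Count, for each word, the words observed to follow it."""
    follow = defaultdict(Counter)
    for prev, cur in zip(tokens, tokens[1:]):
        follow[prev][cur] += 1
    return follow

def generate(follow, context, k=3):
    """Generative-mode lookup: the k most probable continuations of
    `context`, ranked by P(w | context)."""
    counts = follow[context]
    total = sum(counts.values())
    return [(w, c / total) for w, c in counts.most_common(k)]

follow = train_bigrams("the cat sat on the mat near the cat".split())
print(generate(follow, "the"))  # cat with p=2/3, then mat with p=1/3
```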
Blog Post:
Language Modeling 102
Chris Thrasher
In last week's post, we covered the basics of conditional probabilities in language modeling. Let's now have another quick math lesson on joint probabilities. A joint probability is useful when you're interested in the probability of an entire sequence of words. Here I can borrow an equation from...
on
11 Oct 2010
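The joint probability the post introduces follows from the chain rule: P(w1..wn) is the product of each word's conditional probability given the words before it. A minimal sketch, with made-up probabilities for illustration:

```python
def joint_probability(words, cond_p):
    """Chain rule: P(w1..wn) = product of P(wi | w1..wi-1).
    `cond_p` maps a (context, word) pair to a conditional probability."""
    p = 1.0
    for i, w in enumerate(words):
        p *= cond_p[(tuple(words[:i]), w)]
    return p

# Hypothetical conditional probabilities, not real service data.
cond_p = {
    ((), "new"): 0.01,
    (("new",), "york"): 0.5,
    (("new", "york"), "city"): 0.3,
}
print(joint_probability(["new", "york", "city"], cond_p))  # ~0.0015
```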
Blog Post:
Language Modeling 101
Chris Thrasher
The Microsoft Web N-Gram service, at its core, is a data service that returns conditional probabilities of words given a context. But what exactly does that mean? Let me explain. Conditional probability is usually expressed with a vertical bar: P(w|c). In plain English you would say: what is...
on
4 Oct 2010
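The conditional probability P(w|c) the post defines can be estimated from counts as count(c followed by w) / count(c). A minimal sketch over a toy corpus; the service computes this over Bing-scale data rather than a word list:

```python
def conditional_probability(corpus, context, word):
    """Estimate P(word | context) as
    count(context followed by word) / count(context)."""
    n = len(context)
    ctx_count = 0
    seq_count = 0
    for i in range(len(corpus) - n):  # every position with a following word
        if corpus[i:i + n] == context:
            ctx_count += 1
            if corpus[i + n] == word:
                seq_count += 1
    return seq_count / ctx_count if ctx_count else 0.0

corpus = "the cat sat on the mat the cat ran".split()
print(conditional_probability(corpus, ["the", "cat"], "sat"))  # 0.5
```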
Page 1 of 1 (7 items)