The Microsoft Web N-Gram service, at its core, is a data service that returns conditional probabilities of words given a context. But what exactly does that mean? Let me explain.

Conditional probability is usually expressed with a vertical bar: P(w|c).  In plain English you would say: what is the probability of w given c?  In language modeling, w represents a word, and c represents the context, which is a fancy way of saying the sequence of words that come before w.
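
To make the notation concrete, here is a minimal sketch that estimates P(w|c) from raw counts in a tiny made-up corpus. The corpus, the function name `conditional_probability`, and the simple count-based estimate are all assumptions for illustration; this is not a description of how the Web N-Gram service computes its values.

```python
# A toy corpus, invented for illustration only.
corpus = "star wars is a movie about a star map and star wars fans".split()


def conditional_probability(word, context, tokens):
    """Estimate P(word | context) as count(context + word) / count(context).

    `context` is a tuple of the words that immediately precede `word`.
    """
    n = len(context)
    context_count = 0      # times the context is followed by any word
    followed_by_word = 0   # times the context is followed by `word`
    for i in range(len(tokens) - n):
        if tuple(tokens[i:i + n]) == context:
            context_count += 1
            if tokens[i + n] == word:
                followed_by_word += 1
    if context_count == 0:
        return 0.0  # context never seen; a real model would smooth this
    return followed_by_word / context_count


p_wars = conditional_probability("wars", ("star",), corpus)
p_map = conditional_probability("map", ("star",), corpus)
print(p_wars, p_map)
```

In this toy corpus 'star' is followed by 'wars' twice and by 'map' once, so the two estimates are 2/3 and 1/3. Passing an empty tuple as the context gives the plain frequency of the word.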

The number of words that the service will consider in a query is known as the order, which is the N in N-gram. The order is split into two parts: one word for the word (w) itself, and N-1 words for the context (c). For a 1-gram, or unigram, there is no context at all; the result is simply the probability of a given word amongst all words. Let's jump ahead and show you some real values using the Python library available here:

```python
>>> import MicrosoftNgram
>>> s = MicrosoftNgram.LookupService(model='urn:ngram:bing-body:jun09:1')
>>> s.GetConditionalProbability('chris')
-4.1125360000000004
>>> s.GetConditionalProbability('the')
-1.480988
```

For the moment, ignore the details about the instantiation of the LookupService object (i.e. s), and treat it as a black box that can tell you the unigram probability. The return values are negative because they are base-10 logarithms. Because a probability in linear space always lies between 0 and 1, its logarithm lies between negative infinity and 0. So the odds of seeing my first name are 1:(1/P), where 1/P = 1/10^(-4.112536), or approximately 1:13000. Contrast this with the odds of seeing the most common word in English on the web, 'the': 1:30.
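
The conversion from log space back to "one in N" is easy to check by hand. The sketch below uses only the two values returned above; the helper names `log10_to_probability` and `one_in` are made up for this example and are not part of the MicrosoftNgram library.

```python
def log10_to_probability(log_prob):
    """Convert a base-10 log probability back to a linear probability."""
    return 10 ** log_prob


def one_in(log_prob):
    """Return N such that the word turns up about once in every N words."""
    return 1 / log10_to_probability(log_prob)


# Values copied from the unigram queries above.
chris = -4.112536
the = -1.480988

print(round(one_in(chris)))  # close to 13000
print(round(one_in(the)))    # close to 30
```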

But language modeling is often more interesting at higher orders of N: bigrams, trigrams, and so on. Let's try some more examples:

```python
>>> s = MicrosoftNgram.LookupService(model='urn:ngram:bing-body:jun09:2')
>>> s.GetConditionalProbability('star wars')
-1.1905209999999999
>>> s.GetConditionalProbability('star map')
-3.7370559999999999
```

The key change from the earlier example is how s was instantiated, namely the '2' at the very end of the argument. This indicates that we're interested in bigrams. There'll be more on models in the near future. Anyway, the queries show that given the context of 'star', the word 'wars' is more than 100 times as likely as 'map' (roughly 350 times, since the two log values differ by about 2.55). And now for a trigram example:

```python
>>> s = MicrosoftNgram.LookupService(model='urn:ngram:bing-body:jun09:3')
>>> s.GetConditionalProbability('i can has')
-2.4931369999999999
>>> s.GetConditionalProbability('i can have')
-2.277034
```

Perhaps the prevalence of LOLspeak on the Web should not be underestimated: after 'i can', the word 'have' is only about 1.6 times as likely as 'has'.
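
Comparing two results from the same context comes down to subtracting the log values and raising 10 to that difference. Here is a small sketch of that arithmetic using the numbers returned above; `likelihood_ratio` is a hypothetical helper written for this post, not something the library provides.

```python
def likelihood_ratio(log_prob_a, log_prob_b):
    """How many times more likely A is than B, given base-10 log probabilities."""
    # Subtracting logs is the same as dividing the underlying probabilities.
    return 10 ** (log_prob_a - log_prob_b)


# Log probabilities copied from the bigram and trigram queries above.
star_wars, star_map = -1.190521, -3.737056
i_can_has, i_can_have = -2.493137, -2.277034

wars_vs_map = likelihood_ratio(star_wars, star_map)
have_vs_has = likelihood_ratio(i_can_have, i_can_has)

print(wars_vs_map)  # a few hundred
print(have_vs_has)  # less than 2
```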