Data mining the Bible
The brains behind all the spell-checking, hyphenation, and other text-related work that happens inside Microsoft Word and other products come from a little DLL done in my group. The technology behind it comes from the field of statistical language processing: instead of a whole bunch of rules that say how to spell or do a word break, we run algorithms over zillions of text samples and use what they learn to predict the correct result as a probability.
Well, that sort of text analysis is important, of course, but some of the people in our group have been playing with this as the basis for a workstation-type tool that lets you do more. I decided to play with the tool here on my vacation, and as an example I pulled the Project Gutenberg free download of the King James Version of the Bible. The nice thing about the Bible is that it’s extremely well-studied, so you can usually find somebody who already knows the “correct” answer on questions of what’s in the text.
Anyway, here are the most common phrases I found:
- ye shall (709 occurrences)
- children of israel (557)
- saith the lord (550)
- came to pass (447)
- unto thee (445)
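
If you want to try something similar at home, here is a rough sketch of phrase counting in Python. This is not the tool from my group, just a plain n-gram counter; the function name `top_phrases` and the sample sentence are made up for illustration. A raw count like this over the whole Bible would likely be dominated by filler phrases such as "of the", so a real tool presumably filters or scores phrases in some smarter way.

```python
import re
from collections import Counter

def top_phrases(text, n=2, k=5):
    """Return the k most frequent n-word phrases in text (hypothetical helper)."""
    # Lowercase the text and keep only runs of letters and apostrophes as words.
    words = re.findall(r"[a-z']+", text.lower())
    # Slide a window of n consecutive words across the word list.
    grams = zip(*(words[i:] for i in range(n)))
    counts = Counter(" ".join(gram) for gram in grams)
    return counts.most_common(k)

# Made-up sample text; in practice you would read the Project Gutenberg file instead.
sample = "And it came to pass that ye shall hear; and it came to pass again."
print(top_phrases(sample, n=2, k=3))
print(top_phrases(sample, n=3, k=3))
```

Changing `n` switches between two-word phrases like "ye shall" and three-word phrases like "came to pass".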
I want to try doing a lot more, like comparing different translations or discovering categories of similar words. I can also do the analysis in any language, including the original Greek or Hebrew, and compare my results across languages too. Can you think of some other applications for this?