Data mining the Bible

The brains behind all the spell-checking, hyphenation, and other text-related work that happens inside Microsoft Word and other products come from a little DLL built in my group. The technology behind all of it comes from the field of statistical language processing; basically, instead of a whole bunch of rules that say how to spell or do a word break, we run some algorithms over zillions of text samples and use the results to predict the correct answer as a probability.
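To make the "probability instead of rules" idea concrete, here is a toy sketch in Python. It is not how the DLL works; the function names (`train`, `probability`, `best_spelling`) and the tiny sample text are made up for illustration. It simply counts words in a sample and prefers whichever candidate spelling was seen most often.

```python
from collections import Counter
import re


def train(text):
    """Count how often each word appears in the sample text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def probability(word, counts):
    """Relative frequency of a word in the sample (0.0 if never seen)."""
    total = sum(counts.values())
    return counts[word] / total if total else 0.0


def best_spelling(candidates, counts):
    """Pick the candidate spelling that is most probable in the sample."""
    return max(candidates, key=lambda w: probability(w, counts))


# Hypothetical sample text; a real system would train on far more data.
counts = train("the lord is my shepherd the lord")
choice = best_spelling(["lrod", "lord"], counts)
```

With a large enough sample, the same counting approach needs no hand-written spelling rules at all, which is the appeal of the statistical route.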

Well, that sort of text analysis is important, of course, but some of the people in our group have been playing with it as the basis for a workstation-type tool that lets you do more. I decided to play with the tool while on vacation, and as an example I pulled the Project Gutenberg free download of the King James Version of the Bible. The nice thing about the Bible is that it’s extremely well-studied, so you can usually find somebody who already knows the “correct” answer on questions of what’s in the text.

Anyway, here are the most common phrases I found:

  • ye shall (709 occurrences)
  • children of israel (557)
  • saith the lord (550)
  • came to pass (447)
  • unto thee (445)
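A list like this can be approximated with a few lines of Python. This is only a rough sketch, not the tool I used: the `top_phrases` function and its parameters are hypothetical, it treats a "phrase" as any run of `n` consecutive words, and it ignores sentence boundaries, so its counts would not necessarily match the numbers above.

```python
from collections import Counter
import re


def top_phrases(text, n=2, k=5):
    """Return the k most frequent n-word phrases as (phrase, count) pairs."""
    # Lowercase and keep only alphabetic words (apostrophes allowed).
    words = re.findall(r"[a-z']+", text.lower())
    # Slide a window of n words across the text.
    grams = zip(*(words[i:] for i in range(n)))
    counts = Counter(" ".join(gram) for gram in grams)
    return counts.most_common(k)


# Small made-up sample; for the real thing, read the Project Gutenberg
# text file into a string and pass it in instead.
sample = "And it came to pass that ye shall go. Ye shall hear, saith the Lord."
most_common_pair = top_phrases(sample, n=2, k=1)
```

Calling it with `n=3` would pick up three-word phrases such as "children of israel".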

I want to try doing a lot more, like comparing different translations or discovering categories of similar words. I can also do the analysis in any language, including the original Greek or Hebrew, and I can compare my results with other languages too. Can you think of some other applications for this?

 

 

Published 25 August 05 07:28 by sprague

