Jeff Dean gave a keynote Building Large Scale Information Retrieval Systems at WSDM 2009. It's actually a presentation on how Google search engine evolves during the past 10 years. Here are my notes for this lecture.Part I - Overview of Search Engine Evolution: 1999 VS 2009Factors to Consider when Designing a Information Retrieval System:1. Corpus Size(# docs to be indexed)2. QPS(Query Per Second)3. Freshness/Update Rate4. Query Latency5. Complexity/Cost of Scoring/Retrieval AlgorithmParameter Change 1999 -> 2009:1. Corpus Size: 70M -> *B ~100X2. QPS: ~1000X3. Refresh: Months -> Minutes ~10000X4. Latency: <1s> <02.s style="font-weight: bold; font-style: italic;">~5X5. Machine Scale: ~1000XConsider 10x Growth when designing, Rewrite for 100x Growth!Part II - Evolution of Google Search Engine~1997 - Circa, Research Prototype- Simple Architecture and Focus on System Distributing/Partitioning- Term vs Doc based Partition: Doc based Win- Disk Based Index, DocID+Posting List with Position Attributes, Byte Aligned Encoding~1999 - Circa, Production- Introduced Cache-- hit rate is low 30~60% due to index refresh and long tail query-- very beneficial, reduce large disk i/o-- hot term first priority to cache, hot and costy request- Replica Index Data-- better performance-- better availabilitySome Summary in late 1990's:Crawler is simple batch system- start with very few urls- queue it when found new urls- stop when have enough docsIndex Serving using cheap machine- no failure handling- added record/chunk checksumIndex Update- once/month- wait traffic to low -> take replica offline -> do update -> start serving~2000 - Dealing with GrowthSituation:- doc size:50m -> 1000m- ~20% query traffic increase/month- Yahoo! dealSolution:add machines constantly- more index shards for larger index size- more index replica for bigger query capacityAnd improve software constantly- better disk scheduling- better index encoding~2001 - Adding In-Memory Index- enough machine memory: holding all index in mem- machine function: replica -> micro shard holding- balancer: cordinator- availability: replicate important docs~2004 - Adding Infrastructure- Generalize tree structured query flow- Generalize balancer concept- New index encoding: group varint encoding- GFS appear in production~2007 - Universal Search- Universal search: combine results from multiple vertical corpus- Realtime search: fast url finding -> crawling -> indexing -> serving cycle- Experiment supporting: have idea -> try it on real data offline and tune -> live experiment on small piece -> roll out and launchPart III - Future Trends- Cross Language Information Retrieval- ACL in large IR system with huge amount of user and dynamic requirement- Automatic Construction of Efficient IR system (one bin for realtime and regular web index with different parameter configuration)- Info extraction from semi-structured data[Reference]1. Challenges in Building Large Scale Information Retrieval Systems[PDF, Video]2. Notes by Another Blogger - http://www.searchenginecaffe.com/2009/02/jeffrey-dean-wsdm-keynote-building.html