Our focus in the database world has primarily been on retrieving information from a structured source, through a structured query.

Relational DBs --> structured data, structured query

XML DBs --> semi-structured data, structured query

search --> unstructured data, unstructured query

Actually, search --> data with hidden structure, query with possibly hidden structure

Recent efforts on IR considers data with no explicit structure:

  • Indexing and and retrieval of information-containing objects (text documents, web-pages, image, audio, ..)
    • Want only relevant objects
    • Efficiently from a large set
    • Lets focus mainly on text based documents in this post
  • Related areas
    • DB Management
    • Artificial intelligence
      • Knowledge representation and intelligent action
    • NLP and linguistics
      • linguistic analysis of natural language text
    • Machine learning
      • Systems that improve performance with experience

Traditional Preprocessing steps

  1. Strip irrelevant markup, tags, punctuation
  2. Tokenize
  3. Stem tokens to root words
    1. Porter stemmer: set or local rules to reduce word to its root.
    2. Words with distinct meanings may reduce to the same token
    3. Does not recognize all morphological derivations
    4. (arm, army) --> arm, (police, policy) --> polic
  4. Remove stop words: I, a, the, is..
  5. Detect common phrases
  6. Build Inverted index: keywords --> docs

Basic Retrieval Models

  1. Set-theoretic
    1. Boolean model
    2. Extended Boolean
  2. Algebraic
    1. Vector Space
    2. Extended Boolean
    3. Generalized VS
    4. Latent semantic, neural network, etc.
  3. Probabilistic
    1. Inference/Belief Network

 

Boolean Model

Document Representation: Set of terms

Queries: Exact matches on presence of terms using logical operators AND, NOT, OR,..

No partial matches or ranking

Reasonably efficient implementations - think set operations in relational DBs queries

Rigid: hard to control: # of docs retrieved, rankings, relevance feedback

Vector Space Model

Document Representation: Bag of terms (with frequency)

Queries: Set of desired terms with optional weights

 

t distinct terms remain after preprocessing step, forming a vector space:

Dimension = t = |vocabulary|

Weight assigned to terms in the document, query (i, j): w_i,j

d_j = (w_1j, w_2j, …, w_tj)

Graphically..

image

Term-document matrix:

image

Each real-valued weight corresponds to the degree of significance of that term in the document. 0=no significance I, we, the, ..) or that it does not exist.

Term frequency (tf): tf_ij  = frequency of i in j  divided by frequency of most common term in j

Document frequency (df): number of documents containing term

Inverse document frequency of i: log_2 ( total # documents / df_i)  Log dampens the effect relative to tf

tf-idf weighing commonly used to determine term weight: w_ij = tf_ij idf_i = tf_ij log2 (N/ df_i)

A term common across all documents is given less importance, and rare terms occurring in a document are given higher weight

Query is also treated as a document vector. Now the problem is to find similar documents

Need a way to compute degree of similarity between two vectors: Dot product!

image

Problems: Favors long documents. Does not measure how many terms don't match.

Compute similarity based on Cosine of the angle between two vectors.

image

With V vocabulary and D documents, complexity is O( |V| * |D|) !!

 

Key points:

  1. Query viewed as a document
  2. Measure proximity to it (numerous methods with different optimizations have been proposed)
  3. Measures score/ranking, not just logical matches

Overall problems with Vector Space:

  1. Missing semantic, word-sense type information
  2. Missing Syntactic information - phrase structure, word order, proximity
  3. Does not support boolean queries as with the boolean model. Requiring presence of absence of terms
  4. Scoring Phrases of words difficult
  5. Wild-cards

Extended Boolean Model (p-Norm model)

Falls in the arena of Fuzzy set theory

Similar to boolean retrieval model, except adjust strictness of logical operators with a p-value.

p = infinity --> Boolean model

p = 1 --> Vector space

Spectrum of operator strictness in between..

 

A good list of comparisons: (http://www.scils.rutgers.edu/~aspoerri/InfoCrystal/Ch_2.html)

image

image 

Upcoming posts on IR:

More on preprocessing, indexing, approximate matches, index compression

Optimizations for VS model

ML algorithms for IR: clustering and classification,EM, dimensionality reduction, probabilistic models, HMMs, BayesNet

Maybe get into NLP