Our focus in the database world has primarily been on retrieving information from a structured source, through a structured query.
Relational DBs --> structured data, structured query
XML DBs --> semi-structured data, structured query
search --> unstructured data, unstructured query
Actually, search --> data with hidden structure, query with possibly hidden structure
Recent efforts on IR considers data with no explicit structure:
Traditional Preprocessing steps
Basic Retrieval Models
Document Representation: Set of terms
Queries: Exact matches on presence of terms using logical operators AND, NOT, OR,..
No partial matches or ranking
Reasonably efficient implementations - think set operations in relational DBs queries
Rigid: hard to control: # of docs retrieved, rankings, relevance feedback
Document Representation: Bag of terms (with frequency)
Queries: Set of desired terms with optional weights
t distinct terms remain after preprocessing step, forming a vector space:
Dimension = t = |vocabulary|
Weight assigned to terms in the document, query (i, j): w_i,j
d_j = (w_1j, w_2j, …, w_tj)
Graphically..
Term-document matrix:
Each real-valued weight corresponds to the degree of significance of that term in the document. 0=no significance I, we, the, ..) or that it does not exist.
Term frequency (tf): tf_ij = frequency of i in j divided by frequency of most common term in j
Document frequency (df): number of documents containing term
Inverse document frequency of i: log_2 ( total # documents / df_i) Log dampens the effect relative to tf
tf-idf weighing commonly used to determine term weight: w_ij = tf_ij idf_i = tf_ij log2 (N/ df_i)
A term common across all documents is given less importance, and rare terms occurring in a document are given higher weight
Query is also treated as a document vector. Now the problem is to find similar documents
Need a way to compute degree of similarity between two vectors: Dot product!
Problems: Favors long documents. Does not measure how many terms don't match.
Compute similarity based on Cosine of the angle between two vectors.
With V vocabulary and D documents, complexity is O( |V| * |D|) !!
Key points:
Overall problems with Vector Space:
Falls in the arena of Fuzzy set theory
Similar to boolean retrieval model, except adjust strictness of logical operators with a p-value.
p = infinity --> Boolean model
p = 1 --> Vector space
Spectrum of operator strictness in between..
A good list of comparisons: (http://www.scils.rutgers.edu/~aspoerri/InfoCrystal/Ch_2.html)
Upcoming posts on IR:
More on preprocessing, indexing, approximate matches, index compression
Optimizations for VS model
ML algorithms for IR: clustering and classification,EM, dimensionality reduction, probabilistic models, HMMs, BayesNet
Maybe get into NLP