Welcome to MSDN Blogs Sign in | Join | Help

Locality Sensitive Hashing (LSH) and Min Hash

[Indyk-Motwani’98]

Many distance related questions (nearest neighbor, closest x, ..) can be answered more efficiently by using locality sensitive hashing, where the main idea is that similar objects hash to the same bucket.

 

LSH function:

Probability of collision higher for similar objects

Hash data using several LSH functions

At query time, get all objects in the bucket where query object hashes to. Then compute similarity and find the closest one.

Main point: LSH reduces the candidates for similarity comparison

 

Example of LSH function:

Set S of d integers.

h(S) = list of integers at k random indexes (r)

S = {12, 9, 13, 98 , 2, 4, 43, 21, 53, 99}

k=3, r={3, 5, 9}

h(s) = 98499

Similar sets have high probability of hashing to 98499.

Use multiple different LSH functions, get a union of results as candidates.

P(h(p)=h(q)) looks like this:

image

By controlling parameters, k and n the number of hash functions we can control steepness and position of the curve.

 

LSH with Min Hash

Simple trick: the hash functions of MinHash can serve as LSH functions.

Recall min hash signature matrix has Sets (column) vs hash functions (rows).

Partition k hash functions into bands b bands or r hash functions. (br=k)

b bands --> n LSH functions

r --> k

 

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace SetSimilarity
{
    class LSH<T>
    {
        int m_numHashFunctions;
        int m_numBands;
        Dictionary<int, HashSet<int>> m_lshBuckets = new Dictionary<int, HashSet<int>>();
        HashSet<T>[] m_sets;
        int[,] m_minHashMatrix;
        const int ROWSINBAND = 5;

        //first index is Set, second index contains hashValue (so index is hash function)
        public LSH(int[,] minHashMatrix, HashSet<T>[] sets)
        {
            m_numHashFunctions = minHashMatrix.Length/sets.Length;
            m_numBands=  m_numHashFunctions / ROWSINBAND;
            m_sets = sets;
            m_minHashMatrix = minHashMatrix;

            for (int s = 0; s < sets.Length; s++)
            {
                for (int b = 0; b < m_numBands; b++)
                {
                    //combine all 5 MH values and then hash get its hashcode
                    //need not be sum
                    int sum = 0;

                    for (int i = 0; i < ROWSINBAND; i++)
                    {
                        sum += minHashMatrix[s, b*ROWSINBAND+i];
                    }

                    if(m_lshBuckets.ContainsKey(sum))
                    {
                        m_lshBuckets[sum].Add(s);
                    }
                    else
                    {
                        var set = new HashSet<int>();
                        set.Add(s);
                        m_lshBuckets.Add(sum, set);
                    }
                }
            }
        }

        public int FindClosest(int setIndex, MinHash minHasher)
        {
            //First find potential "close" candidates
            HashSet<int> potentialSetIndexes = new HashSet<int>();

            for (int b = 0; b < m_numBands; b++)
            {
                //combine all 5 MH values and then hash get its hashcode
                int sum = 0;

                for (int i = 0; i < ROWSINBAND; i++)
                {
                    sum += m_minHashMatrix[setIndex, b * ROWSINBAND + i];
                }

                foreach (var i in m_lshBuckets[sum])
                {
                    potentialSetIndexes.Add(i);
                }
            }

            //From the candidates compute similarity using min-hash and find the index of the closet set
            int minIndex = -1;
            double similarityOfMinIndex = 0.0;
            foreach (int candidateIndex in potentialSetIndexes.Where(i => i != setIndex))
            {
                double similarity = minHasher.ComputeSimilarity(m_minHashMatrix, setIndex, candidateIndex);
                if (similarity > similarityOfMinIndex)
                {
                    similarityOfMinIndex = similarity;
                    minIndex = candidateIndex;
                }
            }

            return minIndex;
        }
    }
}

For some (large) number of sets with some (large) number of values, computing hash matrix is expensive, but after that finding the closes set takes 10000 times less than doing so by computing hamming distance using the Jaccard measure. Of course, one can compute n^2, distances beforehand, and at query-time do only n comparisons. Depending on the LSH design that may or may not take less time than fn fraction comparisons (because the constant overhead factor is larger).

The other advantage of MinHash+LSH is to reduce size of the signature - i.e size of data that is necessary to maintain (or communicate) in order to ask distance-related questions.

Set Similarity and Min Hash

Given two sets S1, S2, find similarity(S1, S2) - based not hamming distance (not Euclidean).

Jaccard Measure

View sets at a bit-array. Indexes representing each possible element, and 1/0 representing presence/absence of the element in the set.

Then Jaccard measure = eq= \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|}

What happens when: n element in each set from a possible universe u, s.t. n << u?

Ok, as long as just |S1 U S2| is not too large.

Implementation is straightforward (In C#)

class JaccardSimilarity 
{ 
    public static double Similarity<T>(HashSet<T> set1, HashSet<T> set2) 
    { 
        int intersectionCount = set1.Intersect(set2).Count(); 
        int unionCount = set1.Union(set2).Count(); 

        return (1.0 * intersectionCount) / unionCount; 
    } 
}

Intersection: O(nlogn) with sort-merge join, or O(n) with a big constant using hash join.

Union: O(n), again with some overhead.

Space is also O(n) at best.

 

Hash similarity

Find a hash function sig (signature) such that sim(S1, S2) is approximated by sim(sig(s1), sig(s2))

Idea 1: Sample P elements from the universe, and define sig(S1) as bits for P elements (i.e reduce the sets to a random sample of the universe).

But problems with sparsity (n << u)

Idea 2: So don't count entries that are absent in both sets. E.g:

Four combinations:

A = 1, 1 (Element present in both sets)
B = 0, 1
C = 1, 0
D = 0, 0

Count: A / A+B+C

 

  E1 E2 E3 E4 E5 E6
S1 1 1 0 0 0 0
S1 1 0 1 0 0 0

sim(S1, S2) = 1 / 3

 

Min Hash

Combine ideas 1 and 2.

Randomly permute element order (columns). Hash of S is element number of first element that is present in the bit-array.

Key insight:
P(h(s1) = h(s2)) = jaccardsimilarity(s1, s2)

Why? Both measures are A/A+B+C

This about this probabilistically..

 

Too fragile with a single permutation. Create min-hash signature (instead of a single integer) using N random permutations.

Then mhsig(S) = list of n indexes where element is present h(S)

Now, similarity(S1, S2) using min-hash is: Fraction of permutations where mhsig(S1) = mhsig(S2)

Expectation of similarity now is same as jaccard similarity measure.

 

We still can not implement this efficiently! Luckily there are some nifty tricks..

Instead of permuting rows n times, use n different hash functions and the let hash index order provide the random permutation.

But where is the min part in min-hash?

Foreach(set S)

  Foreach(hash function h)

       Find elements with that are present in S.

           Compute hash of the element index if element is present

           Keep the hash with minimum index value

This will give you the index of the first 1 from a random permutation

 

Computation cost of min hash is clearly higher, then why bother?

Answer: MinHash gives you a small signature of a (potentially large) set which can be used for similarity testing.

Summary: Signature generation time may be large, but querying time is smaller and takes less space.

A common use is in Nearest Neighbor type problem, where there are S Sets, and you wish to find the nearest one (or k nearest).

Compare all signature-pairs? Hint: use Locality Sensitive Hashing to eliminate pairs with dissimilar signatures -- Maybe my next post.

Here is the code..

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;

namespace SetSimilarity
{
    class MinHash
    {
        private const int m_numHashFunctions = 100; //Modify this parameter
        private delegate int Hash(int index);
        private Hash[] m_hashFunctions;

        public MinHash(int universeSize)
        {
            Debug.Assert(universeSize > 0);

            m_hashFunctions = new Hash[m_numHashFunctions];

            Random r = new Random(11);
            for (int i = 0; i < m_numHashFunctions; i++)
            {
                uint a = (uint)r.Next(universeSize);
                uint b = (uint)r.Next(universeSize);
                uint c = (uint)r.Next(universeSize);
                m_hashFunctions[i] = x => QHash((uint)x, a, b, c, (uint)universeSize);
            } 
        }

        public double Similarity<T>(HashSet<T> set1, HashSet<T> set2)
        {
            Debug.Assert(set1.Count > 0 && set2.Count > 0);

            int numSets = 2;
            Dictionary<T, bool[]> bitMap = BuildBitMap(set1, set2);
            
            int[,] minHashValues = GetMinHashSlots(numSets, m_numHashFunctions);

            ComputeMinHashForSet(set1, 0, minHashValues, bitMap);
            ComputeMinHashForSet(set2, 1, minHashValues, bitMap);

            return ComputeSimilarityFromSignatures(minHashValues, m_numHashFunctions);
        }

        private void ComputeMinHashForSet<T>(HashSet<T> set, short setIndex, int[,] minHashValues, Dictionary<T, bool[]> bitArray)
        {
            int index = 0;
            foreach (T element in bitArray.Keys)
            {
                for (int i = 0; i < m_numHashFunctions; i++)
                {
                    if(set.Contains(element))
                    {
                        int hindex = m_hashFunctions[i](index);

                        if (hindex < minHashValues[setIndex, i])
                        {
                            minHashValues[setIndex, i] = hindex;
                        }
                    }
                }
                index++;
            }
        }

        private static int[,] GetMinHashSlots(int numSets, int numHashFunctions)
        {
            int[,] minHashValues = new int[numSets, numHashFunctions];

            for (int i = 0; i < numSets; i++)
            {
                for (int j = 0; j < numHashFunctions; j++)
                {
                    minHashValues[i, j] = Int32.MaxValue;
                }
            }
            return minHashValues;
        }

        private static int QHash(uint x, uint a, uint b, uint c, uint bound)
        {
            //Modify the hash family as per the size of possible elements in a Set
            int hashValue = (int)((a * (x >> 4) + b * x + c) & 131071);
            return Math.Abs(hashValue);
        }

        private static Dictionary<T, bool[]> BuildBitMap<T>(HashSet<T> set1, HashSet<T> set2)
        {
            Dictionary<T, bool[]> bitArray = new Dictionary<T, bool[]>();
            foreach (T item in set1)
            {
                bitArray.Add(item, new bool[2] { true, false });
            }

            foreach (T item in set2)
            {
                bool[] value;
                if (bitArray.TryGetValue(item, out value))
                {
                    //item is present in set1
                    bitArray[item] = new bool[2] { true, true };
                }
                else
                {
                    //item is not present in set1
                    bitArray.Add(item, new bool[2] { false, true });
                }
            }
            return bitArray;
        }

        private static double ComputeSimilarityFromSignatures(int[,] minHashValues, int numHashFunctions)
        {
            int identicalMinHashes = 0;
            for (int i = 0; i < numHashFunctions; i++)
            {
                if (minHashValues[0, i] == minHashValues[1, i])
                {
                    identicalMinHashes++;
                }
            }
            return (1.0 * identicalMinHashes) / numHashFunctions;
        }

    }
}

My code is public, feel free to copy and optimize.

(Rest based on lecture notes of Rajeev Motwani, which in turn is based on notes from Jeff Ulman)

Information Retrieval &amp; Search - Basic IR Models

Our focus in the database world has primarily been on retrieving information from a structured source, through a structured query.

Relational DBs --> structured data, structured query

XML DBs --> semi-structured data, structured query

search --> unstructured data, unstructured query

Actually, search --> data with hidden structure, query with possibly hidden structure

Recent efforts on IR considers data with no explicit structure:

  • Indexing and and retrieval of information-containing objects (text documents, web-pages, image, audio, ..)
    • Want only relevant objects
    • Efficiently from a large set
    • Lets focus mainly on text based documents in this post
  • Related areas
    • DB Management
    • Artificial intelligence
      • Knowledge representation and intelligent action
    • NLP and linguistics
      • linguistic analysis of natural language text
    • Machine learning
      • Systems that improve performance with experience

Traditional Preprocessing steps

  1. Strip irrelevant markup, tags, punctuation
  2. Tokenize
  3. Stem tokens to root words
    1. Porter stemmer: set or local rules to reduce word to its root.
    2. Words with distinct meanings may reduce to the same token
    3. Does not recognize all morphological derivations
    4. (arm, army) --> arm, (police, policy) --> polic
  4. Remove stop words: I, a, the, is..
  5. Detect common phrases
  6. Build Inverted index: keywords --> docs

Basic Retrieval Models

  1. Set-theoretic
    1. Boolean model
    2. Extended Boolean
  2. Algebraic
    1. Vector Space
    2. Extended Boolean
    3. Generalized VS
    4. Latent semantic, neural network, etc.
  3. Probabilistic
    1. Inference/Belief Network

 

Boolean Model

Document Representation: Set of terms

Queries: Exact matches on presence of terms using logical operators AND, NOT, OR,..

No partial matches or ranking

Reasonably efficient implementations - think set operations in relational DBs queries

Rigid: hard to control: # of docs retrieved, rankings, relevance feedback

Vector Space Model

Document Representation: Bag of terms (with frequency)

Queries: Set of desired terms with optional weights

 

t distinct terms remain after preprocessing step, forming a vector space:

Dimension = t = |vocabulary|

Weight assigned to terms in the document, query (i, j): w_i,j

d_j = (w_1j, w_2j, …, w_tj)

Graphically..

image

Term-document matrix:

image

Each real-valued weight corresponds to the degree of significance of that term in the document. 0=no significance I, we, the, ..) or that it does not exist.

Term frequency (tf): tf_ij  = frequency of i in j  divided by frequency of most common term in j

Document frequency (df): number of documents containing term

Inverse document frequency of i: log_2 ( total # documents / df_i)  Log dampens the effect relative to tf

tf-idf weighing commonly used to determine term weight: w_ij = tf_ij idf_i = tf_ij log2 (N/ df_i)

A term common across all documents is given less importance, and rare terms occurring in a document are given higher weight

Query is also treated as a document vector. Now the problem is to find similar documents

Need a way to compute degree of similarity between two vectors: Dot product!

image

Problems: Favors long documents. Does not measure how many terms don't match.

Compute similarity based on Cosine of the angle between two vectors.

image

With V vocabulary and D documents, complexity is O( |V| * |D|) !!

 

Key points:

  1. Query viewed as a document
  2. Measure proximity to it (numerous methods with different optimizations have been proposed)
  3. Measures score/ranking, not just logical matches

Overall problems with Vector Space:

  1. Missing semantic, word-sense type information
  2. Missing Syntactic information - phrase structure, word order, proximity
  3. Does not support boolean queries as with the boolean model. Requiring presence of absence of terms
  4. Scoring Phrases of words difficult
  5. Wild-cards

Extended Boolean Model (p-Norm model)

Falls in the arena of Fuzzy set theory

Similar to boolean retrieval model, except adjust strictness of logical operators with a p-value.

p = infinity --> Boolean model

p = 1 --> Vector space

Spectrum of operator strictness in between..

 

A good list of comparisons: (http://www.scils.rutgers.edu/~aspoerri/InfoCrystal/Ch_2.html)

image

image 

Upcoming posts on IR:

More on preprocessing, indexing, approximate matches, index compression

Optimizations for VS model

ML algorithms for IR: clustering and classification,EM, dimensionality reduction, probabilistic models, HMMs, BayesNet

Maybe get into NLP

Posted by sahilthaker | 1 Comments

Information Theory (1) - The Science of Communication

IT is a beautiful sub-field of CS with applications across the gamut of scientific fields: coding theory and communications (under unreliable channels), cryptography, physics, biomedical engineering, computer graphics, machine learning, statistics, and even gambling and stock trading. It is truly a marvelous through process.

Framework:

  • "Bit" as the unit of communication. Arbitrary. Ok as long as all analysis is consistent.
  • Two parties: A and B. An event occurs at one end, A, who must "communicate" the outcome to B.
    • What are the minimum "units" required to communicate?
    • The Key insight from Shannon: it depends on apriori belief distribution about the event
    • That is, "uncertainty" with respect to the event. If an outcome is absolute or impossibly, why communicate at all?
  • For instance, a sequence of biased coin flips contains less information (less degree of uncertainty)  than a sequence of unbiased coin flips, so it should take fewer bits to transmit
  • Keep in mind that if the likelihood of n outcomes is not equally likely, we can use variable length encoding to use less bits to transmit more likely outcome.

Information Content / Degree of Uncertainty / "Entropy"

Entropy of R.V. X is denoted by H(X)

image

                =   E[log( 1/p(X) )]

Intuition behind the formula:

  • Lets say all n events are equally likely and P(x) = p for all x.
    • Then, -log(p) = log(1/p), which is log(n).
  • But since the actual distribution may not be uniform, take the weighted average of information content of each possible outcome - weighted by the probability of the outcome occurring.
    • Thus, more frequent values can be encoded by smaller length messages.

 

Basic Results to Remember

  • Joint Entropy:  eq=H(X,Y) = \sum_{y \in Y} \sum_{x \in X} p(x,y) \; log \: p(x,y)
    • if X, Y are independent events, joint entropy is H(X) + H(Y)
    • Why? Need to communicate both, Duh!
  • Conditional Entropy:

    eq=H(Y|X) = \sum_{x \in X}^{} p(x) H(Y|X=x)

                   eq== - \sum_{x \in X} \sum_{y \in Y} p(x, y) \; log\; p(y|x)

    • How much extra information you still need to supply on avg to communicate Y, given that the outcome of X is known
  • Chain Rule:
    • H(X, Y) = H(X) + H(Y|X)   = H(Y) + H(X|Y)
    • H(X1 .., Xn) = .... just expand
  • Mutual Information:  I(X;Y) = H(X) - H(X|Y)
    • "Reduction" in uncertainty about one Random Variable by knowing the outcome of other
    • How far a joint distribution is from independence
    • The amount of information one R.V. contains about another
    • Note: I(X;X) = H(X) - H(X|X) = H(X)
  • Conditional Mutual Information and the chain rule
    •    I(X; (Y|Z))
    • = I( (X; Y)|Z)
    • = H(X|Z) - H(X|(Y, Z))
  • Kullback-Lieibler Divergence (Relative Entropy)
    • For two p.m.f. p(x) and q(x),
    • image  image
    • Measures how different two probability dists are over the same event space.
    • Avg number of bits that are wasted by encoding events from dist p with code based on dist q.
    • D(p||q)=0 iff p=q
    • Often seen as "distance" but not quite right because i) not symmetric in p an q, and ii) does not satisfy: d(x,y) <= d(x,z) + d(z, y)
  • Cross Entropy
    • Let p be the true dist, q be the model dist, then Cross entropy is defined as:
    • \mathrm{H}(p, q) = \mathrm{E}_p[-\log q] = \mathrm{H}(p) + D_{\mathrm{KL}}(p \| q)\!,
    • Avg number of bits needed to communicate events when coding scheme is based on model dist q rather than true dist p
    • Cross entropy minimization is typically used when a model dist q is to be designed (often without knowing true dist p)

 

Not so basic results and some applications of IT in Part 2 ..

Random Sampling over Joins

Source: On Random Sampling over Joins. Surajit Chaudhuri, Rajeev Motwani, Vivek Narasayya, Sigmod 1999.

What?

  • Random sampling as a primitive relational operator: SAMPLE(R, f) where R is the relation and f the sample fraction.
  • SAMPLE(Q, f) is a tougher problem, where Q is a relation produced by a query
  • In particular, focus on sampling over a join operator
  • Can be generalize to arbitrarily deep join trees

Motivation:

  • Data mining scenarios - CUBE, OLAP, stream queries- need to sample a query rather than evaluate it
  • Statistical analysis when dealing with massive data
  • Massively distributed computing (information storage/retrieval) scenarios

Details:

  • Sampling Methods:
    1. With Replacement (WR),
    2. Without Replacement (WOR),
    3. Coin Flips (CF)
  • Conversion between methods is straightforward, as per my previous note.
  • Dimensions:
    • Sequential stream (critical for efficiently) vs random access on a materialized relation
    • Indexes, vs Stats vs no information
    • Weighted vs Unweighted sample
  • Note: Weighted, Sequential sample is the most general case

Sequential, unweighted: CF semantics - easy.
Sequential, unweighted, WOR - easy: Reservoir Sampling
Sequential, unweighted, WR - Algorithms:

Black Box U1: Relation R with n tuples, get WR sample of size r
Need to know the size of n :(
Produces samples while processing, preserves input order, O(n) time, O(1) extra memory

x = r
i = 0
while(t = stream.Next())
{
    Generate random variable X from BinomialDist(x, 1/(n-i))
    result.Add( X copies of t)
    x = x - X
    i = i++
}

Black Box U2:

No need to know n . With some modification can preserve the order
Does not produce result till the end.  O(n) time, O(r) space

N=0
Result[1..r]
while(t = stream.Next())
{
   N++
   for(j = 1 to r)
   {
       if(Rand.New(0,1) < 1/N)
           Result[j] = t
   }
}
return Result
Weighted Sampling

The above two algorithms can be easily modified for the weighted case:

Weighted U1

x = r, i = 0
W = sum of w(t), the weights for each input tuple t
while(t = stream.Next() && x>0)
{
    Generate random variable X from BinomialDist(x, w(t)/(W-i))
    result.Add( X copies of t)
    x = x - X
    i = i+ w(t)
}

Weighted U2

W=0
Result[1..r]
while(t = stream.Next())
{
   W = W + w(t)
   for(j = 1 to r)
   {
       if(Rand.New(0,1) < w(t)/W)
           Result[j] = t
   }
}
return Result
The difficulty in Join Sampling

Example: R1 = {1, 2, ..., 1000}, R2 = {1, 1, 1, ..... 1}. Unlikely that R1(1) will be sampled, and SAMPLE(R1) image SAMPLE(R2) will contain no result

  • SAMPLE does not commute with join
  • Sample tuple t from R1 with probability proportional to |R2(t)|

Algorithms

Let m1(v) denote the number of tuples in R1 that contain value v in the attribute to be used in equi-join.

Strategy Naive Sampling: produces WR samples

Compute J = R1 image R2
Use U1 or U2 to produce SAMPLE(J)

Strategy Olken Sample: produces WR samples
Requires indexes for R1 and R2

Let M be upper bound on m2(v) for all values A can take, which is essentially all rows in R2 (?)

while r tuples have not been produced
{
     Randomly pick a tuple t1 from R1 
     Randomly pick a tuple t2 from imageA=t1.A ( R2 )
     With probability  m2(t2.A)/M, output t1 image t2
}

Strategy Stream Sample   [Chaudhury, Motwani, Narasayya]
No information for R1, R2 has indexes/stats.

  1. Use a with-replacement strategy and get a sample WR (S1) from R1 WHERE tuple t (from R1) has weight m2(t.A)
  2. while(t1 = S1.next())
    {
        t2 = random sample from (SELECT t from R2 where t.A = t1.A)
        output t1 image t2
    }
  • 'non-oblivious sampling', where the distribution of R2 is used to bias the sample from R1
  • What about R1 image R2 image R3?
    • Pick non-uniform random sample for R1 image R2  whose distribution depends on R3
    • Sample from R1 using stats for R2 and R3.
  • Using the same biasing idea to push down both operand relations
    • Cross-dependent sampling strategy difficult

Other Results:

  • Not possible to commute SAMPLE over JOIN. That is, SAMPLE(R1 image R3) image SAMPLE(R1) image SAMPLE(R2)
  • Can push down SAMPLE to one side of JOIN tree by biasing the sample with respect to the other side of JOIN.

Converting Between Random Sampling Methods

 

Sampling f fraction out of n records:

  1. Sampling with replacement
    1. Sample is a multi-set of fn records. Any record could be samples multiple times.
  2. Sampling without replacement
    1. Each successive sample is uniformly at random from the remaining records
  3. Independent Coin flips: choose a record with probability f. The sample has s distinct records where s is a random variable with Binomial distribution B(n, f), and has expectation of fn elements in the set.
Conversions:
  • Sample with replacement --> Sample without replacement
    • Check upon each new selection and reject duplicates
  • Coin flips --> Sample without replacement
    • Sample a larger fraction so that at least f fractoin records are present in the sample (Chernoff bounds). Reject required number of samples to ensure we get exactly f samples.
  • Sample without replacement --> Sample with replacement
    • Sample with replacement from "without replacement" sample by using corresponding probabilities for duplication
  • Sample with replacement --> Coin flips sample
    • Impossible
  • Sample without replacement --> Coin flips sample
    • Impossible

Last two can be seen by considering an adversarial case. Under the binomial distribution, there is a non-zero probability of sampling the entire population. Since any sampled subset can never recreate the entire population, S with R and S without R can never be converted to the coin flips semantics.

Source: S. Chaudhuri, R. Motwani, V. Narasayya; On random Sampling over Joins.

Reservoir Sampling

A simple random sampling strategy to produce a sample without replacement from a stream of data - that is, in one pass: O(N)

Want to sample s instances - uniformly at random without replacement - from a population size of n records, where n is not known.

Figuring out n would require 2 passes. Reservoir sampling achieves this in 1 pass.
A reservoir R here is simply an array of size s. Let D be data stream of size n

Algorithm:

  • Store first s elements into R.
  • for each element in position k = s+1 to n ,
    • accept it with probability s/k
    • if accepted, choose a random element from R to replace.

Partial analysis:

Base case is trivial. For the k+1st case, the probability a given element i with position <= k is in R is s/k. The prob. i is replaced is the probability k+1st element is chosen multiplied by i being chosen to be replaced, which is: s/(k+1) * 1/s = 1/(k+1), and prob that i is not replaced is k/k+1.

So any given element's probability of lasting after k+1 rounds is: (chosen in k steps, and not removed in k steps)

= s/k * k/(k+1), which is s/(k+1).

So, when k+1 = n, any element is present with probability s/n.

Distributing Reservoir Sampling

It is very simple to distribute the reservoir sampling algorithm to n nodes.

Split the data stream into n partitions, one for each node. Apply reservoir sampling with reservoir size s, the final reservoir size, on each of the partitions. Finally, aggregate each reservoir into a final reservoir sample by carrying out reservoir sampling on them.

Lets say you split data of size n into 2 nodes, where each partition is of size n/2. Sub-reservoirs R1 and R2 are each of size s.

Probability that a record will be in sub-reservoir is:
s / (n/2) = 2s/n

The Probability that a record will end up in the final reservoir given it is in a sub-reservoir is: s/(2s) = 1/2.

It follows that the probability any given record will end up in the final reservoir is:
2s/n * 1/2 = s/n

 
Page view tracker