Source: On Random Sampling over Joins. Surajit Chaudhuri, Rajeev Motwani, Vivek Narasayya, Sigmod 1999.
What?
Motivation:
Sequential, unweighted: CF semantics - easy. Sequential, unweighted, WOR - easy: Reservoir Sampling Sequential, unweighted, WR - Algorithms:
Black Box U1: Relation R with n tuples, get WR sample of size r Need to know the size of n :( Produces samples while processing, preserves input order, O(n) time, O(1) extra memory
x = r i = 0 while(t = stream.Next()) { Generate random variable X from BinomialDist(x, 1/(n-i)) result.Add( X copies of t) x = x - X i = i++ }
Black Box U2:
No need to know n . With some modification can preserve the order Does not produce result till the end. O(n) time, O(r) space
N=0 Result[1..r] while(t = stream.Next()) { N++ for(j = 1 to r) { if(Rand.New(0,1) < 1/N) Result[j] = t } } return Result
The above two algorithms can be easily modified for the weighted case:
Weighted U1
x = r, i = 0 W = sum of w(t), the weights for each input tuple t while(t = stream.Next() && x>0) { Generate random variable X from BinomialDist(x, w(t)/(W-i)) result.Add( X copies of t) x = x - X i = i+ w(t) }
Weighted U2
W=0 Result[1..r] while(t = stream.Next()) { W = W + w(t) for(j = 1 to r) { if(Rand.New(0,1) < w(t)/W) Result[j] = t } } return Result
Example: R1 = {1, 2, ..., 1000}, R2 = {1, 1, 1, ..... 1}. Unlikely that R1(1) will be sampled, and SAMPLE(R1) SAMPLE(R2) will contain no result
Let m1(v) denote the number of tuples in R1 that contain value v in the attribute to be used in equi-join.
Strategy Naive Sampling: produces WR samples
Compute J = R1 R2 Use U1 or U2 to produce SAMPLE(J)
Strategy Olken Sample: produces WR samples Requires indexes for R1 and R2
Let M be upper bound on m2(v) for all values A can take, which is essentially all rows in R2 (?)
while r tuples have not been produced { Randomly pick a tuple t1 from R1 Randomly pick a tuple t2 from A=t1.A ( R2 ) With probability m2(t2.A)/M, output t1 t2 }
Strategy Stream Sample [Chaudhury, Motwani, Narasayya] No information for R1, R2 has indexes/stats.