FASTER UPPER BOUNDING OF INTERSECTION SIZES IBM Research Tokyo - Japan Daisuke Takuma Hiroki Yanagisawa © 2013 IBM Outline • Motivation of upper bounding • Contributions – Data structure • Proposal • Evaluation (introduce related work here) – Application • Evaluation • Conclusion Set intersection Computation of the common elements in given sets: B A 1 10 5 3 8 11 2 18 7 9 6 Intersection: A∩B = {2, 5, 7} Intersection size: |A∩B| = 3 Applications: •Multi-word search •JOIN operation : Recently the intersection operation is used in a different way. Some apps compute intersections only for their sizes. Top-k term query Input: the hit docs for a search query Output: the k most frequent terms in the hit docs Document sets Top-k terms (k=3) MILES DOOR DRIVING MILES Hit docs for a search query Ex. “ENGINE” HIGHWAY DRIVING HIGHWAY GEAR Determined by the intersection sizes Intersection sizes are evaluated in inequalities: |A∩B| > threshold, but mostly |A∩B| ≪ threshold. In 80% intersections in top-k term queries: |A∩B| < 0.05 × threshold |A∩B| ≦ threshold ≦ ≦ Upper bound of |A∩B| If this holds, we can skip the computation of the exact intersection Our contributions: • We proposed data structures Cardinality Filters (CF) to quickly compute upper bounds of intersection sizes. • We accelerated the top-k term query using CFs approximately by factor of 2. • CFs are applicable to all the algorithm evaluating |A∩B| ≧ threshold by adding the condition for the upper bound. upperbound(|A∩B|) ≧ threshold AND |A∩B| ≧ threshold It does not change the logic. Cardinality Filters Requirements for CFs Sets A, B ∈ 2X X A B A∩B (set intersection) Set size function: |・| CF conversion Φ |A∩B| ≦ |Φ(A) ∩ Φ(B)| CF size function: |・| Φ(A) Φ(B) Φ(A)∩ Φ(B) (CF intersection) Cardinality filters (CFs) Φ(A), Φ(B) ∈ Φ(2X) SCF: Single Cardinality Filter Φ(A) = (h(A), c(A)) where h(A) hash values, c(A): non-smallest values in collisions 7 8 10 12 14 A 0 Φ(A) 2 h(A) c(A): 10 14 |Φ(A)∩Φ(B)| = |h(A)∩h(B)|=3 0 Φ(B) 4 1 2 + |c(A)∩c(B)|=2 4 h(B) c(B): 7 10 11 14 Bit-wise AND Small B 0 2 3 5 7 10 11 14 Formal definition of SCF Let X be a universe set (A, B ⊂ X), and H be a hash value space H={ 0, 1, N, [|X|/N] - 1}, we define SCF Φ: 2X (2H, 2X) by Φ(A) = (h(A), c(A)) (A ⊂ X) where h(A) = { h(x) | x ∈ A } c(A) = { y∈A | (∃x∈A) (h(x) = h(y), x < y) } •Directly used in the proof. •The proof has no conditional branches. Thm: |A∩B| ≦ |Φ(A)∩Φ(B)| := |h(A)∩h(B)| + |c(A)∩c(B)| RCF: Recursive Cardinality Filter The RCF recursively applies the SCF to c(A). A L=3 H1(A), C1(A) H2(A), C2(A) H3(A), C3(A) h1(A) c1(A) h2(c1(A)) c2(c1(A)) h3(c2(c1(A))) c3(c2(c1(A))) |Φ(A)∩Φ(B)|=Σ1~L |Hi(A)∩Hi(B)| + |CL(A)∩CL(B)| We compared CFs with the following algorithms: • Exact intersections – Linear merge: comparison between sorted lists. – Binary search: binary search x∈A in B (|A| < |B|). – DK algorithm (VLDB 2011): • Intersect h(A) and h(B) by bit-wise AND (h:hash) • For all x ∈ h (A)∩h(B), intersect h-1(x)∩A and h-1(x)∩B. Candidates • Naïve upper bounding – Bloom filter: Test x∈A in the Bloom filter of B: h(B). Performance studies: Size (B) Middle (100K x 100K) - no correation LM RCF SCF BF DK2H DK1H BS LM 0 Asymmetry Correlation (D) Asymmetric (1M x 10K) - no correlation (F) Middle (100K x 100K) - negative correlation traditional DK naïve ours RCF SCF BF 120 100 80 60 40 20 0 DK2H 3000 2500 2000 1500 1000 500 0 DK1H RCF SCF BF DK2H DK1H 3 2.5 2 1.5 1 0.5 0 BS 3000 2500 2000 1500 1000 500 0 BS (E) Middle (100K x 100K) - positive correlation LM RCF SCF BF DK2H DK1H BS 6 5 4 3 2 1 0 LM 1500 1250 1000 750 500 250 0 RCF 2 SCF 4 BF 6 30 25 20 15 10 5 0 DK2H 8 150 125 100 75 50 25 0 DK1H 10 RCF SCF (C) Small (10K x 10K) - no correlation 12 1500 3 2.5 1250 1000 2 1.5 750 500 1 0.5 250 0 0 BF DK2H LM 15000 12500 10000 7500 5000 2500 0 DK1H Time Approximation ratio BS Approx. ratio LM (A) Large (1M x 1M) - no correlation BS Time [microsec] Time complexity and performance break down DK: O( (|A|+|B|)/√w + |A∩B| ) RCF: O( (|A|+|B|)/w ) (With a practical setting of h) |A∩B| is the barrier for all the exact algorithms. The benefit of considering the upper bounds. Algorithms DK1H DK2H Bit-wise AND SCF Other computations RCF 0 500 Time [microsec] 1000 CFs maximize the utilization of bit-wise AND. Improvement of the top-k term query The intersection sizes of CFs are evaluated before evaluation of the exact intersection sizes. Term CFs S: the doc IDs for the search query P[i] The doc IDs for term i (i=0, 1, N) h1 Query CF (Candidate CFs) h1 h2 The compression ratio for the n-th term: Safe factor × |P[k-1]| / |P[n]| h2 h1 h2 h5 h2 h5 h5 Index Intersect the CFs using the same hash functions. Improvement of top-k term queries Computation time [millisec] LM LM+BS DK1H SCF RCF 800 600 400 Ours 200 0 100 1000 10000 The number of hit documents 100000 More than 80% of the exact intersections were skipped. 1 SCF RCF 0.8 0.7 0.6 The number of hit documents 100000 50000 20000 10000 5000 2000 1000 500 200 0.5 100 Skip ratio 0.9 Summary • We proposed Cardinality Filters (SCF, RCF) to quickly compute upper bounds of intersection sizes. – Property: |A∩B| ≦ |Φ(A)∩Φ(B)| – Performance improvement from the exact algorithms. DK: O( (|A|+|B|)/√w + |A∩B| ) RCF: O( (|A|+|B|)/w ) • We accelerated the top-k term query using the CFs by factor of 2. Future work • Using other algebraic properties of SCF and RCF: – Monotonicity: A⊂B ⇒ |Φ(A)| ≦ |Φ(B)| – Commutative and associative laws of Φ(A)∩Φ(B) • Implementation of CFs having Union-like operation • Non-parametric CFs (compression ratio of h) Faster Upper Bounding of Intersection Sizes Thank you !!