Research Talk: Faster Upper Bounding Of Intersection Sizes- (Daisuke Takuma and Hiroki Yanagisawa, IBM Research Tokyo - Japan)

FASTER UPPER BOUNDING OF
INTERSECTION SIZES
IBM Research Tokyo - Japan
Daisuke Takuma
Hiroki Yanagisawa
© 2013 IBM
Outline
• Motivation of upper bounding
• Contributions
– Data structure
• Proposal
• Evaluation (introduce related work here)
– Application
• Evaluation
• Conclusion
Set intersection
Computation of the common elements in given sets:
B
A
1
10
5
3
8
11
2
18
7
9
6
Intersection: A∩B = {2, 5, 7}
Intersection size: |A∩B| = 3
Applications:
•Multi-word search
•JOIN operation
:
Recently the intersection
operation is used in a
different way.
Some apps compute intersections only for their sizes.
Top-k term query
Input: the hit docs for a search query
Output: the k most frequent terms in the hit docs
Document sets
Top-k terms (k=3)
MILES
DOOR
DRIVING
MILES
Hit docs
for a search query
Ex. “ENGINE”
HIGHWAY
DRIVING
HIGHWAY
GEAR
Determined by the
intersection sizes
Intersection sizes are evaluated in inequalities:
|A∩B| > threshold, but mostly |A∩B| ≪ threshold.
In 80% intersections in top-k term queries:
|A∩B| < 0.05 × threshold
|A∩B|
≦
threshold
≦
≦
Upper bound of |A∩B|
If this holds, we can
skip the computation of
the exact intersection
Our contributions:
• We proposed data structures Cardinality Filters (CF) to
quickly compute upper bounds of intersection sizes.
• We accelerated the top-k term query using CFs
approximately by factor of 2.
•
CFs are applicable to all the algorithm evaluating
|A∩B| ≧ threshold
by adding the condition for the upper bound.
upperbound(|A∩B|) ≧ threshold AND
|A∩B| ≧ threshold
It does not change the logic.
Cardinality Filters
Requirements for CFs
Sets A, B ∈ 2X
X
A
B
A∩B (set intersection)
Set size function: |・|
CF conversion Φ |A∩B| ≦ |Φ(A) ∩ Φ(B)|
CF size function: |・|
Φ(A)
Φ(B)
Φ(A)∩ Φ(B) (CF intersection)
Cardinality filters (CFs) Φ(A), Φ(B) ∈ Φ(2X)
SCF: Single Cardinality Filter
Φ(A) = (h(A), c(A)) where
h(A) hash values, c(A): non-smallest values in collisions
7 8
10
12
14
A
0
Φ(A)
2
h(A)
c(A): 10 14
|Φ(A)∩Φ(B)| =
|h(A)∩h(B)|=3
0
Φ(B)
4
1
2
+
|c(A)∩c(B)|=2
4
h(B)
c(B): 7 10 11 14
Bit-wise AND
Small
B
0
2 3
5
7
10 11
14
Formal definition of SCF
Let X be a universe set (A, B ⊂ X),
and H be a hash value space H={ 0, 1, N, [|X|/N] - 1},
we define SCF Φ: 2X (2H, 2X) by
Φ(A) = (h(A), c(A)) (A ⊂ X)
where
h(A) = { h(x) | x ∈ A }
c(A) = { y∈A | (∃x∈A) (h(x) = h(y), x < y) }
•Directly used in the proof.
•The proof has no conditional branches.
Thm:
|A∩B| ≦ |Φ(A)∩Φ(B)| := |h(A)∩h(B)| + |c(A)∩c(B)|
RCF: Recursive Cardinality Filter
The RCF recursively applies the SCF to c(A).
A
L=3
H1(A), C1(A)
H2(A), C2(A)
H3(A), C3(A)
h1(A)
c1(A)
h2(c1(A))
c2(c1(A))
h3(c2(c1(A)))
c3(c2(c1(A)))
|Φ(A)∩Φ(B)|=Σ1~L |Hi(A)∩Hi(B)| + |CL(A)∩CL(B)|
We compared CFs with the following algorithms:
• Exact intersections
– Linear merge: comparison between sorted lists.
– Binary search: binary search x∈A in B (|A| < |B|).
– DK algorithm (VLDB 2011):
• Intersect h(A) and h(B) by bit-wise AND (h:hash)
• For all x ∈ h (A)∩h(B),
intersect h-1(x)∩A and h-1(x)∩B.
Candidates
• Naïve upper bounding
– Bloom filter: Test x∈A in the Bloom filter of B: h(B).
Performance studies:
Size
(B) Middle (100K x 100K) - no correation
LM
RCF
SCF
BF
DK2H
DK1H
BS
LM
0
Asymmetry
Correlation
(D) Asymmetric (1M x 10K) - no correlation
(F) Middle (100K x 100K) - negative correlation
traditional DK naïve ours
RCF
SCF
BF
120
100
80
60
40
20
0
DK2H
3000
2500
2000
1500
1000
500
0
DK1H
RCF
SCF
BF
DK2H
DK1H
3
2.5
2
1.5
1
0.5
0
BS
3000
2500
2000
1500
1000
500
0
BS
(E) Middle (100K x 100K) - positive correlation
LM
RCF
SCF
BF
DK2H
DK1H
BS
6
5
4
3
2
1
0
LM
1500
1250
1000
750
500
250
0
RCF
2
SCF
4
BF
6
30
25
20
15
10
5
0
DK2H
8
150
125
100
75
50
25
0
DK1H
10
RCF
SCF
(C) Small (10K x 10K) - no correlation
12
1500
3
2.5 1250
1000
2
1.5 750
500
1
0.5 250
0
0
BF
DK2H
LM
15000
12500
10000
7500
5000
2500
0
DK1H
Time
Approximation ratio
BS
Approx. ratio
LM
(A) Large (1M x 1M) - no correlation
BS
Time
[microsec]
Time complexity and performance break down
DK: O( (|A|+|B|)/√w + |A∩B| )
RCF: O( (|A|+|B|)/w )
(With a practical setting of h)
|A∩B| is the barrier for all
the exact algorithms.
The benefit of considering
the upper bounds.
Algorithms
DK1H
DK2H
Bit-wise AND
SCF
Other computations
RCF
0
500
Time [microsec]
1000
CFs maximize the utilization of bit-wise AND.
Improvement of the top-k term query
The intersection sizes of CFs are evaluated before
evaluation of the exact intersection sizes.
Term CFs
S: the doc IDs for
the search query
P[i] The doc IDs for term i (i=0, 1, N)
h1
Query CF
(Candidate CFs)
h1
h2
The compression ratio
for the n-th term:
Safe factor ×
|P[k-1]| / |P[n]|
h2
h1
h2
h5
h2
h5
h5
Index
Intersect the CFs using the same hash functions.
Improvement of top-k term queries
Computation time
[millisec]
LM
LM+BS
DK1H
SCF
RCF
800
600
400
Ours
200
0
100
1000
10000
The number of hit documents
100000
More than 80% of the exact intersections were skipped.
1
SCF
RCF
0.8
0.7
0.6
The number of hit documents
100000
50000
20000
10000
5000
2000
1000
500
200
0.5
100
Skip ratio
0.9
Summary
• We proposed Cardinality Filters (SCF, RCF) to quickly
compute upper bounds of intersection sizes.
– Property: |A∩B| ≦ |Φ(A)∩Φ(B)|
– Performance improvement from the exact algorithms.
DK:
O( (|A|+|B|)/√w + |A∩B| )
RCF: O( (|A|+|B|)/w )
• We accelerated the top-k term query using the CFs by
factor of 2.
Future work
• Using other algebraic properties of SCF and RCF:
– Monotonicity: A⊂B ⇒ |Φ(A)| ≦ |Φ(B)|
– Commutative and associative laws of Φ(A)∩Φ(B)
• Implementation of CFs having Union-like operation
• Non-parametric CFs (compression ratio of h)
Faster Upper Bounding of Intersection Sizes
Thank you !!