A Hybrid Multicast-Unicast Infrastructure for
Efficient Publish-Subscribe in
Enterprise Networks
Danny Bickson, Ezra N. Hoch, Nir Naaman and Yoav Tock
IBM Haifa Research Lab, Israel
IBM Haifa Research Lab
Outline
Motivation
The channelization problem
Our hybrid approach
Experimental results
Conclusions
2
IBM Haifa Research Lab
Motivation: large scale publish subscribe application
Large number of information flows (topics)
and subscribers
Each flow must be delivered to a subset of
interested subscribers
Example: financial market data
dissemination
Publisher divides data feed into a large number
information flows, (~100K)
e.g. stock symbols, futures, commodities
Many stand-alone subscribers (~1K)
Subscribers display interest heterogeneity are interested in different yet overlapping subsets
of the topics
Any single topic may be delivered to a large
number of subscribers (hot / cold topics)
3
Data Vendor
WAN
Subscribers
Publisher
Enterprise LAN
Multiple information
flows (Topics)
IBM Haifa Research Lab
Common approaches
Use unicast (point-to-point) connections
Limitations: poor utilization of network resources (duplicate
transmissions)
Use broadcast (single multicast channel)
Limitations: receivers filter unwanted content
Utilize multicast to transmit data
Topics are mapped into multicast groups. Each user joins the
groups that cover his topic-interest.
Reduces receiver filtering
Limitations: limited amount of multicast addresses
Network element state problem
Receiver resources (NICs)
4
IBM Haifa Research Lab
Our novel contribution
Create a hybrid approach that combines both multicast and unicast
Flexible allocation of transmissions
Topics with high interest enjoy efficiency of multicast
Topics with low interest are transmitted in unicast
Formalize as an optimization problem
Propose a two step alternating method for computing the resource
allocation
5
IBM Haifa Research Lab
The Channelization Problem
n flows
Flow rates λ
k multicast groups
m users
Interest matrix W
The task: find mapping matrices X,Y that minimizes the communication cost
The cost of transmission – take into account transmission to multiple
groups
The cost of reception – minimize excess filtering
6
IBM Haifa Research Lab
The Hybrid Channelization Problem
Flows
X – flow to
group map
Multicast Groups
G1
G2
Y – user subscription map
Users
F1
Gk
F2
U1
F1 F2
F3
U2
F1 F2 F8
F4
U3
F3 F4 F6
Fn
Um
F1 Fn
T – unicast
transmission map
7
Interest
Extraction (W)
IBM Haifa Research Lab
The Hybrid Channelization Problem
Modified cost function
Cost of multicast reception
Cost of multicast transmission
Cost of unicast
reception & transmission
Problem objective is
8
IBM Haifa Research Lab
Proposed Solution
Unfortunately the hybrid problem is NP-hard
We propose a two step heuristic solution
First step: solve the channelization problem (multicast mapping)
Second step:
Choose flow-user pairs for unicast,
Remove redundant assignments from multicast mapping
Recalculate the cost
Iterate until convergence, or unicast BW limit exceeded
9
IBM Haifa Research Lab
First step: channelization problem solution
We have experimented with the following algorithms
K-Means (2005) performs best
10
IBM Haifa Research Lab
K-Means Mapping Algorithm
Input
Interest matrix, topic rate vector
Interest Matrix =
Topics
Users
v
x
x
v v x x
x v v v
x
x
x
x
Basic insight
Put “similar” topics in the same group
“Similar” topics have a similar audience causes less filtering
User’s
Interest Vector
Rate Vector =
Topic’s
Audience Vector
R1 R2
…
RK
Take the rate into account
T1
T2
Iterative Clustering Algorithm (K-means)
Init: Topics are assigned into a fixed number of groups
Move: In each step, remove a single topic, and move it to
the best group – the one producing the lowest cost
Cost: After each epoch, compute total filtering cost
Stop: cost doesn’t improve | time elapsed | max # iter.
11
?
T3
T4
T5
T6
T5
?
T7
T8
?
T9
IBM Haifa Research Lab
Second step: choosing user-flow pairs for unicast
Experimented with several heuristics
Heavy users - all transmission to a specific heavy user is sent using
unicast
Lightweight flows - flows with low bandwidth are sent using unicast
Greedy flows - move to unicast the flow which best minimizes the
total cost
Greedy users - move to unicast the user which best minimizes the
total cost
An additional heuristic - Greedy user-flow pairs – move to unicast
the user-flow pair which best minimizes the total cost - very slow,
impractical run-time
12
IBM Haifa Research Lab
Experimental results
Construction of user-interest matrix W
Random, uniform
Market distribution – based on a model of NYSE stock volume
IBM WebSphere cell – a real system
13
IBM Haifa Research Lab
Channelization algorithms
K-Means (2005)
performs best
Takes rate into account
Gradient decent on the
true cost function
14
IBM Haifa Research Lab
Effect of the interest matrix on channelization performance
The interest and rate
have a significant effect
on channelization
performance
Some interests have
patterns that are easy to
“channelize”
Interests with less
entropy, more order, are
easier
15
IBM Haifa Research Lab
Hybrid Algorithm Heuristics
Unicast BW limit – algorithm will use optimal amount up to the limit
Market dist. - Greedy users
Can use more unicast BW
16
WebSphere dist. - Greedy flows
Doesn’t need more than 20% unicast BW
IBM Haifa Research Lab
Hybrid using greedy flow – unicast / multicast tradeoff
Every interest and rate
distribution has an
optimal amount of
unicast BW it can use
The hybrid approach
improves upon both
unicast-only and
multicat-only
Unicast BW allocation – exact amount of unicast BW used
17
IBM Haifa Research Lab
Conclusions
We have presented a novel hybrid approach for publish subscribe
We have shown using extensive and realistic simulation results that our
approach reduces consumed network and host resources
K-Means (2005) performs best for channelization, from the selection of
algorithms we tested
Greedy hybrid heuristics performed best in our tests
Relative competitiveness of the greedy-flows & greedy-users heuristics
depends on the structure of the interest matrix and rate
18
~ The End ~
IBM Haifa Research Lab
Real Life Messaging Load Model
Model based on
statistical analysis of
NYSE daily trade data
20K Topics
500 Subscribers
Avg. ~70 flows / user
Min 15
flows / user
Max 115 flows / user
Avg. message fan out
~10.1 clients
Multicast - message is
transmitted once
Unicast transmitter data
rate is x10 of multicast !
19
Backup – Model
IBM Haifa Research Lab
Messaging Load Model – Based on Market Research
4
Number of trades
~10% of Symbols
~55% of trade
3
10
2
10
1
10
0
10
0
1000
2000
Symbol rank
3000
4000
Avg. Message Rate
20
15
Backup – Model
Market 1
Market 2
10
Market 10
5
0
0
20
Daily trade, July 7 2004
Expo. fit
Daily trade min/max in July
10
Msg/Sec
Financial front office
Hundreds of users, requiring stock quotes
and financial information from several
markets
Topic space structure
Within each market, symbol popularity and
rate are exponentially distributed (NYSE
market research)
Several different markets, with Avg.
popularity and size prop. ~1/m (assumption).
20K flows, 10 markets, 500 users
User interest
Each user: selects some markets, selects a
percent of the symbols from each chosen
market, according to the said distributions
NYSE daily trade
5
10
0.5
1
1.5
Symbols, by Market and Rank
2
4
x 10
IBM Haifa Research Lab
Interest Matrix
Mapping Algorithm
Topics
Input
interest matrix, topic rate vector
v x x x x
x v v x x
x x v v v
Users
Basic insight
Put “similar” topics in the same
group
“Similar” topics have a similar
audience
A group with a homogenous
audience causes less filtering
Topics with identical audience
Topics with similar audience
Topics
Take the rate into account
The cost of putting two topics
in the same group
The cost of adding a new topic
to a group of topics
Users
1
2
3
4
1
2
v
v
x
x
x
v
v
x
R2
0
R1
0
R1+ R2
Rk – the rate of topic k
21
Filtering Cost
Backup – Algorithm
IBM Haifa Research Lab
T1
Iterative Clustering Algorithm (K-means)
Init: Topics are assigned into a fixed number of groups
Move: In each step, remove a single topic, and move it
to the best group – the one producing the lowest cost
Cost: After each epoch, compute total filtering cost
Stop: time elapsed | cost does not improve | exceeded
max number of iterations
Topic group
1 2 3
Users
v
v
v
x
x
x
x
v
v
x
x
x
v
x
v
v
x
x
Group
audience
vector
v
v
v
v
x
x
Candidate
topic 5
v
v
v
x
v
x
The cost of adding
topic 5
to topic group {1,2,3}
0
0
0
R5
R1+R2+R3
0
R1+R2+R3+R5
22
Backup – Algorithm
T2
?
T3
T4
T5
T6
T5
?
T7
T8
?
T9
The best group for
topic K
is the group
with the lowest cost