Presentation

A Hybrid Multicast-Unicast Infrastructure for
Efficient Publish-Subscribe in
Enterprise Networks
Danny Bickson, Ezra N. Hoch, Nir Naaman and Yoav Tock
IBM Haifa Research Lab, Israel
IBM Haifa Research Lab
Outline
 Motivation
 The channelization problem
 Our hybrid approach
 Experimental results
 Conclusions
2
IBM Haifa Research Lab
Motivation: large scale publish subscribe application
 Large number of information flows (topics)
and subscribers
 Each flow must be delivered to a subset of
interested subscribers
 Example: financial market data
dissemination
 Publisher divides data feed into a large number
information flows, (~100K)
e.g. stock symbols, futures, commodities
 Many stand-alone subscribers (~1K)
 Subscribers display interest heterogeneity are interested in different yet overlapping subsets
of the topics
 Any single topic may be delivered to a large
number of subscribers (hot / cold topics)
3
Data Vendor
WAN
Subscribers
Publisher
Enterprise LAN
Multiple information
flows (Topics)
IBM Haifa Research Lab
Common approaches
 Use unicast (point-to-point) connections
 Limitations: poor utilization of network resources (duplicate
transmissions)
 Use broadcast (single multicast channel)
 Limitations: receivers filter unwanted content
 Utilize multicast to transmit data
 Topics are mapped into multicast groups. Each user joins the
groups that cover his topic-interest.
 Reduces receiver filtering
 Limitations: limited amount of multicast addresses
Network element state problem
Receiver resources (NICs)
4
IBM Haifa Research Lab
Our novel contribution
 Create a hybrid approach that combines both multicast and unicast
 Flexible allocation of transmissions
 Topics with high interest enjoy efficiency of multicast
 Topics with low interest are transmitted in unicast
 Formalize as an optimization problem
 Propose a two step alternating method for computing the resource
allocation
5
IBM Haifa Research Lab
The Channelization Problem
n flows
Flow rates λ
k multicast groups
m users
Interest matrix W
The task: find mapping matrices X,Y that minimizes the communication cost
The cost of transmission – take into account transmission to multiple
groups
The cost of reception – minimize excess filtering
6
IBM Haifa Research Lab
The Hybrid Channelization Problem
Flows
X – flow to
group map
Multicast Groups
G1
G2
Y – user subscription map
Users
F1
Gk
F2
U1
F1 F2
F3
U2
F1 F2 F8
F4
U3
F3 F4 F6
Fn
Um
F1 Fn
T – unicast
transmission map
7
Interest
Extraction (W)
IBM Haifa Research Lab
The Hybrid Channelization Problem
 Modified cost function
Cost of multicast reception
Cost of multicast transmission
Cost of unicast
reception & transmission
 Problem objective is
8
IBM Haifa Research Lab
Proposed Solution
 Unfortunately the hybrid problem is NP-hard
 We propose a two step heuristic solution
 First step: solve the channelization problem (multicast mapping)
 Second step:
Choose flow-user pairs for unicast,
Remove redundant assignments from multicast mapping
Recalculate the cost
 Iterate until convergence, or unicast BW limit exceeded
9
IBM Haifa Research Lab
First step: channelization problem solution
 We have experimented with the following algorithms
 K-Means (2005) performs best
10
IBM Haifa Research Lab
K-Means Mapping Algorithm
 Input
 Interest matrix, topic rate vector
Interest Matrix =
Topics
Users
v
x
x
v v x x
x v v v
x
x
x
x
 Basic insight
 Put “similar” topics in the same group
 “Similar” topics have a similar audience causes less filtering
User’s
Interest Vector
Rate Vector =
Topic’s
Audience Vector
R1 R2
…
RK
 Take the rate into account
T1
T2
 Iterative Clustering Algorithm (K-means)
 Init: Topics are assigned into a fixed number of groups
 Move: In each step, remove a single topic, and move it to
the best group – the one producing the lowest cost
 Cost: After each epoch, compute total filtering cost
 Stop: cost doesn’t improve | time elapsed | max # iter.
11
?
T3
T4
T5
T6
T5
?
T7
T8
?
T9
IBM Haifa Research Lab
Second step: choosing user-flow pairs for unicast
 Experimented with several heuristics
 Heavy users - all transmission to a specific heavy user is sent using
unicast
 Lightweight flows - flows with low bandwidth are sent using unicast
 Greedy flows - move to unicast the flow which best minimizes the
total cost
 Greedy users - move to unicast the user which best minimizes the
total cost
 An additional heuristic - Greedy user-flow pairs – move to unicast
the user-flow pair which best minimizes the total cost - very slow,
impractical run-time
12
IBM Haifa Research Lab
Experimental results
 Construction of user-interest matrix W
 Random, uniform
 Market distribution – based on a model of NYSE stock volume
 IBM WebSphere cell – a real system
13
IBM Haifa Research Lab
Channelization algorithms
K-Means (2005)
performs best
Takes rate into account
Gradient decent on the
true cost function
14
IBM Haifa Research Lab
Effect of the interest matrix on channelization performance
 The interest and rate
have a significant effect
on channelization
performance
 Some interests have
patterns that are easy to
“channelize”
 Interests with less
entropy, more order, are
easier
15
IBM Haifa Research Lab
Hybrid Algorithm Heuristics
Unicast BW limit – algorithm will use optimal amount up to the limit
Market dist. - Greedy users
Can use more unicast BW
16
WebSphere dist. - Greedy flows
Doesn’t need more than 20% unicast BW
IBM Haifa Research Lab
Hybrid using greedy flow – unicast / multicast tradeoff
 Every interest and rate
distribution has an
optimal amount of
unicast BW it can use
 The hybrid approach
improves upon both
unicast-only and
multicat-only
Unicast BW allocation – exact amount of unicast BW used
17
IBM Haifa Research Lab
Conclusions
 We have presented a novel hybrid approach for publish subscribe
 We have shown using extensive and realistic simulation results that our
approach reduces consumed network and host resources
 K-Means (2005) performs best for channelization, from the selection of
algorithms we tested
 Greedy hybrid heuristics performed best in our tests
 Relative competitiveness of the greedy-flows & greedy-users heuristics
depends on the structure of the interest matrix and rate
18
~ The End ~
IBM Haifa Research Lab
Real Life Messaging Load Model
 Model based on
statistical analysis of
NYSE daily trade data
 20K Topics
 500 Subscribers
 Avg. ~70 flows / user
 Min 15
flows / user
 Max 115 flows / user
 Avg. message fan out
~10.1 clients
 Multicast - message is
transmitted once
 Unicast transmitter data
rate is x10 of multicast !
19
Backup – Model
IBM Haifa Research Lab
Messaging Load Model – Based on Market Research
4
Number of trades
~10% of Symbols
~55% of trade
3
10
2
10
1
10
0
10
0
1000
2000
Symbol rank
3000
4000
Avg. Message Rate
20
15
Backup – Model
Market 1
Market 2
10
Market 10
5
0
0
20
Daily trade, July 7 2004
Expo. fit
Daily trade min/max in July
10
Msg/Sec
 Financial front office
Hundreds of users, requiring stock quotes
and financial information from several
markets
 Topic space structure
Within each market, symbol popularity and
rate are exponentially distributed (NYSE
market research)
Several different markets, with Avg.
popularity and size prop. ~1/m (assumption).
20K flows, 10 markets, 500 users
 User interest
Each user: selects some markets, selects a
percent of the symbols from each chosen
market, according to the said distributions
NYSE daily trade
5
10
0.5
1
1.5
Symbols, by Market and Rank
2
4
x 10
IBM Haifa Research Lab
Interest Matrix
Mapping Algorithm
Topics
 Input
 interest matrix, topic rate vector
v x x x x
x v v x x
x x v v v
Users
 Basic insight
 Put “similar” topics in the same
group
 “Similar” topics have a similar
audience
 A group with a homogenous
audience causes less filtering
Topics with identical audience
Topics with similar audience
Topics
 Take the rate into account
 The cost of putting two topics
in the same group
 The cost of adding a new topic
to a group of topics
Users
1
2
3
4
1
2
v
v
x
x
x
v
v
x
R2
0
R1
0
R1+ R2
Rk – the rate of topic k
21
Filtering Cost
Backup – Algorithm
IBM Haifa Research Lab
T1
Iterative Clustering Algorithm (K-means)
 Init: Topics are assigned into a fixed number of groups
 Move: In each step, remove a single topic, and move it
to the best group – the one producing the lowest cost
 Cost: After each epoch, compute total filtering cost
 Stop: time elapsed | cost does not improve | exceeded
max number of iterations
Topic group
1 2 3
Users
v
v
v
x
x
x
x
v
v
x
x
x
v
x
v
v
x
x
Group
audience
vector
v
v
v
v
x
x
Candidate
topic 5
v
v
v
x
v
x
The cost of adding
topic 5
to topic group {1,2,3}
0
0
0
R5
R1+R2+R3
0
R1+R2+R3+R5
22
Backup – Algorithm
T2
?
T3
T4
T5
T6
T5
?
T7
T8
?
T9
The best group for
topic K
is the group
with the lowest cost