(pptx)

Intel
Concurrent Collections
(for Haskell)
Ryan Newton*, Chih-Ping Chen*, Simon Marlow+
*Intel
+Microsoft Research
Software and Services Group
Jul 29, 2010
Goals
•Push Parallel Scaling in Haskell
Streaming/Actors/
Dataflow
−Never trivial
−Harder with: Laziness, complex runtime
Cn C
•Offer another flavor of parallelism:
−Explicit graph based
a `par` b
forkIO$ do…
Pure operand parallelism
Data Parallelism
Threaded parallelism
[: x*y | x<-xs | y<-ys :]
Software and Services Group
2
Intel Concurrent Collections (CnC)
•These models:
A bit more domain specific
(than, say, futures)
•A payoff in parallelism achieved / effort
•CnC is approximately:
Streaming/Actors/
Dataflow
Cn C
− stream processing +
shared data structures
•Intel CnC: A library for C++
−Also for Java, .NET thanks to Rice U. (Vivek Sarkar et al)
•Yet… CnC requires and rewards pure functions -Haskell is a natural fit.
Software and Services Group
3
Goals
How would I
(typically) do
message
passing in
Haskell?
a `par` b
forkIO$ do…
Pure operand parallelism
Data Parallelism
Threaded parallelism
[: x*y | x<-xs | y<-ys :]
Software and Services Group
4
Message passing in Haskell
•IO Threads and Channels:
do action..
writeChan c msg
action..
do action..
readChan c
action..
This form of message passing sacrifices determinism.
Let’s try CnC instead…
Software and Services Group
5
CnC Model of Computation
•Tasks not threads
−Eager computation,
tags items
−Takes tag input,
1: 34
−Runs to completion
2:
tag
tags items
1:
2:
•Step reads data items
•Step produces data items
3:
•“Cnc Graph” =
5:
5:
6:
6:
network of steps
•DETERMINISTIC
4:
‘a’
99
λ tag.
Step
Step
3:
‘f’
4:
‘z’
tags
There’s a whole separate story about offline analysis
of CnC graphs - enables scheduling, GC, etc.
Software and Services Group
No time for it today!
6
How is CnC Purely Functional?
Domain
Range
Software and Services Group
7
How is CnC Purely Functional?
A set of tag/
item collections
Updates
(new tags, items)
MyStep
Domain
Range
Software and Services Group
8
An CnC graph is a function too
A set of tag/
item collections
Domain
A complete
CnC Eval
Range
Software and Services Group
9
Message passing in Haskell
IO Threads
and
Channels:
•Steps,
Items,
and
Tags
do action..
writeChan
c msg
Step1
action..
do action..
readChan c
Step2
action..
This form of message passing sacrifices determinism.
Let’s try CnC instead…
Software and Services Group
10
Haskell / CnC Synergies
C++ incarnation
Haskell incarnation
CnC Step
hopefully pure
execute method
Pure function
Complete CnC
Graph Execution
Graph execution
concurrently on other
threads. Blocking calls
allow environment to
collect results.
Pure function.
Enforce step purity.
Leverage the fact that
CnC can play nicely in a
pure framework.
Also, Haskell CnC is a fun experiment.
• We can try things like idempotent work stealing!
• Most imperative threading packages don’t support that.
• Can we learn things to bring back to other CnC’s?
Software and Services Group
11
How is CnC called from Haskell?
•Haskell is LAZY
•To control par. eval granularity
Fork
workers
Haskell CnC is mostly eager.
•When the result of a CnC graph
is needed, the whole graph
executes, in parallel, to
completion.
Need
result!
Par exec.
Recv.
result
−(e.g. WHNF = whole graph
evaluated)
Software and Services Group
12
A complete CnC program in Haskell
myStep items tag =
do word1 <- get items "left"
word2 <- get items "right"
put items "result" (word1 ++ word2 ++ show tag)
cncGraph =
do tags <- newTagCol
items <- newItemCol
prescribe tags (myStep items)
initialize$
do put items "left" "Hello "
put items "right" "World "
putt tags 99
finalize$
do get items "result"
main = putStrLn (runGraph cncGraph)
Software and Services Group
13
INITIAL RESULTS
Software and Services Group
14
I will show you graphs like this:
(4 socket 8 core Westmere)
Software and Services Group
15
Current Implementations
•IO based + pure impl.s
•Lots of fun with schedulers!
Thread per step instance.
Step(0)
−Lightweight user threads (3)
>Data.Map of MVars for data.
−Global work queue (4,5,6,7,10)
Step(1)
thr1
thr2
wrkr1
wrkr2
−Continuations + Work stealing (10,11)
>Simple Data.Sequence deques
Step(2)
thr3
…
wrkr3
Global
work
pool
wrkr1
wrkr2
−Haskell “spark” work stealing (8,9)
>(Mechanism used by ‘par’)
> Cilkish sync/join on
top of IO threads
Per-thread
work
dequesSoftware and Services Group
16
Back to graphs…
Software and Services Group
17
Recent Results
Software and Services Group
18
Recent Results
Software and Services Group
19
Recent Results
Software and Services Group
20
One bit of what we’re up against:
Example: Tight non-allocating
loops
Software and Services Group
21
Conclusions from early prototype
•Lightweight user threads (3)
 simple, predictable, but too much overhead
•Global work queue (4,5,6,7,10)
 better scaling than expected
•Work stealing (11)
 will win in the end, need good deques!
•Using Haskell “spark” work stealing (8,9)
 a flop (for now)
No clear scheduler winners! CnC mantra - separate tuning
(including scheduler selection) from application semantics.
Software and Services Group
22
Conclusions
•Push Scaling in Haskell
−Overcame problems (bugs, blackhole issues)
−GHC 6.13 does *much* better than 6.12
−Greater than 20X speedups
−A start.
•GHC will get scalable GC (eventually)
•A fix for non-allocating loops may be warranted.
•It’s open source (Hackage) - feel free to play the
“write a scheduler” game!
See my website for the draft paper! (google Ryan Newton)
Software and Services Group
23
Future Work, Haskell CnC Specific
•GHC needs a scalable garbage collector (in progress)
•Much incremental scheduler improvement left.
•Need more detailed profiling and better
understanding of variance
−Lazy evaluation and a complex runtime are a challenge!
Software and Services Group
25
Moving forward, radical optimization
•Today CnC is mostly “WYSIWYG”
−A step becomes a task (but we can control the scheduler)
−User is responsible for granularity
−Typical of dynamically scheduled parallelism
•CnC is moving to a point where profiling and graph analysis
enable significant transformations.
−Implemented currently: Hybrid static/dynamic scheduling
>Prune space of schedules / granularity choices offline
−Future: efficient selection of data structures for collections (next slide)
>Vectors (dense) vs. Hashmaps (sparse)
>Concurrent vs. nonconcurrent data structures
Software and Services Group
26
Tag functions: data access analysis
•How nice -- a declarative specification of data access!
(mystep : i) <- [mydata : i-1],
[mydata : i],
[mydata : i+1]
•Much less pain than traditional loop index analysis
•Use this to:
−Check program for correctness.
−Analyze density, convert data structures
−Generate precise garbage collection strategies
−Extreme:
full understanding of data deps. can replace control dependencies
Software and Services Group
27