An Evaluation of Parallel Graph Partitioning and Ordering Softwares on a Massively Parallel Computer

Anshul Gupta
Business Analytics and Mathematical Sciences
IBM T. J. Watson Research Center
1101 Kitchawan Road
Yorktown Heights, NY 10598, USA
[email protected]

IBM Research Report RC 25008 (W1006-029)
Computer Science/Mathematics
June 9, 2010
Abstract
We empirically study state-of-the-art parallel graph partitioning and sparse matrix ordering software
packages. We compare their speed, quality, and robustness. For a model case, in which good partitionings
(even optimal partitionings, in some cases) can be constructed manually, we compare the size of the edge
cuts of the manual partitions with that of the partitions generated by the multilevel heuristics that are at the
heart of modern graph partitioning software packages. We show that the quality of the partitions generated
by the software is only slightly worse than that of the manual partition for this class of model graphs. We
discuss the shortcomings of the current ordering software and argue that there is an urgent need for more
robust, scalable, and high-quality software for sparse matrix ordering to support scalable solution of sparse
linear systems by direct methods on massively parallel computers.
1 Introduction
This report contains the results of a concise experimental study of two state-of-the-art parallel graph partitioning
software packages and three parallel sparse matrix ordering packages on the Blue Gene/P [8]. For our study, we
use a set of a dozen graphs derived from symmetric sparse matrices of moderate sizes. These matrices, listed in
Table 1, were obtained from the University of Florida collection [4]. We are aware of three distributed-memory
parallel software packages for graph partitioning, namely ParMETIS [7], PT-SCOTCH [3], and JOSTLE [10].
Since neither the source code nor binaries compiled for our target machine, BG/P, are available for JOSTLE,
we restricted the study to ParMETIS and PT-SCOTCH. We are aware of another parallel sparse matrix ordering
package known as PORD [9], but did not include it in the study: PORD does not contain graph partitioning
software, and, based on our earlier experience with MUMPS [1, 2] (a parallel sparse linear system solver package
that uses PORD ordering as the default), the quality of the orderings produced by the Parallel Watson Sparse
Matrix Package (WSMP) [6] is generally better. Therefore, we have included ordering results from the latter
for comparison with ParMETIS and PT-SCOTCH. We use Version 9.9 of WSMP, Version 3.2 of ParMETIS,
and Version 5.2.9 of PT-SCOTCH.
Matrix       N           NNZ           Application/Origin
Algor-big    1,073,724   84,317,460    Static stress analysis
Alsim-b      5,978,665   29,640,547    Power network analysis
Audikw 1     943,695     76,651,847    Automotive crankshaft modeling
G3 circuit   1,585,478   7,660,826     Circuit simulation
Lmco         665,017     107,514,163   Structural analysis
Mstamp-2c    902,289     70,925,391    Metal stamping
Nastran-b    1,508,088   111,614,436   Structural analysis
Nd24k        72,000      28,715,634    3-D mesh problem
Sgi 1M       1,522,431   125,755,875   Structural analysis
Ten-b        1,371,166   108,009,680   3-D metal forming
Thermal2     1,228,045   8,580,313     Steady-state thermal problem
Torso        201,142     3,161,120     Human torso modeling

Table 1: SPD test matrices with their order (N) and number of non-zeros (NNZ).
2 Graph Partitioning Results
In this section, we compare ParMETIS and PT-SCOTCH for graph partitioning. Both packages employ
multilevel heuristics. In multilevel partitioning, a graph is first coarsened by successively finding matchings
and then eliminating each matched edge by coalescing the vertices that it connects. After a sufficient number of
coarsening steps, when the graph has been reduced to a size below a predetermined threshold, it is partitioned.
Finally, the coarsening is undone one level at a time, while refining the partitioning at each step. When the
graph is fully uncoarsened to its original state, the partitioning is complete after the final refinement step. More
detailed descriptions of the parallel multilevel partitioning heuristics can be found in the relevant literature [3,
7]. A key difference between the partitioning strategies of ParMETIS and PT-SCOTCH is that ParMETIS
performs only one cycle of coarsening and refining irrespective of the number of partitions desired, whereas
PT-SCOTCH performs log2(k) cycles of coarsening and refining to partition a graph into k parts. This
is because ParMETIS partitions the coarsest graph into k parts and refines the complete k-way partitioning at each
uncoarsening step. PT-SCOTCH computes only a bisection, or 2-way partition, at a time. It then bisects the
partitions recursively until the desired number of partitions is computed. Clearly, this entails log2(k) steps.
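The difference in the number of coarsen/refine cycles can be sketched with a little arithmetic. The function below is a toy illustration of the two strategies described above, not code from either package; the function name and scheme labels are our own:

```python
import math

def multilevel_cycles(k: int, scheme: str) -> int:
    """Number of coarsen/uncoarsen-refine cycles needed to obtain k parts.

    'kway'      models the ParMETIS strategy: a single cycle, with the
                complete k-way partitioning refined at every uncoarsening step.
    'bisection' models the PT-SCOTCH strategy: each cycle only bisects,
                so k parts take ceil(log2(k)) levels of recursive bisection.
    """
    if scheme == "kway":
        return 1
    if scheme == "bisection":
        return math.ceil(math.log2(k))
    raise ValueError(f"unknown scheme: {scheme}")

# For 1024 parts: one cycle for k-way, ten levels of recursive bisection.
print(multilevel_cycles(1024, "kway"), multilevel_cycles(1024, "bisection"))  # → 1 10
```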
2.1 Comparison of partitioning quality and speed
Tables 2–5 show the results of partitioning the graphs in our test suite on 16, 64, 256, and 1024 BG/P nodes. In
each table, we report the partitioning time and quality of ParMETIS and PT-SCOTCH for partitioning the graphs
into 2, P, and 2P parts, where P is the number of nodes. We measure and report two metrics for the quality of the
partitioning. These are the total edge cut, or the number of edges that connect vertices belonging to different
partitions, and the maximum imbalance, or the percentage by which the size of the largest partition exceeds
the average partition size. Both partitioners accept a user-defined threshold of acceptable maximum imbalance.
We used a threshold of 5% in all our experiments. Default values are used for all other parameters for both
packages. A hyphen in the tables indicates a failure of the configuration corresponding to that entry, either
because of the partitioner crashing or running out of memory. For a ready comparison of the two packages, the
Matrix       Partitioner   2 parts                 16 parts                32 parts
Algor-big    ParMETIS      425039 (4.47%, 2.94)    1984596 (4.39%, 3.03)   2720681 (4.78%, 3.29)
             PT-SCOTCH     -                       -                       -
Alsim-b      ParMETIS      4081 (0.14%, 6.67)      26618 (2.16%, 6.81)     38260 (3.92%, 6.86)
             PT-SCOTCH     2680 (0.00%, 5.86)      15916 (2.54%, 16.0)     24537 (2.91%, 19.8)
Audikw 1     ParMETIS      106853 (1.03%, 9.56)    1357326 (4.66%, 10.4)   2027681 (4.98%, 10.9)
             PT-SCOTCH     107217 (0.85%, 9.24)    1367289 (4.39%, 30.0)   2072169 (5.17%, 40.7)
G3 circuit   ParMETIS      1994 (0.00%, 1.56)      14039 (3.68%, 1.60)     19295 (4.88%, 1.66)
             PT-SCOTCH     1244 (1.80%, 1.30)      9914 (3.44%, 3.80)      15152 (3.88%, 4.76)
Lmco         ParMETIS      154301 (0.99%, 3.00)    2481785 (5.00%, 3.87)   3962014 (5.03%, 4.26)
             PT-SCOTCH     145761 (1.46%, 4.20)    2298571 (5.25%, 18.1)   3975585 (6.30%, 34.0)
Mstamp-2c    ParMETIS      382815 (0.00%, 2.46)    1834896 (4.78%, 2.63)   2458087 (4.82%, 2.89)
             PT-SCOTCH     379260 (0.11%, 4.53)    1761246 (4.86%, 14.6)   2446541 (3.84%, 21.3)
Nastran-b    ParMETIS      103517 (0.63%, 16.1)    973744 (4.99%, 16.9)    1499380 (4.99%, 17.4)
             PT-SCOTCH     86274 (1.27%, 14.5)     899577 (4.90%, 43.5)    1491046 (5.95%, 56.8)
Nd24k        ParMETIS      795640 (2.37%, 3.02)    3061013 (5.02%, 4.21)   3899140 (5.33%, 4.64)
             PT-SCOTCH     701355 (2.00%, 4.88)    3019536 (6.49%, 13.4)   4067957 (7.56%, 14.3)
Sgi 1M       ParMETIS      134640 (1.60%, 22.0)    1591912 (4.99%, 23.2)   2558298 (4.87%, 24.0)
             PT-SCOTCH     126252 (1.98%, 19.1)    1509588 (5.92%, 54.9)   2466981 (6.96%, 70.7)
Ten-b        ParMETIS      348777 (2.76%, 4.19)    2175602 (4.57%, 4.56)   3182763 (4.87%, 4.82)
             PT-SCOTCH     355428 (1.09%, 5.41)    2140359 (4.61%, 20.7)   3096725 (5.36%, 34.1)
Thermal2     ParMETIS      1087 (0.01%, 1.42)      13995 (3.54%, 1.51)     22310 (3.99%, 1.55)
             PT-SCOTCH     1054 (1.19%, 0.98)      13080 (4.05%, 2.81)     20408 (4.30%, 3.34)
Torso        ParMETIS      10440 (4.43%, 0.38)     54662 (4.85%, 0.41)     80629 (5.00%, 0.44)
             PT-SCOTCH     9772 (1.99%, 0.51)      52758 (5.73%, 1.62)     79912 (6.80%, 2.43)
Geom. mean   ParMETIS      46608 (1.27%, 3.70)     370839 (4.39%, 4.09)    544540 (4.79%, 4.32)
             PT-SCOTCH     41016 (1.25%, 4.06)     331369 (4.74%, 12.8)    506780 (5.37%, 17.4)

Table 2: Partitioning statistics for ParMETIS and PT-SCOTCH on 16 processes (P = 16). Each entry shows edges cut (maximum imbalance, time in seconds); a '-' marks a failure.
bottom of each table contains geometric means of the edge cut, maximum imbalance, and partitioning time for
those graphs for which none of the partitioners failed.
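The two quality metrics are straightforward to compute from a partition vector; the sketch below (our own helper, not part of either package's API) makes the definitions concrete:

```python
def partition_quality(adj, part, nparts):
    """Edge cut and maximum imbalance of a vertex partition.

    adj:  adjacency list of an undirected graph (adj[u] lists u's neighbors)
    part: part[u] is the partition index of vertex u
    Returns (edge_cut, max_imbalance), where max_imbalance is the percentage
    by which the largest partition exceeds the average partition size.
    """
    # Count each undirected edge once (u < v) whose endpoints differ in part.
    cut = sum(1 for u, nbrs in enumerate(adj)
                for v in nbrs if u < v and part[u] != part[v])
    sizes = [0] * nparts
    for p in part:
        sizes[p] += 1
    avg = len(part) / nparts
    imbalance = 100.0 * (max(sizes) - avg) / avg
    return cut, imbalance

# A 4-cycle split into two equal parts: two cut edges, perfect balance.
adj = [[1, 3], [0, 2], [1, 3], [0, 2]]
print(partition_quality(adj, [0, 0, 1, 1], 2))  # → (2, 0.0)
```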
The comparison in Table 2 shows that the quality of the PT-SCOTCH partitions is better than that of the ParMETIS partitions
on 16 nodes. As expected, PT-SCOTCH takes longer to compute 16 or 32 partitions because of its recursive bisection strategy, unlike the single-cycle multiway partitioning strategy used in ParMETIS. The results in Table 3
for 64 nodes are similar to those in Table 2. A notable difference, however, is that the maximum imbalance in
PT-SCOTCH partitioning into 64 and 128 parts exceeds the 5% threshold. The cause of the growing imbalance in
PT-SCOTCH lies in its recursive bisection strategy. PT-SCOTCH treats each bisection as an independent partitioning problem: a given bisection does not take into account whether it is partitioning the smaller
or the larger partition from the previous bisection. Ideally, the balancing requirements on the larger partition
need to be more stringent in order to guarantee that the imbalance in the final partitioning does not exceed the
user-defined threshold. In the absence of an adaptive management of imbalance in each bisection computation,
the imbalance gets compounded at each level of bisection. Tables 4 and 5 show that this phenomenon becomes
more pronounced as the number of nodes and partitions increases. Although the edge cuts produced
by PT-SCOTCH remain smaller than those produced by ParMETIS, the high imbalance renders PT-SCOTCH
partitioning into more than 64 parts impractical for real applications.
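The compounding effect can be quantified with a simple worst-case model (our own simplification for illustration, not PT-SCOTCH's actual balancing rule): if every bisection independently allows its larger side to exceed half of its input by the 5% tolerance, then over log2(k) levels the largest final part can exceed the average by a factor of 1.05 raised to the number of levels.

```python
import math

def worst_case_imbalance(tol_pct: float, nparts: int) -> float:
    """Worst-case compounded imbalance (percent) under recursive bisection.

    Illustrative model: each of the log2(nparts) bisection levels lets its
    larger side exceed half of its input by tol_pct, so the largest final
    part can exceed the average by (1 + tol)^levels - 1.
    """
    levels = math.log2(nparts)
    return 100.0 * ((1.0 + tol_pct / 100.0) ** levels - 1.0)

# With a 5% per-bisection tolerance, 64 parts admit up to ~34% imbalance
# in this model; the observed PT-SCOTCH imbalances stay well below that
# worst case but grow with the number of parts, as the tables show.
for k in (16, 64, 256, 1024):
    print(k, round(worst_case_imbalance(5.0, k), 1))
```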
[Figure 1 plots, for P = 16, 64, 256, and 1024 BG/P nodes: the PT-SCOTCH to ParMETIS edge-cut ratio and each partitioner's percentage imbalance relative to the 5% threshold (left scale), together with each partitioner's partitioning time in seconds (right scale).]
Figure 1: ParMETIS and PT-SCOTCH comparison when the number of partitions is the same as the number of
MPI processes.
Figure 1 contains a graphical depiction of the geometric mean data from Tables 2–5. Note that for 1024
partitions on 1024 MPI processes, the geometric mean of the imbalance for ParMETIS too seems to go beyond
Matrix       Partitioner   2 parts                 64 parts                128 parts
Algor-big    ParMETIS      446459 (1.84%, 1.18)    3665634 (4.94%, 1.38)   4802692 (4.87%, 1.48)
             PT-SCOTCH     427831 (0.42%, 3.39)    3660252 (10.1%, 13.3)   4996580 (11.1%, 16.4)
Alsim-b      ParMETIS      4036 (0.03%, 2.05)      48122 (3.51%, 2.22)     64533 (4.53%, 2.29)
             PT-SCOTCH     2414 (0.90%, 2.61)      32241 (4.70%, 10.4)     47685 (5.68%, 11.3)
Audikw 1     ParMETIS      109180 (0.50%, 2.58)    2965230 (4.99%, 3.13)   4162268 (5.00%, 3.38)
             PT-SCOTCH     108981 (1.03%, 3.46)    2986228 (8.85%, 16.1)   4241224 (9.46%, 18.9)
G3 circuit   ParMETIS      1874 (2.27%, 0.61)      33245 (4.54%, 0.67)     53040 (4.28%, 0.70)
             PT-SCOTCH     1333 (0.05%, 0.50)      22183 (5.47%, 2.16)     36881 (5.51%, 2.58)
Lmco         ParMETIS      135529 (0.43%, 1.28)    5751471 (5.00%, 1.82)   8079937 (5.27%, 2.08)
             PT-SCOTCH     129529 (0.96%, 2.59)    5780260 (8.59%, 15.9)   8195493 (9.69%, 18.8)
Mstamp-2c    ParMETIS      384021 (0.85%, 1.08)    3244471 (4.98%, 1.20)   4313942 (4.99%, 1.29)
             PT-SCOTCH     376713 (0.31%, 3.00)    3195067 (7.40%, 11.4)   4453477 (7.97%, 14.0)
Nastran-b    ParMETIS      88479 (0.76%, 6.37)     2225284 (4.99%, 7.11)   3263959 (5.01%, 7.50)
             PT-SCOTCH     90180 (1.47%, 5.52)     2253226 (7.99%, 23.3)   3255261 (9.07%, 28.0)
Nd24k        ParMETIS      695354 (1.60%, 6.37)    5303531 (5.42%, 9.87)   6993334 (7.56%, 12.1)
             PT-SCOTCH     689704 (2.00%, 2.79)    5404439 (11.6%, 7.93)   6979632 (12.7%, 8.17)
Sgi 1M       ParMETIS      131417 (1.46%, 7.05)    3984616 (5.00%, 7.99)   5717904 (5.00%, 8.42)
             PT-SCOTCH     133236 (0.33%, 6.10)    3887558 (8.74%, 25.2)   5747958 (9.83%, 30.8)
Ten-b        ParMETIS      365490 (0.54%, 1.64)    4377504 (4.65%, 1.90)   5672840 (4.99%, 2.01)
             PT-SCOTCH     354393 (0.63%, 3.32)    4337095 (7.15%, 17.1)   5725184 (7.75%, 20.8)
Thermal2     ParMETIS      1290 (0.00%, 0.50)      32837 (4.53%, 0.56)     49497 (4.46%, 0.60)
             PT-SCOTCH     1077 (0.46%, 0.34)      32545 (6.93%, 1.62)     47185 (7.59%, 1.81)
Torso        ParMETIS      11320 (2.83%, 0.20)     114436 (4.61%, 0.24)    156181 (4.94%, 0.25)
             PT-SCOTCH     10126 (1.14%, 0.31)     113099 (8.37%, 1.25)    157246 (9.51%, 1.44)
Geom. mean   ParMETIS      55357 (1.09%, 1.58)     891198 (4.76%, 1.88)    1243358 (5.08%, 2.03)
             PT-SCOTCH     49832 (0.81%, 1.99)     831423 (7.99%, 8.49)    1183434 (8.82%, 9.92)

Table 3: Partitioning statistics for ParMETIS and PT-SCOTCH on 64 processes (P = 64). Each entry shows edges cut (maximum imbalance, time in seconds); a '-' marks a failure.
Matrix       Partitioner   2 parts                 256 parts                512 parts
Algor-big    ParMETIS      447753 (0.95%, 0.41)    6252550 (5.00%, 0.56)    7927361 (5.00%, 0.68)
             PT-SCOTCH     425736 (0.82%, 1.98)    6435014 (13.2%, 7.62)    8085763 (14.0%, 8.30)
Alsim-b      ParMETIS      3145 (0.00%, 1.17)      84722 (4.55%, 1.38)      118210 (4.86%, 1.48)
             PT-SCOTCH     -                       -                        -
Audikw 1     ParMETIS      112311 (0.26%, 1.08)    5710116 (5.31%, 1.40)    7524868 (5.31%, 1.57)
             PT-SCOTCH     104535 (0.84%, 1.58)    5790781 (12.6%, 8.98)    7707901 (12.6%, 9.56)
G3 circuit   ParMETIS      1878 (0.00%, 0.60)      78502 (3.74%, 0.69)      105678 (3.43%, 0.70)
             PT-SCOTCH     1332 (1.92%, 0.30)      55433 (9.05%, 1.49)      80242 (9.57%, 1.62)
Lmco         ParMETIS      144141 (0.11%, 0.50)    10920589 (5.25%, 0.85)   14247071 (5.26%, 1.07)
             PT-SCOTCH     -                       -                        -
Mstamp-2c    ParMETIS      412146 (0.29%, 0.34)    5508116 (5.12%, 0.49)    7063156 (5.20%, 0.60)
             PT-SCOTCH     384759 (1.02%, 1.86)    5700756 (10.6%, 8.54)    7317206 (10.7%, 7.55)
Nastran-b    ParMETIS      96904 (0.14%, 2.32)     4773917 (5.16%, 2.28)    6611421 (5.28%, 3.06)
             PT-SCOTCH     92358 (0.29%, 2.11)     4669752 (10.1%, 11.4)    6598890 (11.1%, 12.4)
Nd24k        ParMETIS      730119 (1.29%, 6.33)    8684320 (10.9%, 10.4)    11333968 (27.3%, 11.1)
             PT-SCOTCH     -                       -                        -
Sgi 1M       ParMETIS      135873 (1.21%, 2.80)    7651743 (5.03%, 3.38)    10200306 (5.70%, 3.72)
             PT-SCOTCH     128457 (2.00%, 2.47)    7729759 (15.7%, 12.5)    10351402 (16.9%, 13.4)
Ten-b        ParMETIS      366462 (1.07%, 0.53)    7319344 (5.02%, 0.72)    9400789 (5.19%, 0.89)
             PT-SCOTCH     346923 (1.37%, 2.11)    7489539 (15.0%, 10.3)    9753346 (15.0%, 11.3)
Thermal2     ParMETIS      1129 (0.00%, 0.59)      70447 (4.69%, 0.71)      100101 (4.43%, 0.71)
             PT-SCOTCH     1126 (2.00%, 0.21)      71706 (10.3%, 1.10)      99916 (10.9%, 1.17)
Torso        ParMETIS      13509 (1.57%, 0.29)     213203 (4.87%, 0.35)     281944 (5.13%, 0.43)
             PT-SCOTCH     9941 (0.21%, 0.25)      214744 (10.1%, 1.15)     285206 (11.2%, 1.19)
Geom. mean   ParMETIS      52173 (0.61%, 0.72)     1581794 (4.88%, 0.89)    2100704 (4.96%, 1.04)
             PT-SCOTCH     46672 (1.16%, 1.00)     1541061 (11.8%, 4.90)    2068988 (12.4%, 5.15)

Table 4: Partitioning statistics for ParMETIS and PT-SCOTCH on 256 processes (P = 256). Each entry shows edges cut (maximum imbalance, time in seconds); a '-' marks a failure.
[Figure 2 plots, for 16, 64, 256, and 1024 BG/P nodes: the PT-SCOTCH to ParMETIS edge-cut ratio and each partitioner's percentage imbalance relative to the 5% threshold (left scale), together with each partitioner's partitioning time in seconds (right scale).]
Figure 2: ParMETIS and PT-SCOTCH comparison for graph bisection.
5%. However, a closer look at Table 5 reveals that the statistics are skewed by a single bad case: the graph
nd24k, which has only 72,000 vertices and therefore very few vertices per partition in a 1024-way partitioning.
Figure 2 shows similar statistics for graph bisection (2-way partitioning) as the number of MPI processes increases
from 16 to 1024. This figure, along with Table 2, shows that the partition quality obtained by PT-SCOTCH is
slightly better than that obtained from ParMETIS for a small number of partitions (irrespective of the number of
MPI processes).
2.2 Heuristic versus manual partitions for a model problem
In this section, we consider partitioning regular unweighted 3-D graphs with a 7-point cubic stencil, in which
each vertex is connected to its immediate neighbors in both directions along the x, y, and z axes. Tables 6 and 7
compare the edge cuts of partitions generated by ParMETIS with those of balanced manual partitions created simply
by introducing cutting planes along each dimension so as to maintain load balance and the best possible
aspect ratio of the dimensions of the partitions.
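The manual cuts in Tables 6 and 7 can be reproduced by a short search over factorizations of the number of parts. The sketch below is our own reconstruction of the counting argument (it assumes the chosen factors divide the cube dimension evenly, which holds for all configurations in the tables):

```python
from itertools import product

def manual_cut(n: int, k: int) -> int:
    """Smallest planar-cut edge count for an n x n x n 7-point grid
    split into k equal box-shaped parts by axis-aligned cutting planes.

    Each plane perpendicular to an axis cuts n*n grid edges, so a
    px x py x pz decomposition (px*py*pz = k) cuts
    (px - 1) + (py - 1) + (pz - 1) planes, i.e. (px+py+pz-3)*n*n edges.
    We search all factorizations for the best aspect ratio, which
    minimizes px + py + pz.
    """
    divisors = [d for d in range(1, k + 1) if k % d == 0]
    best = None
    for px, py in product(divisors, divisors):
        if k % (px * py) == 0:
            pz = k // (px * py)
            cut = (px + py + pz - 3) * n * n
            best = cut if best is None else min(best, cut)
    return best

# Matches the manual entries of Table 6, e.g. 114688 for n = 128 with
# 32 parts (a 4 x 4 x 2 decomposition) and 1376256 for n = 256 with
# 512 parts (8 x 8 x 8).
print(manual_cut(128, 32), manual_cut(256, 512))  # → 114688 1376256
```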
It can be observed that the ratio of the sizes of cuts produced by ParMETIS to those generated manually is
close to 1.5 for a wide range of graph sizes, partition sizes, and number of MPI processes. The ratio deteriorates
slightly as the size of the graph increases, and improves very slightly as the number of partitions increases. In
fact, a close observation reveals that the ratio is almost constant for a given average number of vertices per
partition. Table 7 shows that when the number of MPI processes is kept constant at 32 and the number of
Matrix       Partitioner   2 parts                 1024 parts               2048 parts
Algor-big    ParMETIS      475753 (0.80%, 0.32)    10074683 (5.29%, 0.57)   12667459 (5.86%, 0.96)
             PT-SCOTCH     419560 (1.56%, 1.80)    10490912 (14.4%, 6.16)   13142764 (15.8%, 6.20)
Alsim-b      ParMETIS      3773 (0.55%, 2.30)      171165 (4.80%, 2.59)     239776 (4.68%, 2.87)
             PT-SCOTCH     2458 (0.49%, 1.81)      143195 (9.34%, 5.73)     206839 (9.24%, 5.85)
Audikw 1     ParMETIS      109746 (0.11%, 0.81)    9745613 (5.58%, 1.15)    12534227 (7.21%, 2.36)
             PT-SCOTCH     113211 (1.46%, 1.26)    9954688 (14.8%, 6.32)    12761406 (16.1%, 6.36)
G3 circuit   ParMETIS      2156 (0.00%, 1.26)      140353 (4.24%, 1.04)     188705 (5.92%, 1.49)
             PT-SCOTCH     1483 (2.00%, 0.58)      114184 (9.47%, 1.85)     161962 (9.93%, 1.89)
Lmco         ParMETIS      136665 (0.48%, 1.77)    18519743 (7.02%, 1.50)   23575523 (7.48%, 2.84)
             PT-SCOTCH     130270 (1.00%, 1.15)    18840372 (16.4%, 7.11)   24082589 (17.6%, 7.17)
Mstamp-2c    ParMETIS      413838 (0.72%, 0.45)    8987618 (5.54%, 0.78)    11280209 (7.36%, 1.07)
             PT-SCOTCH     371628 (0.89%, 1.73)    9277671 (14.1%, 5.47)    11628226 (16.9%, 5.47)
Nastran-b    ParMETIS      98559 (0.04%, 7.69)     9004369 (5.65%, 8.06)    12076573 (6.47%, 8.11)
             PT-SCOTCH     88893 (0.78%, 1.44)     8962402 (16.0%, 7.31)    12093880 (17.2%, 7.42)
Nd24k        ParMETIS      723707 (1.34%, 6.55)    12348960 (39.4%, 10.9)   13154160 (39.4%, 11.0)
             PT-SCOTCH     694100 (2.06%, 2.70)    11994364 (18.0%, 6.40)   13097836 (22.3%, 6.42)
Sgi 1M       ParMETIS      145908 (1.24%, 9.44)    13429147 (5.87%, 8.80)   17272737 (6.41%, 10.1)
             PT-SCOTCH     132912 (0.59%, 1.71)    13698648 (13.6%, 7.86)   17561488 (14.9%, 7.95)
Ten-b        ParMETIS      380961 (0.03%, 0.46)    11941106 (5.37%, 0.84)   15030032 (5.75%, 1.38)
             PT-SCOTCH     359883 (1.48%, 1.70)    12312997 (14.5%, 7.47)   15521890 (15.2%, 7.61)
Thermal2     ParMETIS      1212 (0.01%, 2.05)      141401 (4.31%, 2.32)     197978 (4.06%, 2.42)
             PT-SCOTCH     1102 (0.93%, 0.51)      148107 (11.8%, 1.44)     203343 (11.9%, 1.47)
Torso        ParMETIS      12347 (0.71%, 0.52)     365929 (5.38%, 0.72)     469777 (5.89%, 1.01)
             PT-SCOTCH     10310 (0.39%, 0.42)     369946 (14.0%, 1.47)     475324 (15.1%, 1.47)
Geom. mean   ParMETIS      57919 (0.50%, 1.51)     2905685 (8.20%, 1.90)    3733212 (8.87%, 2.59)
             PT-SCOTCH     50590 (1.14%, 1.23)     2857623 (13.9%, 4.61)    3699231 (15.2%, 4.67)

Table 5: Partitioning statistics for ParMETIS and PT-SCOTCH on 1024 processes (P = 1024). Each entry shows edges cut (maximum imbalance, time in seconds); a '-' marks a failure.
partitions is increased, the ratio improves somewhat faster than what we observe in Table 6. This shows
that the partition quality declines very slightly with an increasing number of MPI processes.
Cube dimension (n),              32-way     64-way     128-way    256-way    512-way    1024-way
graph size (N = n^3)
n = 128 (N = 2,097,152)
  ParMETIS                       168722     232199     305206     396226     508356     635988
  Manual                         114688    *147456     212992     278528    *344064     475136
  Ratio                          1.47       1.57       1.43       1.42       1.48       1.34
n = 256 (N = 16,777,216)
  ParMETIS                       692838     941707     1264127    1665213    2137405    2722783
  Manual                         458752    *589824     851968     1114112   *1376256    1900544
  Ratio                          1.51       1.60       1.48       1.49       1.55       1.43
n = 448 (N = 89,915,392)
  ParMETIS                       2119976    2951874    3982187    5207300    6758690    8601565
  Manual                         1404928   *1806336    2609152    3411968   *4214784    5820416
  Ratio                          1.51       1.63       1.53       1.53       1.60       1.48
n = 576 (N = 191,102,976)
  ParMETIS                       3543317    4946173    6743418    8787701    11267256   14477239
  Manual                         2322432   *2985984    4313088    5640192   *6967296    9621504
  Ratio                          1.53       1.66       1.56       1.56       1.62       1.50

Table 6: A comparison of the best (smallest possible) and ParMETIS edge cuts for unweighted 3-D graphs with
a 7-point cubic stencil. The number of BG/P nodes used is the same as the number of partitions. A '*' indicates
an optimal partition.
Cube dimension (n),              32-way     64-way     128-way    256-way    512-way    1024-way
graph size (N = n^3)
n = 256 (N = 16,777,216)
  ParMETIS                       692838     918806     1216633    1589311    2036719    2577758
  Manual                         458752    *589824     851968     1114112   *1376256    1900544
  Ratio                          1.51       1.56       1.43       1.43       1.48       1.36

Table 7: A comparison of the best (smallest possible) and ParMETIS edge cuts for unweighted 3-D graphs with
a 7-point cubic stencil. 32 BG/P nodes are used for every partitioning. A '*' indicates an optimal partition.
2.3 Conclusions from partitioning experiments
The results in Section 2.1 indicate that PT-SCOTCH outperforms ParMETIS in terms of partition quality for a
small number of partitions. However, for a large number of partitions, ParMETIS is the only practical option.
The results in Section 2.2 indicate that, at least for a 3-D model problem, the quality of the partitions
generated by the parallel multilevel heuristics is worse than the optimal partitioning by only a small factor, and
that this factor stays more or less constant over a wide range of graph sizes, MPI processes, and number of
partitions.
Matrix       ParMETIS            PT-SCOTCH           PWSSMP (uncompressed)   PWSSMP (compressed)
             Time   Opcount      Time   Opcount      Time   Opcount          Time   Opcount
Algor-big    32.7   4.21e+13     -      -            228.   2.34e+13         83.4   2.31e+13
Alsim-b      36.8   1.31e+11     33.0   2.43e+11     68.9   1.52e+11         72.0   1.57e+11
Audikw 1     44.8   6.02e+12     43.2   5.99e+12     68.6   5.05e+12         19.6   5.20e+12
G3 circuit   10.1   6.46e+10     10.2   6.84e+10     17.7   6.94e+10         18.9   6.95e+10
Lmco         41.7   4.09e+12     34.3   4.05e+12     88.1   3.11e+12         24.4   3.27e+12
Mstamp-2c    30.4   2.80e+13     25.2   1.98e+13     154.   1.62e+13         27.8   1.62e+13
Nastran-b    59.7   3.20e+12     66.3   3.16e+12     87.6   2.70e+12         29.5   3.76e+12
Nd24k        71.2   2.04e+12     14.1   2.11e+12     131.   2.06e+12         143.   1.97e+12
Sgi 1M       77.1   8.98e+12     80.7   8.71e+12     107.   7.40e+12         34.1   7.84e+12
Ten-b        42.7   4.98e+13     36.3   3.59e+13     179.   2.91e+13         41.0   2.90e+13
Thermal2     8.64   1.64e+10     5.61   1.77e+10     12.8   1.75e+10         14.4   1.76e+10
Torso        6.34   1.40e+11     2.45   1.44e+11     5.50   1.27e+11         6.44   1.27e+11
Geom. mean   29.8   1.40e+12     21.6   1.41e+12     56.2   1.12e+12         28.4   1.25e+12

Table 8: Ordering statistics for ParMETIS, PT-SCOTCH, and PWSSMP on 16 processes (P = 16). Time is in
seconds; Opcount is the factorization operation count for the computed ordering. A '-' marks a failure.
3 Parallel Sparse Matrix Ordering
In this section, we compare the time and quality of the parallel fill-reducing orderings produced by ParMETIS, PT-SCOTCH, and parallel WSMP on the set of matrices described in Table 1. ParMETIS and PT-SCOTCH use
a distributed multilevel strategy to compute vertex separators of the graph of the matrix to generate a nested-dissection [5] type permutation of the vertices. In contrast, parallel WSMP generates vertex separators
sequentially. It first computes a node bisector of the entire graph on a single MPI process. The two subgraphs
are subsequently processed independently in parallel, and the process continues recursively. Note that there are
two potential disadvantages of this approach. First, only limited speedup from parallelism can be expected
because a significant amount of computation that goes into finding the first separator is performed sequentially,
and the parallelism increases gradually as the size of the subgraphs decreases. The second problem is that this
approach requires the entire graph to be stored on a single process to compute the first separator, and hence is
not scalable with respect to memory, although this may not be a problem for highly compressible graphs (as
discussed below).
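The recursive separator strategy can be sketched as follows. This is a toy illustration of nested-dissection ordering with a hypothetical `bisect` routine standing in for the real separator finder, not WSMP's actual implementation:

```python
def nested_dissection(vertices, bisect, order=None):
    """Sketch of nested-dissection ordering by recursive vertex separators.

    `bisect` returns (left, right, separator) for a vertex set.  The two
    halves can then be ordered independently -- this is where the
    parallelism in the strategy described above comes from -- while the
    separator vertices are numbered last to limit fill in factorization.
    """
    if order is None:
        order = []
    if len(vertices) <= 2:                    # small sets: order directly
        order.extend(vertices)
        return order
    left, right, sep = bisect(vertices)
    nested_dissection(left, bisect, order)    # independent subproblems that
    nested_dissection(right, bisect, order)   # could run on disjoint processes
    order.extend(sep)                         # separator numbered last
    return order

# Toy bisector for a path graph 0-1-...-k: the middle vertex separates it.
def path_bisect(vs):
    m = len(vs) // 2
    return vs[:m], vs[m + 1:], [vs[m]]

print(nested_dissection(list(range(7)), path_bisect))  # → [0, 2, 1, 4, 6, 5, 3]
```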
Some of the matrices in our test suite have graphs with multiple degrees of freedom per node. Parallel WSMP ordering can automatically detect this and take it into account to reduce the memory and time requirements of ordering. On the other hand, the user must explicitly supply compressed-graph information to
ParMETIS and PT-SCOTCH. In our experiments, we used ParMETIS and PT-SCOTCH with the original
uncompressed graphs, and parallel WSMP with the graph compression option turned both off and on. Default
values for all other parameters were used, with the exception of the matching type for ParMETIS, which was set
to PARMETIS_MTYPE_GLOBAL. Setting this to the default value of PARMETIS_MTYPE_LOCAL resulted
in some improvement in runtime and some deterioration in the quality of the ordering. Tables 8–11 show the
parallel ordering results for the four cases on 16 to 1024 BG/P nodes. The last row of each table shows the
geometric means of the column values over all those matrices that all the packages were able to reorder. Figure 3
Matrix       ParMETIS            PT-SCOTCH           PWSSMP (uncompressed)   PWSSMP (compressed)
             Time   Opcount      Time   Opcount      Time   Opcount          Time   Opcount
Algor-big    44.4   3.47e+13     17.5   4.08e+13     140.   2.29e+13         76.4   2.28e+13
Alsim-b      20.3   1.24e+11     -      -            58.0   1.46e+11         61.3   1.46e+11
Audikw 1     66.6   5.55e+12     17.1   5.86e+12     92.3   5.07e+12         18.8   5.17e+12
G3 circuit   16.0   6.21e+10     7.18   6.71e+10     14.4   7.10e+10         15.7   7.09e+10
Lmco         80.1   3.96e+12     17.2   4.08e+12     77.4   3.23e+12         23.3   3.19e+12
Mstamp-2c    51.6   2.21e+13     14.6   1.93e+13     141.   1.61e+13         27.2   1.62e+13
Nastran-b    58.9   2.89e+12     27.2   3.27e+12     84.3   2.78e+12         27.2   3.68e+12
Nd24k        99.0   2.03e+12     8.52   2.10e+12     130.   2.03e+12         129.   2.10e+12
Sgi 1M       71.8   7.91e+12     30.2   8.37e+12     97.8   7.33e+12         31.4   7.62e+12
Ten-b        47.6   4.40e+13     20.5   4.99e+13     209.   2.91e+13         39.6   2.88e+13
Thermal2     10.5   1.71e+10     2.44   1.82e+10     11.5   1.72e+10         13.0   1.74e+10
Torso        16.1   1.19e+11     1.48   1.49e+11     4.93   1.33e+11         5.87   1.33e+11
Geom. mean   41.7   2.14e+12     11.0   2.31e+12     58.2   1.90e+12         26.6   1.97e+12

Table 9: Ordering statistics for ParMETIS, PT-SCOTCH, and PWSSMP on 64 processes (P = 64). Time is in
seconds; Opcount is the factorization operation count for the computed ordering. A '-' marks a failure.
Matrix       ParMETIS            PT-SCOTCH           PWSSMP (uncompressed)   PWSSMP (compressed)
             Time   Opcount      Time   Opcount      Time   Opcount          Time   Opcount
Algor-big    189.   2.67e+13     8.93   2.57e+13     212.   2.34e+13         72.2   2.33e+13
Alsim-b      45.6   1.07e+11     -      -            58.4   1.35e+11         61.4   1.36e+11
Audikw 1     207.   5.06e+12     8.29   5.75e+12     90.0   4.94e+12         18.7   5.17e+12
G3 circuit   39.6   6.16e+10     5.03   7.17e+10     14.1   6.85e+10         15.4   6.83e+10
Lmco         272.   3.06e+12     8.24   3.79e+12     73.2   3.08e+12         23.1   3.25e+12
Mstamp-2c    175.   1.61e+13     7.84   2.74e+13     139.   1.62e+13         26.4   1.62e+13
Nastran-b    173.   2.61e+12     11.1   3.48e+12     83.8   2.78e+12         27.4   3.61e+12
Nd24k        103.   2.04e+12     6.91   2.10e+12     140.   2.10e+12         150.   2.10e+12
Sgi 1M       215.   7.10e+12     13.3   8.81e+12     132.   7.54e+12         31.9   7.59e+12
Ten-b        169.   3.96e+13     10.8   3.60e+13     202.   2.88e+13         40.1   2.87e+13
Thermal2     28.8   1.66e+10     1.38   1.64e+10     11.1   1.73e+10         12.6   1.68e+10
Torso        18.3   1.20e+11     1.30   1.38e+11     4.88   1.29e+11         5.82   1.29e+11
Geom. mean   109.   1.91e+12     6.15   2.20e+12     61.4   1.89e+12         26.7   1.95e+12

Table 10: Ordering statistics for ParMETIS, PT-SCOTCH, and PWSSMP on 256 processes (P = 256). Time is in
seconds; Opcount is the factorization operation count for the computed ordering. A '-' marks a failure.
Matrix       ParMETIS            PT-SCOTCH           PWSSMP (uncompressed)   PWSSMP (compressed)
             Time   Opcount      Time   Opcount      Time   Opcount          Time   Opcount
Algor-big    255.   2.26e+13     7.23   3.38e+13     169.   2.29e+13         79.5   2.31e+13
Alsim-b      142.   1.03e+11     10.2   1.81e+11     59.6   1.34e+11         62.6   1.34e+11
Audikw 1     248.   4.93e+12     5.33   5.54e+12     64.6   4.97e+12         19.8   5.11e+12
G3 circuit   62.5   5.15e+10     4.65   8.41e+10     14.7   6.57e+10         16.1   6.54e+10
Lmco         -      -            5.91   3.67e+12     74.8   3.00e+12         23.2   3.19e+12
Mstamp-2c    212.   1.60e+13     6.69   1.74e+13     141.   1.61e+13         27.5   1.62e+13
Nastran-b    327.   2.54e+12     6.67   3.27e+12     75.9   2.64e+12         27.9   3.48e+12
Nd24k        -      -            -      -            139.   2.03e+12         144.   2.10e+12
Sgi 1M       389.   7.13e+12     7.76   9.25e+12     95.5   7.16e+12         32.5   7.54e+12
Ten-b        298.   2.91e+13     7.83   3.27e+13     148.   2.86e+13         41.1   2.86e+13
Thermal2     60.3   1.32e+10     1.65   1.78e+10     11.4   1.71e+10         12.9   1.68e+10
Torso        18.7   1.20e+11     1.58   1.33e+11     5.16   1.28e+11         6.13   1.28e+11
Geom. mean   148.   1.23e+12     5.14   1.61e+12     49.3   1.34e+12         25.8   1.39e+12

Table 11: Ordering statistics for ParMETIS, PT-SCOTCH, and PWSSMP on 1024 processes (P = 1024). Time is in
seconds; Opcount is the factorization operation count for the computed ordering. A '-' marks a failure.
summarizes these results graphically.
The results in Tables 8–11 and Figure 3 show that all the ordering packages have limitations. PWSMP is
robust (no failures in any of the tests) and generates good-quality orderings in reasonable time; however, as
stated earlier, its parallel ordering strategy is not scalable in terms of either ordering time or memory. Having
said that, it appears to have a fairly memory-frugal implementation relative to PT-SCOTCH (which runs out
of memory for one test case on 16 nodes) and ParMETIS (which runs out of memory for two test cases on
1024 nodes). In terms of ordering quality, PWSMP without graph compression generates orderings with the
least operation count on 16 and 64 nodes. ParMETIS catches up as the number of nodes increases, matching
PWSMP's quality on 256 nodes and exceeding it on 1024. However, this improvement in the quality of ParMETIS
comes at the cost of a runtime that increases with the number of nodes. The PT-SCOTCH ordering is the fastest and is
the only one that achieves meaningful speedups. However, the factorization operation count of the orderings generated
by PT-SCOTCH is consistently worse than that of PWSMP by roughly 20%.
Acknowledgements
This work was supported by King Abdullah University of Science and Technology (KAUST).
References
[1] Patrick R. Amestoy, Iain S. Duff, Jacko Koster, and J. Y. L’Excellent. A fully asynchronous multifrontal solver using distributed dynamic scheduling. SIAM Journal on Matrix Analysis and Applications,
23(1):15–41, 2001.
[2] Patrick R. Amestoy, Iain S. Duff, and J. Y. L'Excellent. Multifrontal parallel distributed symmetric and
unsymmetric solvers. Computer Methods in Applied Mechanics and Engineering, 184:501–520, 2000.
[Figure 3 plots, for 16, 64, 256, and 1024 BG/P nodes: the ordering time and factorization operation count of ParMETIS, PT-SCOTCH, and PWSMP (with and without compression) relative to those of the uncompressed PWSMP ordering (left scale), together with the number of failures of each package (right scale).]
Figure 3: A comparison of the speed, quality, and robustness of ParMETIS, PT-SCOTCH, and PWSMP orderings. The time and operation count of factorization is relative to that of the uncompressed PWSMP ordering.
[3] C. Chevalier and F. Pellegrini. PT-Scotch: A tool for efficient parallel graph ordering. Parallel Computing,
34(6-8):318–331, 2008.
[4] Timothy A. Davis. The university of Florida sparse matrix collection. Technical report, Department of
Computer Science, University of Florida, Jan 2007.
[5] Alan George. Nested dissection of a regular finite-element mesh. SIAM Journal on Numerical Analysis,
10:345–363, 1973.
[6] Anshul Gupta. WSMP: Watson sparse matrix package (Part-I: Direct solution of symmetric sparse systems). Technical Report RC 21886, IBM T. J. Watson Research Center, Yorktown Heights, NY, November
2000. http://www.cs.umn.edu/˜agupta/wsmp.
[7] George Karypis and Vipin Kumar. ParMETIS: Parallel graph partitioning and sparse matrix ordering
library. Technical Report TR 97-060, Department of Computer Science, University of Minnesota, 1997.
[8] Gary Lakner and Carlos P. Sosa. IBM System Blue Gene Solution: Blue Gene/P Application Development.
IBM, 2008. http://www.redbooks.ibm.com/abstracts/sg247287.html.
[9] Jürgen Schulze. Towards a tighter coupling of bottom-up and top-down sparse matrix ordering methods.
BIT Numerical Mathematics, 41(4):800–841, 2001.
[10] Chris Walshaw and M. Cross. JOSTLE: Parallel multilevel graph-partitioning software—an overview. In
F. Magoules, editor, Mesh Partitioning Techniques and Domain Decomposition Techniques, pages 27–58.
Civil-Comp Ltd., 2007.