IBM Research Report RC 25008 (W1006-029), June 9, 2010
Computer Science/Mathematics

An Evaluation of Parallel Graph Partitioning and Ordering Softwares on a Massively Parallel Computer

Anshul Gupta
Business Analytics and Mathematical Sciences
IBM T. J. Watson Research Center
1101 Kitchawan Road, Yorktown Heights, NY 10598, USA
[email protected]

Abstract

We empirically study state-of-the-art parallel graph partitioning and sparse matrix ordering software packages, comparing their speed, quality, and robustness. For a model case, in which good partitionings (even optimal partitionings, in some cases) can be constructed manually, we compare the size of the edge cuts of the manual partitions with that of the partitions generated by the multilevel heuristics that are at the heart of modern graph partitioning software packages. We show that the quality of the partitions generated by the software is only slightly worse than that of the manual partitions for this class of model graphs. We discuss the shortcomings of the current ordering software and argue that there is an urgent need for more robust, scalable, and high-quality software for sparse matrix ordering to support the scalable solution of sparse linear systems by direct methods on massively parallel computers.

1 Introduction

This report contains the results of a concise experimental study of two state-of-the-art parallel graph partitioning software packages and three parallel sparse matrix ordering packages on the Blue Gene/P (BG/P) [8]. For our study, we use a set of a dozen graphs derived from symmetric sparse matrices of moderate sizes.
These matrices, listed in Table 1, were obtained from the University of Florida collection [4]. We are aware of three distributed-memory parallel software packages for graph partitioning, namely ParMETIS [7], PT-SCOTCH [3], and JOSTLE [10]. Since neither the source code nor binaries compiled for our target machine, BG/P, are available for JOSTLE, we restricted the study to ParMETIS and PT-SCOTCH. We are aware of another parallel sparse matrix ordering software package known as PORD [9], but did not include it in the study. PORD does not contain graph partitioning software, and based on our earlier experience with MUMPS [1, 2] (a parallel sparse linear system solver package that uses PORD ordering as the default), the quality of the orderings produced by the parallel Watson Sparse Matrix Package (WSMP) [6] is generally better. Therefore, we have included ordering results from the latter for comparison with ParMETIS and PT-SCOTCH. We use Version 9.9 of WSMP, Version 3.2 of ParMETIS, and Version 5.2.9 of PT-SCOTCH.

Matrix       N          NNZ          Application/Origin
Algor-big    1,073,724   84,317,460  Static stress analysis
Alsim-b      5,978,665   29,640,547  Power network analysis
Audikw 1       943,695   76,651,847  Automotive crankshaft modeling
G3 circuit   1,585,478    7,660,826  Circuit simulation
Lmco           665,017  107,514,163  Structural analysis
Mstamp-2c      902,289   70,925,391  Metal stamping
Nastran-b    1,508,088  111,614,436  Structural analysis
Nd24k           72,000   28,715,634  3-D mesh problem
Sgi 1M       1,522,431  125,755,875  Structural analysis
Ten-b        1,371,166  108,009,680  3-D metal forming
Thermal2     1,228,045    8,580,313  Steady-state thermal problem
Torso          201,142    3,161,120  Human torso modeling

Table 1: SPD test matrices with their order (N) and number of non-zeros (NNZ).

2 Graph Partitioning Results

In this section, we compare ParMETIS and PT-SCOTCH for graph partitioning. Both packages employ multilevel heuristics.
In multilevel partitioning, a graph is first coarsened by successively finding matchings and then eliminating each matched edge by coalescing the vertices that it connects. After a sufficient number of coarsening steps, when the graph has been reduced to a size below a predetermined threshold, it is partitioned. Finally, the coarsening is undone one level at a time, while refining the partitioning at each step. When the graph is fully uncoarsened to its original state, the partitioning is complete after the final refinement step. More detailed descriptions of the parallel multilevel partitioning heuristics can be found in the relevant literature [3, 7]. A key difference between the partitioning strategies of ParMETIS and PT-SCOTCH is that ParMETIS performs only one cycle of coarsening and refining irrespective of the number of partitions desired, whereas PT-SCOTCH performs log2(k) cycles of coarsening and refining to partition a graph into k parts. This is because ParMETIS partitions the coarsest graph into k parts and refines the complete k-way partitioning at each uncoarsening step, whereas PT-SCOTCH computes only a bisection, or 2-way partition, at a time. It then bisects the partitions recursively until the desired number of partitions is computed. Clearly, this entails log2(k) coarsening and refining steps.

2.1 Comparison of partitioning quality and speed

Tables 2–5 show the results of partitioning the graphs in our test suite on 16, 64, 256, and 1024 BG/P nodes. In each table, we report the partitioning time and quality of ParMETIS and PT-SCOTCH for partitioning the graphs into 2, P, and 2P parts, where P is the number of nodes. We measure and report two metrics for the quality of a partitioning. These are the total edge cut, or the number of edges that connect vertices belonging to different partitions, and the maximum imbalance, or the percentage by which the size of the largest partition exceeds the average partition size. Both partitioners accept a user-defined threshold of acceptable maximum imbalance.
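As a concrete reference for these two quality metrics, the following sketch (a hypothetical helper, not part of either package) computes the total edge cut and the maximum imbalance of a given partition vector:

```python
# Hypothetical helper (not from ParMETIS or PT-SCOTCH): computes the two
# partition-quality metrics used throughout Section 2.
def partition_quality(edges, part):
    """edges: iterable of (u, v) vertex pairs; part: list mapping vertex -> partition id.

    Returns (edge_cut, max_imbalance_percent)."""
    # Total edge cut: edges whose endpoints lie in different partitions.
    cut = sum(1 for u, v in edges if part[u] != part[v])
    # Partition sizes (number of vertices per partition).
    sizes = {}
    for p in part:
        sizes[p] = sizes.get(p, 0) + 1
    avg = sum(sizes.values()) / len(sizes)
    # Percentage by which the largest partition exceeds the average size.
    max_imbalance = 100.0 * (max(sizes.values()) - avg) / avg
    return cut, max_imbalance
```

For example, a 4-cycle split evenly into two parts has an edge cut of 2 and zero imbalance, while a 3-versus-1 split of the same cycle has a 50% maximum imbalance.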
We used a threshold of 5% in all our experiments. Default values are used for all other parameters for both packages. A hyphen in a table indicates a failure of the configuration corresponding to that entry, either because the partitioner crashed or because it ran out of memory. For a ready comparison of the two packages, the bottom of each table contains geometric means of the edge cut, maximum imbalance, and partitioning time over those graphs for which neither partitioner failed.

             ParMETIS
Matrix       2 parts              16 parts             32 parts
Algor-big    425039   4.47% 2.94  1984596  4.39% 3.03  2720681  4.78% 3.29
Alsim-b      4081     0.14% 6.67  26618    2.16% 6.81  38260    3.92% 6.86
Audikw 1     106853   1.03% 9.56  1357326  4.66% 10.4  2027681  4.98% 10.9
G3 circuit   1994     0.00% 1.56  14039    3.68% 1.60  19295    4.88% 1.66
Lmco         154301   0.99% 3.00  2481785  5.00% 3.87  3962014  5.03% 4.26
Mstamp-2c    382815   0.00% 2.46  1834896  4.78% 2.63  2458087  4.82% 2.89
Nastran-b    103517   0.63% 16.1  973744   4.99% 16.9  1499380  4.99% 17.4
Nd24k        795640   2.37% 3.02  3061013  5.02% 4.21  3899140  5.33% 4.64
Sgi 1M       134640   1.60% 22.0  1591912  4.99% 23.2  2558298  4.87% 24.0
Ten-b        348777   2.76% 4.19  2175602  4.57% 4.56  3182763  4.87% 4.82
Thermal2     1087     0.01% 1.42  13995    3.54% 1.51  22310    3.99% 1.55
Torso        10440    4.43% 0.38  54662    4.85% 0.41  80629    5.00% 0.44
Geom. mean   46608    1.27% 3.70  370839   4.39% 4.09  544540   4.79% 4.32

             PT-SCOTCH
Matrix       2 parts              16 parts             32 parts
Algor-big    -                    -                    -
Alsim-b      2680     0.00% 5.86  15916    2.54% 16.0  24537    2.91% 19.8
Audikw 1     107217   0.85% 9.24  1367289  4.39% 30.0  2072169  5.17% 40.7
G3 circuit   1244     1.80% 1.30  9914     3.44% 3.80  15152    3.88% 4.76
Lmco         145761   1.46% 4.20  2298571  5.25% 18.1  3975585  6.30% 34.0
Mstamp-2c    379260   0.11% 4.53  1761246  4.86% 14.6  2446541  3.84% 21.3
Nastran-b    86274    1.27% 14.5  899577   4.90% 43.5  1491046  5.95% 56.8
Nd24k        701355   2.00% 4.88  3019536  6.49% 13.4  4067957  7.56% 14.3
Sgi 1M       126252   1.98% 19.1  1509588  5.92% 54.9  2466981  6.96% 70.7
Ten-b        355428   1.09% 5.41  2140359  4.61% 20.7  3096725  5.36% 34.1
Thermal2     1054     1.19% 0.98  13080    4.05% 2.81  20408    4.30% 3.34
Torso        9772     1.99% 0.51  52758    5.73% 1.62  79912    6.80% 2.43
Geom. mean   41016    1.25% 4.06  331369   4.74% 12.8  506780   5.37% 17.4

Table 2: Partitioning statistics for ParMETIS and PT-SCOTCH on 16 processes (P = 16). Each entry gives the edges cut, the maximum imbalance, and the partitioning time in seconds; a hyphen marks a failed run.

The comparison in Table 2 shows that the quality of the PT-SCOTCH partitions is better than that of the ParMETIS partitions on 16 nodes. As expected, PT-SCOTCH takes longer to compute 16 or 32 partitions because of its recursive bisection strategy, unlike the single-step multiway partitioning strategy used in ParMETIS. The results in Table 3 for 64 nodes are similar to those in Table 2. A notable difference, however, is that the maximum imbalance in PT-SCOTCH partitioning into 64 and 128 parts exceeds the 5% threshold. The cause of the growing imbalance in PT-SCOTCH lies in its recursive bisection strategy. PT-SCOTCH treats each bisection as an independent partitioning problem. A given bisectioning problem does not take into account whether it is partitioning the smaller or the larger partition from the previous bisection. Ideally, the balancing requirements on the larger partition need to be more stringent in order to guarantee that the imbalance in the final partitioning does not exceed the user-defined threshold. In the absence of an adaptive management of imbalance in each bisection computation, the imbalance gets compounded at each level of bisection. Tables 4 and 5 show that this phenomenon gets more pronounced as the number of nodes and partitions increases.
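The compounding effect described above is easy to quantify with a back-of-the-envelope model (our simplification for illustration, not taken from either package's documentation): if every independent bisection may leave its larger side a factor (1 + eps) above the ideal half, those factors multiply along the log2(k) levels of the recursion.

```python
# Illustrative arithmetic only (assumed model, not measured behavior): each
# bisection may leave its larger half a factor (1 + eps) above the ideal half,
# and these factors compound multiplicatively down the recursion tree.
def worst_case_imbalance(eps, k):
    """Worst-case excess of the largest part over the average, as a fraction,
    after recursively bisecting into k parts (k a power of two)."""
    levels = k.bit_length() - 1          # log2(k)
    return (1 + eps) ** levels - 1
```

Under this model a per-bisection tolerance of 5% (eps = 0.05) can compound to roughly 34% after the six bisection levels needed for 64 parts, which is consistent in spirit with the growth of the PT-SCOTCH imbalance observed in Tables 3–5.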
Although the size of the edge cuts produced by PT-SCOTCH remains smaller than those produced by ParMETIS, the high imbalance renders PT-SCOTCH partitioning into more than 64 parts impractical for real applications.

Figure 1: ParMETIS and PT-SCOTCH comparison when the number of partitions is the same as the number of MPI processes. (Plotted versus the number of BG/P nodes used, P = 16 to 1024: the PT-SCOTCH to ParMETIS edge-cut ratio, the percentage imbalance of each partitioner relative to the 5% threshold, and the partitioning times of both packages.)

Figure 1 contains a graphical depiction of the geometric-mean data from Tables 2–5. Note that for 1024 partitions on 1024 MPI processes, the geometric mean of the imbalance for ParMETIS too seems to go beyond 5%.

             ParMETIS
Matrix       2 parts              64 parts             128 parts
Algor-big    446459   1.84% 1.18  3665634  4.94% 1.38  4802692  4.87% 1.48
Alsim-b      4036     0.03% 2.05  48122    3.51% 2.22  64533    4.53% 2.29
Audikw 1     109180   0.50% 2.58  2965230  4.99% 3.13  4162268  5.00% 3.38
G3 circuit   1874     2.27% 0.61  33245    4.54% 0.67  53040    4.28% 0.70
Lmco         135529   0.43% 1.28  5751471  5.00% 1.82  8079937  5.27% 2.08
Mstamp-2c    384021   0.85% 1.08  3244471  4.98% 1.20  4313942  4.99% 1.29
Nastran-b    88479    0.76% 6.37  2225284  4.99% 7.11  3263959  5.01% 7.50
Nd24k        695354   1.60% 6.37  5303531  5.42% 9.87  6993334  7.56% 12.1
Sgi 1M       131417   1.46% 7.05  3984616  5.00% 7.99  5717904  5.00% 8.42
Ten-b        365490   0.54% 1.64  4377504  4.65% 1.90  5672840  4.99% 2.01
Thermal2     1290     0.00% 0.50  32837    4.53% 0.56  49497    4.46% 0.60
Torso        11320    2.83% 0.20  114436   4.61% 0.24  156181   4.94% 0.25
Geom. mean   55357    1.09% 1.58  891198   4.76% 1.88  1243358  5.08% 2.03

             PT-SCOTCH
Matrix       2 parts              64 parts             128 parts
Algor-big    427831   0.42% 3.39  3660252  10.1% 13.3  4996580  11.1% 16.4
Alsim-b      2414     0.90% 2.61  32241    4.70% 10.4  47685    5.68% 11.3
Audikw 1     108981   1.03% 3.46  2986228  8.85% 16.1  4241224  9.46% 18.9
G3 circuit   1333     0.05% 0.50  22183    5.47% 2.16  36881    5.51% 2.58
Lmco         129529   0.96% 2.59  5780260  8.59% 15.9  8195493  9.69% 18.8
Mstamp-2c    376713   0.31% 3.00  3195067  7.40% 11.4  4453477  7.97% 14.0
Nastran-b    90180    1.47% 5.52  2253226  7.99% 23.3  3255261  9.07% 28.0
Nd24k        689704   2.00% 2.79  5404439  11.6% 7.93  6979632  12.7% 8.17
Sgi 1M       133236   0.33% 6.10  3887558  8.74% 25.2  5747958  9.83% 30.8
Ten-b        354393   0.63% 3.32  4337095  7.15% 17.1  5725184  7.75% 20.8
Thermal2     1077     0.46% 0.34  32545    6.93% 1.62  47185    7.59% 1.81
Torso        10126    1.14% 0.31  113099   8.37% 1.25  157246   9.51% 1.44
Geom. mean   49832    0.81% 1.99  831423   7.99% 8.49  1183434  8.82% 9.92

Table 3: Partitioning statistics for ParMETIS and PT-SCOTCH on 64 processes (P = 64). Each entry gives the edges cut, the maximum imbalance, and the partitioning time in seconds.

             ParMETIS
Matrix       2 parts              256 parts             512 parts
Algor-big    447753   0.95% 0.41  6252550   5.00% 0.56  7927361   5.00% 0.68
Alsim-b      3145     0.00% 1.17  84722     4.55% 1.38  118210    4.86% 1.48
Audikw 1     112311   0.26% 1.08  5710116   5.31% 1.40  7524868   5.31% 1.57
G3 circuit   1878     0.00% 0.60  78502     3.74% 0.69  105678    3.43% 0.70
Lmco         144141   0.11% 0.50  10920589  5.25% 0.85  14247071  5.26% 1.07
Mstamp-2c    412146   0.29% 0.34  5508116   5.12% 0.49  7063156   5.20% 0.60
Nastran-b    96904    0.14% 2.32  4773917   5.16% 2.28  6611421   5.28% 3.06
Nd24k        730119   1.29% 6.33  8684320   10.9% 10.4  11333968  27.3% 11.1
Sgi 1M       135873   1.21% 2.80  7651743   5.03% 3.38  10200306  5.70% 3.72
Ten-b        366462   1.07% 0.53  7319344   5.02% 0.72  9400789   5.19% 0.89
Thermal2     1129     0.00% 0.59  70447     4.69% 0.71  100101    4.43% 0.71
Torso        13509    1.57% 0.29  213203    4.87% 0.35  281944    5.13% 0.43
Geom. mean   52173    0.61% 0.72  1581794   4.88% 0.89  2100704   4.96% 1.04

             PT-SCOTCH
Matrix       2 parts              256 parts             512 parts
Algor-big    425736   0.82% 1.98  6435014   13.2% 7.62  8085763   14.0% 8.30
Alsim-b      -                    -                     -
Audikw 1     104535   0.84% 1.58  5790781   12.6% 8.98  7707901   12.6% 9.56
G3 circuit   1332     1.92% 0.30  55433     9.05% 1.49  80242     9.57% 1.62
Lmco         -                    -                     -
Mstamp-2c    384759   1.02% 1.86  5700756   10.6% 8.54  7317206   10.7% 7.55
Nastran-b    92358    0.29% 2.11  4669752   10.1% 11.4  6598890   11.1% 12.4
Nd24k        -                    -                     -
Sgi 1M       128457   2.00% 2.47  7729759   15.7% 12.5  10351402  16.9% 13.4
Ten-b        346923   1.37% 2.11  7489539   15.0% 10.3  9753346   15.0% 11.3
Thermal2     1126     2.00% 0.21  71706     10.3% 1.10  99916     10.9% 1.17
Torso        9941     0.21% 0.25  214744    10.1% 1.15  285206    11.2% 1.19
Geom. mean   46672    1.16% 1.00  1541061   11.8% 4.90  2068988   12.4% 5.15

Table 4: Partitioning statistics for ParMETIS and PT-SCOTCH on 256 processes (P = 256). Each entry gives the edges cut, the maximum imbalance, and the partitioning time in seconds; a hyphen marks a failed run.
Figure 2: ParMETIS and PT-SCOTCH comparison for graph bisection. (The plot shows the same quality and time metrics as Figure 1, for 2-way partitions, versus the number of BG/P nodes used.)

However, a closer look at Table 5 reveals that the statistics are skewed by a single bad case: graph Nd24k, which has only 72,000 vertices and would have very few vertices per partition in a 1024-way partitioning. Figure 2 shows similar statistics for graph bisection (2-way partitioning) as the number of MPI processes increases from 16 to 1024. This figure, along with Table 2, shows that the partition quality obtained by PT-SCOTCH is slightly better than that obtained from ParMETIS for a small number of partitions (irrespective of the number of MPI processes).

2.2 Heuristic versus manual partitions for a model problem

In this section, we consider partitioning regular unweighted 3-D graphs with a 7-point cubic stencil, where each vertex is connected to its immediate neighbors in both directions along the x, y, and z axes. Tables 6 and 7 compare the edge cuts of partitions generated by ParMETIS with balanced manual partitions created simply by introducing cutting planes along each dimension so as to maintain load balance and the best possible aspect ratio of the dimensions of the partitions. It can be observed that the ratio of the sizes of the cuts produced by ParMETIS to those generated manually is close to 1.5 for a wide range of graph sizes, partition sizes, and numbers of MPI processes. The ratio deteriorates slightly as the size of the graph increases, and improves very slightly as the number of partitions increases. In fact, a close observation reveals that the ratio is almost constant for a given average number of vertices per partition.
             ParMETIS
Matrix       2 parts              1024 parts            2048 parts
Algor-big    475753   0.80% 0.32  10074683  5.29% 0.57  12667459  5.86% 0.96
Alsim-b      3773     0.55% 2.30  171165    4.80% 2.59  239776    4.68% 2.87
Audikw 1     109746   0.11% 0.81  9745613   5.58% 1.15  12534227  7.21% 2.36
G3 circuit   2156     0.00% 1.26  140353    4.24% 1.04  188705    5.92% 1.49
Lmco         136665   0.48% 1.77  18519743  7.02% 1.50  23575523  7.48% 2.84
Mstamp-2c    413838   0.72% 0.45  8987618   5.54% 0.78  11280209  7.36% 1.07
Nastran-b    98559    0.04% 7.69  9004369   5.65% 8.06  12076573  6.47% 8.11
Nd24k        723707   1.34% 6.55  12348960  39.4% 10.9  13154160  39.4% 11.0
Sgi 1M       145908   1.24% 9.44  13429147  5.87% 8.80  17272737  6.41% 10.1
Ten-b        380961   0.03% 0.46  11941106  5.37% 0.84  15030032  5.75% 1.38
Thermal2     1212     0.01% 2.05  141401    4.31% 2.32  197978    4.06% 2.42
Torso        12347    0.71% 0.52  365929    5.38% 0.72  469777    5.89% 1.01
Geom. mean   57919    0.50% 1.51  2905685   8.20% 1.90  3733212   8.87% 2.59

             PT-SCOTCH
Matrix       2 parts              1024 parts            2048 parts
Algor-big    419560   1.56% 1.80  10490912  14.4% 6.16  13142764  15.8% 6.20
Alsim-b      2458     0.49% 1.81  143195    9.34% 5.73  206839    9.24% 5.85
Audikw 1     113211   1.46% 1.26  9954688   14.8% 6.32  12761406  16.1% 6.36
G3 circuit   1483     2.00% 0.58  114184    9.47% 1.85  161962    9.93% 1.89
Lmco         130270   1.00% 1.15  18840372  16.4% 7.11  24082589  17.6% 7.17
Mstamp-2c    371628   0.89% 1.73  9277671   14.1% 5.47  11628226  16.9% 5.47
Nastran-b    88893    0.78% 1.44  8962402   16.0% 7.31  12093880  17.2% 7.42
Nd24k        694100   2.06% 2.70  11994364  18.0% 6.40  13097836  22.3% 6.42
Sgi 1M       132912   0.59% 1.71  13698648  13.6% 7.86  17561488  14.9% 7.95
Ten-b        359883   1.48% 1.70  12312997  14.5% 7.47  15521890  15.2% 7.61
Thermal2     1102     0.93% 0.51  148107    11.8% 1.44  203343    11.9% 1.47
Torso        10310    0.39% 0.42  369946    14.0% 1.47  475324    15.1% 1.47
Geom. mean   50590    1.14% 1.23  2857623   13.9% 4.61  3699231   15.2% 4.67

Table 5: Partitioning statistics for ParMETIS and PT-SCOTCH on 1024 processes (P = 1024). Each entry gives the edges cut, the maximum imbalance, and the partitioning time in seconds.

Table 7 shows that when the number of MPI processes is kept constant at 32 and the number of partitions is increased, the ratio improves somewhat faster than what we observe in Table 6. This shows that the partition quality declines very slightly with an increasing number of MPI processes.

Cube (n),           32-way           64-way           128-way          256-way          512-way           1024-way
Graph (n^3)
128, 2097152
  Metis             168722           232199           305206           396226           508356            635988
  Manual            114688           *147456          212992           278528           *344064           475136
  Ratio             1.47             1.57             1.43             1.42             1.48              1.34
256, 16777216
  Metis             692838           941707           1264127          1665213          2137405           2722783
  Manual            458752           *589824          851968           1114112          *1376256          1900544
  Ratio             1.51             1.60             1.48             1.49             1.55              1.43
448, 89915392
  Metis             2119976          2951874          3982187          5207300          6758690           8601565
  Manual            1404928          *1806336         2609152          3411968          *4214784          5820416
  Ratio             1.51             1.63             1.53             1.53             1.60              1.48
576, 191102976
  Metis             3543317          4946173          6743418          8787701          11267256          14477239
  Manual            2322432          *2985984         4313088          5640192          *6967296          9621504
  Ratio             1.53             1.66             1.56             1.56              1.62             1.50

Table 6: A comparison of the best (smallest possible) and ParMETIS edge cuts for unweighted 3-D graphs with a 7-point cubic stencil. The number of BG/P nodes used is the same as the number of partitions. A '*' indicates an optimal partition.

Cube (n),           32-way           64-way           128-way          256-way          512-way           1024-way
Graph (n^3)
256, 16777216
  Metis             692838           918806           1216633          1589311          2036719           2577758
  Manual            458752           *589824          851968           1114112          *1376256          1900544
  Ratio             1.51             1.56             1.43             1.43             1.48              1.36

Table 7: A comparison of the best (smallest possible) and ParMETIS edge cuts for unweighted 3-D graphs with a 7-point cubic stencil. 32 BG/P nodes are used in each case. A '*' indicates an optimal partition.
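The "Manual" entries in Tables 6 and 7 can be reproduced with a one-line calculation. Assuming the manual construction described above splits the n x n x n grid into px * py * pz equal boxes using full axis-aligned cutting planes, each plane severs exactly n*n edges of the 7-point stencil (the formula below is our reconstruction of that construction, not code from the report):

```python
# Sketch of the manual cutting-plane partition of an n x n x n 7-point grid
# into px * py * pz equal boxes: there are (px-1) + (py-1) + (pz-1) axis-aligned
# cutting planes, and each plane cuts exactly n*n edges.
def manual_cut(n, px, py, pz):
    return (px + py + pz - 3) * n * n

# 128^3 grid, 32 parts as a 4 x 4 x 2 box decomposition:
# manual_cut(128, 4, 4, 2) == 114688, matching the Manual column of Table 6.
```

The same formula reproduces the other Manual entries, e.g. a 4 x 4 x 4 decomposition of the 128^3 grid gives 147456 (the starred, optimal 64-way case).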
2.3 Conclusions from partitioning experiments

The results in Section 2.1 indicate that PT-SCOTCH outperforms ParMETIS in terms of partition quality for a small number of partitions. However, for a large number of partitions, ParMETIS is the only practical option. The results in Section 2.2 indicate that, at least for a 3-D model problem, the quality of the partitions generated by the parallel multilevel heuristics is worse than the optimal partitioning by only a small factor, and that this factor stays more or less constant over a wide range of graph sizes, numbers of MPI processes, and numbers of partitions.

             ParMETIS          PT-SCOTCH         PWSSMP uncompressed  PWSSMP compressed
Matrix       Time   Opcount    Time   Opcount    Time   Opcount       Time   Opcount
Algor-big    32.7   4.21e+13   -      -          228.   2.34e+13      83.4   2.31e+13
Alsim-b      36.8   1.31e+11   33.0   2.43e+11   68.9   1.52e+11      72.0   1.57e+11
Audikw 1     44.8   6.02e+12   43.2   5.99e+12   68.6   5.05e+12      19.6   5.20e+12
G3 circuit   10.1   6.46e+10   10.2   6.84e+10   17.7   6.94e+10      18.9   6.95e+10
Lmco         41.7   4.09e+12   34.3   4.05e+12   88.1   3.11e+12      24.4   3.27e+12
Mstamp-2c    30.4   2.80e+13   25.2   1.98e+13   154.   1.62e+13      27.8   1.62e+13
Nastran-b    59.7   3.20e+12   66.3   3.16e+12   87.6   2.70e+12      29.5   3.76e+12
Nd24k        71.2   2.04e+12   14.1   2.11e+12   131.   2.06e+12      143.   1.97e+12
Sgi 1M       77.1   8.98e+12   80.7   8.71e+12   107.   7.40e+12      34.1   7.84e+12
Ten-b        42.7   4.98e+13   36.3   3.59e+13   179.   2.91e+13      41.0   2.90e+13
Thermal2     8.64   1.64e+10   5.61   1.77e+10   12.8   1.75e+10      14.4   1.76e+10
Torso        6.34   1.40e+11   2.45   1.44e+11   5.50   1.27e+11      6.44   1.27e+11
Geom. mean   29.8   1.40e+12   21.6   1.41e+12   56.2   1.12e+12      28.4   1.25e+12

Table 8: Ordering statistics (ordering time in seconds and factorization operation count) for ParMETIS, PT-SCOTCH, and PWSSMP on 16 processes (P = 16). A hyphen marks a failed run.

3 Parallel Sparse Matrix Ordering

In this section, we compare the time and quality of the parallel fill-reducing orderings produced by ParMETIS, PT-SCOTCH, and parallel WSMP on the set of matrices described in Table 1.
ParMETIS and PT-SCOTCH use a distributed multilevel strategy to compute vertex separators of the graph of the matrix in order to generate a nested-dissection [5] type permutation of the vertices. In contrast, parallel WSMP generates each vertex separator sequentially. It first computes a node bisector of the entire graph on a single MPI process. The two subgraphs are subsequently processed independently in parallel, and the process continues recursively. Note that there are two potential disadvantages of this approach. First, only limited speedup from parallelism can be expected, because a significant amount of computation, that which goes into finding the first separator, is performed sequentially, and the parallelism increases only gradually as the size of the subgraphs decreases. The second problem is that this approach requires the entire graph to be stored on a single process to compute the first separator, and hence is not scalable with respect to memory, although this may not be a problem for highly compressible graphs (see the next paragraph).

Some of the matrices in our test suite have graphs with multiple degrees of freedom per node. Parallel WSMP ordering can automatically detect this and take it into account to reduce the memory and time requirements of ordering. On the other hand, the user must explicitly supply compressed-graph information to ParMETIS and PT-SCOTCH. In our experiments, we have used ParMETIS and PT-SCOTCH with the original uncompressed graphs, and parallel WSMP with the graph compression option turned both off and on. Default values for all other parameters were used, with the exception of the matching type for ParMETIS, which was set to PARMETIS_MTYPE_GLOBAL. Setting this to the default value of PARMETIS_MTYPE_LOCAL resulted in some improvement in runtime and some deterioration in the quality of the ordering. Tables 8–11 show the parallel ordering results for the four cases on 16 to 1024 BG/P nodes.
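The first drawback can be made concrete with a toy cost model (an assumption of ours for illustration, not a documented property of WSMP): suppose finding a separator in a subgraph of m vertices costs m time units, and level i of the recursion processes 2^i subgraphs of roughly m / 2^i vertices each, all in parallel when enough processes are available. The critical path is then dominated by the sequential top-level separator, so the achievable speedup grows only logarithmically with the number of leaf subgraphs.

```python
# Toy cost model (assumed, for illustration only): separator time proportional
# to subgraph size; level i has 2^i subgraphs of size m / 2^i, processed fully
# in parallel, so level i contributes m / 2^i time units to the critical path.
def ordering_time(m, levels):
    """Parallel ordering time with unlimited processes; always < 2*m."""
    return sum(m / 2 ** i for i in range(levels))

def max_speedup(m, levels):
    """Best possible speedup: total work divided by the critical path."""
    work = m * levels                  # every level touches about m vertices total
    return work / ordering_time(m, levels)
```

For example, with m = 1024 and 10 levels of recursion, the critical path is 2046 time units against 10240 units of total work, a speedup of only about 5 no matter how many processes are used, consistent with the limited-speedup observation above.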
The last row of each table shows the geometric means of the column values for all those matrices that all the packages were able to reorder.

             ParMETIS          PT-SCOTCH         PWSSMP uncompressed  PWSSMP compressed
Matrix       Time   Opcount    Time   Opcount    Time   Opcount       Time   Opcount
Algor-big    44.4   3.47e+13   17.5   4.08e+13   140.   2.29e+13      76.4   2.28e+13
Alsim-b      20.3   1.24e+11   -      -          58.0   1.46e+11      61.3   1.46e+11
Audikw 1     66.6   5.55e+12   17.1   5.86e+12   92.3   5.07e+12      18.8   5.17e+12
G3 circuit   16.0   6.21e+10   7.18   6.71e+10   14.4   7.10e+10      15.7   7.09e+10
Lmco         80.1   3.96e+12   17.2   4.08e+12   77.4   3.23e+12      23.3   3.19e+12
Mstamp-2c    51.6   2.21e+13   14.6   1.93e+13   141.   1.61e+13      27.2   1.62e+13
Nastran-b    58.9   2.89e+12   27.2   3.27e+12   84.3   2.78e+12      27.2   3.68e+12
Nd24k        99.0   2.03e+12   8.52   2.10e+12   130.   2.03e+12      129.   2.10e+12
Sgi 1M       71.8   7.91e+12   30.2   8.37e+12   97.8   7.33e+12      31.4   7.62e+12
Ten-b        47.6   4.40e+13   20.5   4.99e+13   209.   2.91e+13      39.6   2.88e+13
Thermal2     10.5   1.71e+10   2.44   1.82e+10   11.5   1.72e+10      13.0   1.74e+10
Torso        16.1   1.19e+11   1.48   1.49e+11   4.93   1.33e+11      5.87   1.33e+11
Geom. mean   41.7   2.14e+12   11.0   2.31e+12   58.2   1.90e+12      26.6   1.97e+12

Table 9: Ordering statistics (ordering time in seconds and factorization operation count) for ParMETIS, PT-SCOTCH, and PWSSMP on 64 processes (P = 64). A hyphen marks a failed run.

             ParMETIS          PT-SCOTCH         PWSSMP uncompressed  PWSSMP compressed
Matrix       Time   Opcount    Time   Opcount    Time   Opcount       Time   Opcount
Algor-big    189.   2.67e+13   8.93   2.57e+13   212.   2.34e+13      72.2   2.33e+13
Alsim-b      45.6   1.07e+11   -      -          58.4   1.35e+11      61.4   1.36e+11
Audikw 1     207.   5.06e+12   8.29   5.75e+12   90.0   4.94e+12      18.7   5.17e+12
G3 circuit   39.6   6.16e+10   5.03   7.17e+10   14.1   6.85e+10      15.4   6.83e+10
Lmco         272.   3.06e+12   8.24   3.79e+12   73.2   3.08e+12      23.1   3.25e+12
Mstamp-2c    175.   1.61e+13   7.84   2.74e+13   139.   1.62e+13      26.4   1.62e+13
Nastran-b    173.   2.61e+12   11.1   3.48e+12   83.8   2.78e+12      27.4   3.61e+12
Nd24k        103.   2.04e+12   6.91   2.10e+12   140.   2.10e+12      150.   2.10e+12
Sgi 1M       215.   7.10e+12   13.3   8.81e+12   132.   7.54e+12      31.9   7.59e+12
Ten-b        169.   3.96e+13   10.8   3.60e+13   202.   2.88e+13      40.1   2.87e+13
Thermal2     28.8   1.66e+10   1.38   1.64e+10   11.1   1.73e+10      12.6   1.68e+10
Torso        18.3   1.20e+11   1.30   1.38e+11   4.88   1.29e+11      5.82   1.29e+11
Geom. mean   109.   1.91e+12   6.15   2.20e+12   61.4   1.89e+12      26.7   1.95e+12

Table 10: Ordering statistics (ordering time in seconds and factorization operation count) for ParMETIS, PT-SCOTCH, and PWSSMP on 256 processes (P = 256). A hyphen marks a failed run.

             ParMETIS          PT-SCOTCH         PWSSMP uncompressed  PWSSMP compressed
Matrix       Time   Opcount    Time   Opcount    Time   Opcount       Time   Opcount
Algor-big    255.   2.26e+13   7.23   3.38e+13   169.   2.29e+13      79.5   2.31e+13
Alsim-b      142.   1.03e+11   10.2   1.81e+11   59.6   1.34e+11      62.6   1.34e+11
Audikw 1     248.   4.93e+12   5.33   5.54e+12   64.6   4.97e+12      19.8   5.11e+12
G3 circuit   62.5   5.15e+10   4.65   8.41e+10   14.7   6.57e+10      16.1   6.54e+10
Lmco         -      -          5.91   3.67e+12   74.8   3.00e+12      23.2   3.19e+12
Mstamp-2c    212.   1.60e+13   6.69   1.74e+13   141.   1.61e+13      27.5   1.62e+13
Nastran-b    327.   2.54e+12   6.67   3.27e+12   75.9   2.64e+12      27.9   3.48e+12
Nd24k        -      -          -      -          139.   2.03e+12      144.   2.10e+12
Sgi 1M       389.   7.13e+12   7.76   9.25e+12   95.5   7.16e+12      32.5   7.54e+12
Ten-b        298.   2.91e+13   7.83   3.27e+13   148.   2.86e+13      41.1   2.86e+13
Thermal2     60.3   1.32e+10   1.65   1.78e+10   11.4   1.71e+10      12.9   1.68e+10
Torso        18.7   1.20e+11   1.58   1.33e+11   5.16   1.28e+11      6.13   1.28e+11
Geom. mean   148.   1.23e+12   5.14   1.61e+12   49.3   1.34e+12      25.8   1.39e+12

Table 11: Ordering statistics (ordering time in seconds and factorization operation count) for ParMETIS, PT-SCOTCH, and PWSSMP on 1024 processes (P = 1024). A hyphen marks a failed run.

Figure 3 summarizes these results graphically. The results in Tables 8–11 and Figure 3 show that all the ordering packages have limitations. PWSMP is robust (no failures in any of the tests) and generates good-quality orderings in reasonable time; however, as stated earlier, its parallel ordering strategy is not scalable in terms of either ordering time or memory. Having said that, it seems to have a fairly memory-frugal implementation relative to PT-SCOTCH (which runs out of memory for one test case on 16 nodes) and ParMETIS (which runs out of memory for two test cases on 1024 nodes).
In terms of ordering quality, PWSMP without graph compression generates orderings with the least operation count on 16 and 64 nodes. ParMETIS catches up as the number of nodes increases, matching PWSMP's quality on 256 nodes and exceeding it on 1024. However, this improvement in the quality of ParMETIS orderings comes at the cost of a runtime that increases with the number of nodes. PT-SCOTCH ordering is the fastest and is the only one that obtains meaningful speedups. However, the factorization operation count of the orderings generated by PT-SCOTCH is consistently worse than that of PWSMP by roughly 20%.

Figure 3: A comparison of the speed, quality, and robustness of ParMETIS, PT-SCOTCH, and PWSMP (uncompressed and compressed) orderings on 16 to 1024 BG/P nodes. The time and operation count of factorization are relative to those of the uncompressed PWSMP ordering; the number of failures is also shown.

Acknowledgements

This work was supported by King Abdullah University of Science and Technology (KAUST).

References

[1] Patrick R. Amestoy, Iain S. Duff, Jacko Koster, and J.-Y. L'Excellent. A fully asynchronous multifrontal solver using distributed dynamic scheduling. SIAM Journal on Matrix Analysis and Applications, 23(1):15–41, 2001.

[2] Patrick R. Amestoy, Iain S. Duff, and J.-Y. L'Excellent. Multifrontal parallel distributed symmetric and unsymmetric solvers. Computer Methods in Applied Mechanics and Engineering, 184:501–520, 2000.

[3] C. Chevalier and F. Pellegrini. PT-Scotch: A tool for efficient parallel graph ordering. Parallel Computing, 34(6-8):318–331, 2008.

[4] Timothy A. Davis. The University of Florida sparse matrix collection. Technical report, Department of Computer Science, University of Florida, January 2007.

[5] Alan George. Nested dissection of a regular finite-element mesh. SIAM Journal on Numerical Analysis, 10:345–363, 1973.

[6] Anshul Gupta. WSMP: Watson sparse matrix package (Part-I: Direct solution of symmetric sparse systems). Technical Report RC 21886, IBM T. J. Watson Research Center, Yorktown Heights, NY, November 2000. http://www.cs.umn.edu/~agupta/wsmp.

[7] George Karypis and Vipin Kumar. ParMETIS: Parallel graph partitioning and sparse matrix ordering library. Technical Report TR 97-060, Department of Computer Science, University of Minnesota, 1997.

[8] Gary Lakner and Carlos P. Sosa. IBM System Blue Gene Solution: Blue Gene/P Application Development. IBM, 2008. http://www.redbooks.ibm.com/abstracts/sg247287.html.

[9] Jürgen Schulze. Towards a tighter coupling of bottom-up and top-down sparse matrix ordering methods. BIT Numerical Mathematics, 41(4):800–841, 2001.

[10] Chris Walshaw and M. Cross. JOSTLE: Parallel multilevel graph-partitioning software—an overview. In F. Magoules, editor, Mesh Partitioning Techniques and Domain Decomposition Techniques, pages 27–58. Civil-Comp Ltd., 2007.
