# Intel® Math Kernel Library (Intel® MKL) Parallel Direct Sparse Solver for Clusters

Alexander Kalinkin, Anton Anders, Roman Anders

# **Legal Disclaimer**

INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, refer to <a href="https://www.intel.com/software/products">www.intel.com/software/products</a>.

BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Atom, Centrino Atom Inside, Centrino Inside, Intel Inside, and other Intel marks are trademarks of Intel Corporation in the U.S. and other countries.

\*Other names and brands may be claimed as the property of others.

Copyright © 2014. Intel Corporation.

http://intel.com/software/products





# **Agenda**

- Problem statement
- Algorithm description
- Interface description
- Experiments
- Conclusions



### **Problem statement**



#### ✓ Cons

- No extra information is available about the matrix, only a few generic properties (positive definite, Hermitian, ...)
- Huge size

#### ✓ Pros

- Clusters with modern CPUs
- Intel® Math Kernel Library (Intel® MKL) with optimized BLAS, LAPACK, PARDISO functionality



# Algorithm (Ax=b)

Matrix reordering and symbolic factorization




# **Factorization step**

Matrix A after reordering (example with 4 leaves/processes)

[Figure: block structure of the reordered matrix A, with block rows and columns labeled A through G. A dashed arrow (- - >) means that an L-block updates an R-block, i.e. the right block depends on the left one.]

[Figure: tree representation of matrix A after reordering]



- Both tree and tree-node parallelization are used
- All computations within the node are based on functionality from Intel MKL
- Leaf computations and block updates proceed independently on each process
- Data is distributed uniformly across processes
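
To make the combination of tree and tree-node parallelism concrete, here is a minimal sketch in C. It is not taken from the solver: it assumes a complete binary tree stored heap-style, a power-of-two number of leaves, and leaf k owned by process k, so that each interior node is shared by the processes that own the leaves beneath it. The helper name `node_ranks` and this particular mapping are illustrative assumptions only.

```c
#include <assert.h>

/* Hypothetical helper (not an Intel MKL routine).
 * The tree is stored heap-style: the root is node 0 and the children of
 * node i are 2*i+1 and 2*i+2, so the leaves are nodes nleaves-1 .. 2*nleaves-2.
 * Assumption: leaf k (counted from the left) is owned by process rank k.
 * On return, [*first, *last] is the range of ranks that share `node`. */
static void node_ranks(int node, int nleaves, int *first, int *last)
{
    /* nleaves must be a power of two and node must be a valid index */
    assert(nleaves > 0 && (nleaves & (nleaves - 1)) == 0);
    assert(node >= 0 && node < 2 * nleaves - 1);

    int first_leaf = nleaves - 1;   /* heap index of the leftmost leaf */
    int lo = node;                  /* walks down to the leftmost leaf below node */
    int hi = node;                  /* walks down to the rightmost leaf below node */

    while (lo < first_leaf) {
        lo = 2 * lo + 1;
        hi = 2 * hi + 2;
    }
    *first = lo - first_leaf;
    *last  = hi - first_leaf;
}
```

With 4 leaves (7 tree nodes, matching the 7 blocks in the example above), the root is shared by ranks 0 to 3, its two children by ranks 0 to 1 and 2 to 3, and each leaf belongs to a single rank. Under this assumed mapping, leaves give tree parallelism, and shared upper nodes give tree-node parallelism.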










# Implementation of LU decomposition within a "node"





Dedicating one thread per process to data transfers allows us to "mask" those transfers behind computations

### **Current status/interface**

Supports various MPI implementations via BLACS in Intel MKL

#### C:

```
/* Intel MKL PARDISO (shared memory) */
{
    ...
    PARDISO (pt, &maxfct, &mnum, &mtype, &phase, &n, a, ia, ja, &idum, &nrhs, iparm, &msglvl, b, x, &error);
    ...
}

/* Cluster version: same arguments plus an MPI communicator converted to a Fortran handle */
{
    comm = MPI_Comm_c2f(MPI_COMM_WORLD);
    CPARDISO (pt, &maxfct, &mnum, &mtype, &phase, &n, a, ia, ja, &idum, &nrhs, iparm, &msglvl, b, x, comm, &error);
    ...
}
```

#### Fortran:

```
call PARDISO(pt, maxfct, mnum, mtype, phase, n, a, ia, ja, idum, nrhs, iparm, msglvl, b, x, error)
...

call CPARDISO(pt, maxfct, mnum, mtype, phase, n, a, ia, ja, idum, nrhs, iparm, msglvl, b, x, comm, error)
...
```
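
The `a`, `ia`, and `ja` arguments in the calls above describe the matrix in the three-array compressed sparse row (CSR) layout that PARDISO expects, with one-based indices by default. The sketch below is an illustration and does not call Intel MKL: it stores a small 3x3 non-symmetric matrix in that layout and multiplies it by a vector, which is a simple way to check that the arrays are built correctly. The names `csr_matvec_1based`, `ex_a`, `ex_ia`, and `ex_ja` are hypothetical.

```c
#include <assert.h>

/* Compute y = A*x for an n-by-n matrix held in three-array CSR form:
 *   a  - nonzero values, stored row by row
 *   ia - ia[i] is the one-based position in a/ja where row i starts;
 *        ia[n] is one past the last nonzero (n+1 entries in total)
 *   ja - one-based column index of each nonzero
 * Hypothetical helper for illustration, not an Intel MKL routine. */
static void csr_matvec_1based(int n, const double *a, const int *ia,
                              const int *ja, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        /* convert the one-based row range to zero-based array offsets */
        for (int k = ia[i] - 1; k < ia[i + 1] - 1; k++) {
            assert(ja[k] >= 1 && ja[k] <= n);
            sum += a[k] * x[ja[k] - 1];
        }
        y[i] = sum;
    }
}

/* Example matrix (5 nonzeros):
 *   [ 4 0 1 ]
 *   [ 0 3 0 ]
 *   [ 2 0 5 ]
 */
static const double ex_a[]  = {4.0, 1.0, 3.0, 2.0, 5.0};
static const int    ex_ia[] = {1, 3, 4, 6};
static const int    ex_ja[] = {1, 3, 2, 1, 3};
```

Calling `csr_matvec_1based(3, ex_a, ex_ia, ex_ja, x, y)` with `x = {1, 2, 3}` fills `y` with the product of the example matrix and `x`. The same three arrays would be passed as `a`, `ia`, and `ja` to the solver.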



# **Scalability of Intel MKL Parallel Direct Sparse Solver for Clusters**





Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, refer to <a href="http://www.intel.com/content/www/us/en/benchmarks/resources-benchmark-limitations.html">http://www.intel.com/content/www/us/en/benchmarks/resources-benchmark-limitations.html</a>

Refer to our Optimization Notice for more information regarding performance and optimization choices in Intel software products at: <a href="http://software.intel.com/en-ru/articles/optimization-notice/">http://software.intel.com/en-ru/articles/optimization-notice/</a>

\*Other brands and names are the property of their respective owners.

- Each node contains two Intel® Xeon® E5-2697 v2 processors (24 cores in total), 64GB RAM
- Intel® MKL 11.2 Beta

<sup>\*</sup>Here and below: T. A. Davis and Y. Hu, "The University of Florida Sparse Matrix Collection," ACM Transactions on Mathematical Software, Vol. 38, Issue 1, 2011, pp. 1:1-1:25. http://www.cise.ufl.edu/research/sparse/matrices





### **Scalability comparison**






- Each node contains two Intel® Xeon® E5-2697 v2 processors (24 cores in total), 64GB RAM
- Intel® MKL 11.2 Beta
- MUMPS\* version 4.10.0



# Performance comparison – Intel MKL speedup over MUMPS\*

MUMPS\* vs. Intel MKL Parallel Direct Sparse Solver for Clusters (time)

16 nodes, 12 threads per node (192 total cores)




- Each node contains two Intel® Xeon® E5-2697 v2 processors (24 cores in total), 64GB RAM
- Intel® MKL 11.2 Beta
- MUMPS\* version 4.10.0





# Performance comparison – Intel MKL speedup over MUMPS\*

MUMPS\* vs. Intel MKL Parallel Direct Sparse Solver for Clusters (time)

32 nodes, 24 threads per node (768 total cores)




- Each node contains two Intel® Xeon® E5-2697 v2 processors (24 cores in total), 64GB RAM
- Intel® MKL 11.2 Beta
- MUMPS\* version 4.10.0



# Performance comparison – Intel MKL speedup over MUMPS\*

MUMPS\* vs. Intel MKL Parallel Direct Sparse Solver for Clusters (time)

64 nodes, 24 threads per node (1536 total cores)




- Each node contains two Intel® Xeon® E5-2697 v2 processors (24 cores in total), 64GB RAM
- Intel® MKL 11.2 Beta
- MUMPS\* version 4.10.0





### **Conclusion**

- The Intel® MKL Parallel Direct Sparse Solver for Clusters, included in Intel MKL 11.2 Beta, delivers
  - Good scaling of computational time
  - Good scaling of memory per node
- On the Roadmap:
  - Implement pure MPI version
  - Parallelize reordering step
  - Implement Intel® Xeon Phi™ version



# Q & A



# **Legal Disclaimer & Optimization Notice**


Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Xeon Phi, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.

#### **Optimization Notice**

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804









# **Experiments** (scalability in time)




<sup>\*\*</sup>Here and below: T. A. Davis and Y. Hu, "The University of Florida Sparse Matrix Collection," ACM Transactions on Mathematical Software, Vol. 38, Issue 1, 2011, pp. 1:1-1:25. http://www.cise.ufl.edu/research/sparse/matrices






# **Experiments** (scalability in time)

#### 3Dspectralwave\*\*, material problem



X-axis: number of nodes; each node uses 16 threads




The factorization and solve steps scale well in both memory and performance.

Parallelizing the reordering step might produce a "worse" reordering, which would affect overall time. Deeper investigation is needed here.



# **Experiments** (balancing)





#### Long\_coup\_dt6\*\*





When the "tree" is non-uniform, there are several ways to divide the nodes of the tree among the computational nodes. None of them is "best" in every case, so to achieve good performance we switch between them at the reordering step.
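
As an illustration of what one such approach could look like, the sketch below splits a group of processes between two sibling subtrees in proportion to their estimated work. This is an assumed, simplified strategy and not necessarily one the solver uses. The helper name `split_procs` and the idea of a per-subtree work estimate are hypothetical.

```c
#include <assert.h>

/* Hypothetical helper (not an Intel MKL routine).
 * Split nproc processes between the left and right subtree of a tree node
 * in proportion to the estimated work in each subtree.
 * Returns the number of processes given to the left subtree; the right
 * subtree receives nproc minus that value. Each side gets at least one. */
static int split_procs(double work_left, double work_right, int nproc)
{
    assert(nproc >= 2);
    assert(work_left >= 0.0 && work_right >= 0.0);

    double total = work_left + work_right;
    int left;

    if (total > 0.0) {
        /* proportional share, rounded to the nearest integer */
        left = (int)(nproc * work_left / total + 0.5);
    } else {
        /* no work estimate available: fall back to an even split */
        left = nproc / 2;
    }

    /* never leave a subtree without a process */
    if (left < 1) {
        left = 1;
    }
    if (left > nproc - 1) {
        left = nproc - 1;
    }
    return left;
}
```

For a balanced tree this reduces to an even split. For a strongly unbalanced tree the heavier subtree gets most of the processes, while the lighter one still keeps at least one.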

# **Experiments** (scalability on memory)








# **Experiments (scalability on memory)**

**Additional processes decrease memory size per host!**






