Retargeting PLAPACK to Clusters with Hardware Accelerators
Manuel Fogué (1), Francisco Igual (1), Enrique S. Quintana-Ortí (1), Robert van de Geijn (2)

(1) Departamento de Ingeniería y Ciencia de los Computadores, University Jaume I, Castellón (Spain).
(2) Department of Computer Sciences, Texas Institute for Computational Engineering and Sciences, The University of Texas at Austin.

Retargeting PLAPACK to Clusters of GPUs — Francisco Igual

Motivation

Past, present and future of GPU computing:

  Systems with one GPU. Software: CUDA, OpenCL, OpenGL, Cg, ..., ATI Stream.
  Multi-GPU systems. Software: GPUSs, StarPU, FLAME SuperMatrix.
  Clusters of GPUs. Software: ?? Our approach: CUPLAPACK.

The future is today: China's new Nebulae supercomputer is No. 2 on the Top500 list (June 2010), powered by Tesla C2050 GPUs.

Outline

1 Introduction
2 PLAPACK
3 Porting PLAPACK to clusters of GPUs
4 Experimental results
5 Conclusions and future work

Introduction

PLAPACK: parallel dense linear algebra package for message-passing architectures (i.e., clusters).
CUPLAPACK: retargeting of PLAPACK to clusters with graphics processors.

Why a PLAPACK-based approach?

  Modularity: layered design.
  Programmability: FLAME methodology; object-based approach.

Goals:

1 Programmability: transparent port to clusters of GPUs.
2 Performance: clusters of GPUs with hundreds of nodes.
3 Scalability: keep performance with large problems.

PLAPACK overview

PLAPACK: Parallel Linear Algebra Package. "Library infrastructure for parallel implementation of linear algebra algorithms on distributed memory machines."

  Natural transition from algorithms to distributed-memory codes; focuses on programmability.
  Layered design; focuses on extensibility and flexibility.
  Physically Based Matrix Distribution (PBMD) and templates.

Cholesky factorization. Right-looking variant (I)

Cholesky factorization of a positive definite matrix A: A = L L^T.

Partition

    A = [ A11   *  ]       L = [ L11   0  ]
        [ A21  A22 ],          [ L21  L22 ],

where A11 and L11 are b x b matrices. From A = L L^T, we have that

    [ A11   *  ]   [ L11   0  ] [ L11^T  L21^T ]   [ L11 L11^T             *            ]
    [ A21  A22 ] = [ L21  L22 ] [   0    L22^T ] = [ L21 L11^T   L21 L21^T + L22 L22^T ].

Cholesky factorization. Right-looking variant (II)

This yields the equations

    L11 L11^T = A11,
    L21 = A21 L11^{-T},
    L22 L22^T = A22 - L21 L21^T,

which lead to the algorithm:

1 Partition A = [ A11 * ; A21 A22 ], where A11 is a b x b submatrix.
2 A11 := L11 such that A11 = L11 L11^T (Cholesky factor of A11).
3 A21 := L21 = A21 L11^{-T}.
4 A22 := A22 - L21 L21^T.
5 Continue recursively with A = A22 from step 1.

Using PLAPACK as a library. From algorithms to code

    int PLA_Chol( int b, PLA_Obj A )
    {
      PLA_Obj ABR = NULL, A11 = NULL, A21 = NULL;
      /* ... */

      /* View ABR = A */
      PLA_Obj_view_all( A, &ABR );

      while ( TRUE ) {
        /* Partition ABR = / A11 ||  *  \
                           | =========  |
                           \ A21 || ABR /
           where A11 is b x b */
        PLA_Obj_split_4( ABR, b, b, &A11, PLA_DUMMY,
                                    &A21, &ABR );

        /* A11 := L11 = Cholesky Factor( A11 ) */
        PLA_Local_chol( PLA_LOWER_TRIANGULAR, A11 );

        /* Update A21 := L21 = A21 * inv( L11' ) */
        PLA_Trsm( PLA_SIDE_RIGHT, PLA_LOWER_TRIANGULAR,
                  PLA_TRANSPOSE, PLA_NONUNIT_DIAG,
                  one, A11, A21 );

        /* Update A22 := A22 - L21 * L21' */
        PLA_Syrk( PLA_LOWER_TRIANGULAR, PLA_NO_TRANS,
                  minus_one, A21, one, ABR );
      }
      /* ... */
    }
Using PLAPACK as a library. Cholesky code

The CUPLAPACK version, CUPLA_Chol, is line-by-line identical to PLA_Chol; only the local computational routines are replaced by their GPU-accelerated counterparts:

    PLA_Local_chol( ... )  ->  CUPLA_Local_chol( ... )
    PLA_Trsm( ... )        ->  CUPLA_Trsm( ... )
    PLA_Syrk( ... )        ->  CUPLA_Syrk( ... )

GPU acceleration of PLAPACK

How to accelerate PLAPACK?

Naive implementation:
  Objects created in RAM; GPU-CPU transfers bound to calculations.
  Intercept calls to the local BLAS and transfer data to/from the GPU.
  Straightforward implementation.

Tuned implementation:
  Objects created on the GPU; GPU-CPU transfers bound to communications.
  Data is transferred to RAM only when necessary (for communication).
  Creation of objects involves allocation on the GPUs.
  Requires modifying the communication layer of PLAPACK (PLA_Copy, PLA_Reduce).

PLAPACK infrastructure

PLAPACK layered design. Necessary changes to use the GPU:

1 PLA_Local_BLAS: NVIDIA CUBLAS.
2 LA object manipulation: management of objects in GPU memory.
3 Communication layer: transfers to RAM prior to communications.

LA object manipulation. Cholesky driver using the GPU

    int main( void )
    {
      /* ... */

      /* Object creation on GPU */
      CUPLA_Matrix_create( MPI_FLOAT, size, size, templ,
                           PLA_ALIGN_FIRST, PLA_ALIGN_FIRST, &A_GPU );

      /* Object initialization on GPU */
      CUPLA_Obj_set_to_SPD_random( A_GPU );

      /* ... */

      /* Cholesky factorization on GPU */
      CUPLA_Chol( nb_alg, A_GPU );

      /* ... */

      /* Object destruction on GPU */
      CUPLA_Obj_free( &A_GPU );
    }

CUPLA_Matrix_create creates a linear algebra object on the GPU. CUPLA_Obj_free destroys an object on the GPU. CUPLA_Chol operates with objects created on the GPU.
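The naive approach described above — intercepting each local BLAS call and wrapping it with transfers — can be sketched as follows. This is a host-only model of the pattern, entirely our own: malloc/memcpy stand in for cudaMalloc and the CUBLAS set/get transfer routines, and device_scal stands in for a CUBLAS kernel; none of these names come from PLAPACK or CUPLAPACK.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static void device_scal(int n, double alpha, double *d_x) {
    for (int i = 0; i < n; i++)          /* stand-in for a CUBLAS kernel */
        d_x[i] *= alpha;
}

/* Intercepted local BLAS call: same interface as the CPU routine, but
   internally (1) copy operands to "device" memory, (2) run the
   accelerated kernel, (3) copy the result back -- on every invocation. */
int local_scal_intercepted(int n, double alpha, double *x) {
    double *d_x = malloc((size_t)n * sizeof *d_x);  /* "cudaMalloc"    */
    if (d_x == NULL) return -1;
    memcpy(d_x, x, (size_t)n * sizeof *d_x);        /* host -> device  */
    device_scal(n, alpha, d_x);                     /* compute on GPU  */
    memcpy(x, d_x, (size_t)n * sizeof *d_x);        /* device -> host  */
    free(d_x);                                      /* "cudaFree"      */
    return 0;
}
```

The per-call round trip is precisely the overhead the tuned implementation avoids by keeping objects resident on the GPU and moving data only around communications.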
Communication layer

Communication is restricted to the routines PLA_Copy and PLA_Reduce:

    int PLA_Copy( PLA_Obj obj_from, PLA_Obj obj_to )

Purpose: copy contents between linear algebra objects.

    IN      obj_from   object to be copied
    IN/OUT  obj_to     object into which to copy

    int PLA_Reduce( PLA_Obj obj_from, MPI_Op op, PLA_Obj obj_to )

Purpose: reduce, using op, the contents of the duplicated object obj_from and overwrite obj_to with the result.

    IN      obj_from   object to be reduced
    IN      op         reduce operator to be used
    IN/OUT  obj_to     object into which to reduce the result

Modifications to PLA_Copy (I)

    /* PLAPACK PLA_Copy between column-aligned matrices */
    while ( TRUE ) {
      PLA_Obj_split_size( from_cur, PLA_SIDE_TOP, &size_from, &owner_from );
      PLA_Obj_split_size( to_cur,   PLA_SIDE_TOP, &size_to,   &owner_to );

      if ( 0 == ( size = min( size_from, size_to ) ) ) break;

      PLA_Obj_horz_split_2( from_cur, size, &from_1, &from_cur );
      PLA_Obj_horz_split_2( to_cur,   size, &to_1,   &to_cur );

      if ( myrow == owner_from && owner_from == owner_to ) {
        PLA_Local_copy( from_1, to_1 );
      }
      else {
        if ( myrow == owner_from ) {
          PLA_Obj_get_local_contents( from_1, PLA_NO_TRANS, &dummy, &dummy,
                                      buffer_temp, size, 1 );
          MPI_Send( BF( buffer_temp ), size * local_width, datatype,
                    owner_to, 0, comm_col );
        }
        if ( myrow == owner_to ) {
          MPI_Recv( BF( buffer_temp ), size * local_width, datatype,
                    owner_from, MPI_ANY_TAG, comm_col, &status );
          PLA_Obj_set_local_contents( PLA_NO_TRANS, size, local_width,
                                      buffer_temp, size, 1, to_1 );
        }
      }
    }
    PLA_free( buffer_temp );

Modifications to PLA_Copy (II)

The CUPLAPACK routine CUPLA_Obj_get_local_contents shares the interface of its PLAPACK counterpart:

    int PLA_Obj_get_local_contents( PLA_Obj obj, int trans,
                                    int *rows_in_buf, int *cols_in_buf,
                                    void *buf, int ldim_buf, int stride_buf )

The PLAPACK version copies the local buffer column by column:

    for ( j = 0; j < n; j++ ) {
      tmp_local = buf_local + j * ldim_buf * typesize;
      tmp_obj   = buf_obj   + j * ldim     * typesize;
      memcpy( tmp_local, tmp_obj, m * typesize );
    }

whereas the CUPLAPACK version retrieves the data from GPU memory with a single strided transfer:

    cublasGetMatrix( m, n, typesize, buf_obj, ldim, buf_local, ldim_buf );

Necessary GPU-CPU transfers: data is transferred to RAM prior to a communication, and back to the GPU after each communication.
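The packing performed by the memcpy loop — and, across the PCIExpress bus, by cublasGetMatrix — amounts to copying an m x n column-major submatrix between buffers with different leading dimensions. A minimal self-contained sketch (copy_submatrix is our name, not a PLAPACK routine):

```c
#include <assert.h>
#include <string.h>

/* Pack an m x n column-major submatrix with leading dimension ld_src
   into a buffer with leading dimension ld_dst, one column at a time.
   typesize is the size of one matrix element in bytes. */
void copy_submatrix(int m, int n, size_t typesize,
                    const void *src, int ld_src, void *dst, int ld_dst) {
    const char *s = src;
    char *d = dst;
    for (int j = 0; j < n; j++)
        memcpy(d + (size_t)j * ld_dst * typesize,
               s + (size_t)j * ld_src * typesize,
               (size_t)m * typesize);
}
```

When ld_src == m == ld_dst the data is contiguous and the n memcpy calls could collapse into one, which is one reason a single strided cublasGetMatrix call beats a per-column transfer loop.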
Experimental results. The LONGHORN visualization cluster

                              Per node                          Per system (256 nodes)
    Number of cores           8                                 2,048
    CPU                       Intel Xeon Nehalem @ 2.53 GHz
    Available memory          48 GBytes                         13.5 TBytes
    Interconnection network   QDR InfiniBand
    Graphics system           NVIDIA Quadro Plex S4             128 NVIDIA Quadro Plex S4s
    GPUs                      2 x NVIDIA Quadro FX5800          512 x NVIDIA Quadro FX5800
    Interconnection bus       PCIExpress 2.0 (8x)
    Available video memory    8 GBytes DDR3                     2 TBytes DDR3
    Peak performance (SP)     161.6 GFLOPS                      41.40 TFLOPS
    Peak performance (DP)     80.8 GFLOPS                       20.70 TFLOPS

Table: Detailed features of the LONGHORN cluster.

Software: PLAPACK version R32; BLAS: NVIDIA CUBLAS 2.2 and MKL 10.1; MPI: MVAPICH 1.4.

GEMM: performance

Figure: Performance (GFLOPS) of CUPLAPACK for the matrix-matrix multiplication on LONGHORN, for matrix sizes m = n = k up to 60,000 on 1 to 32 Quadro FX5800 GPUs.

Cholesky factorization: performance

Figure: Performance (GFLOPS) of CUPLAPACK for the Cholesky factorization on LONGHORN, for matrix sizes up to 100,000 on 1 to 32 Quadro FX5800 GPUs.

GEMM: scalability

Figure: Scalability (speedup) of the PLAPACK-based codes for the matrix-matrix multiplication on LONGHORN, on 1 to 32 Quadro FX5800 GPUs. The reference is the performance of CUBLAS SGEMM on one GPU.

GEMM: PLAPACK vs. CUPLAPACK

Figure: GEMM on 16 nodes of LONGHORN; PLAPACK (128 Intel Xeon Nehalem cores) vs. CUPLAPACK (16 and 32 Quadro FX5800 GPUs).

Cholesky factorization: PLAPACK vs. CUPLAPACK

Figure: Cholesky factorization on 16 nodes of LONGHORN; PLAPACK (128 Intel Xeon Nehalem cores) vs. CUPLAPACK (16 and 32 Quadro FX5800 GPUs).

Multiple GPUs per node vs. one GPU per node

Figure: Cholesky factorization on 32 GPUs of LONGHORN, using one GPU per node (32 nodes) or two GPUs per node (16 nodes).

Conclusions and future work

Conclusions:
  Retargeting of PLAPACK to hybrid architectures.
  Transparent for programmers and library developers.
  Benefits from the layered design of PLAPACK.
  Remarkable performance and scalability on large GPU clusters.

Future work:
  Port the Elemental framework to GPUs.
  Out-of-core + GPU support to deal with even larger problems.

Acknowledgements

Thank you... Questions?

We thank the Texas Advanced Computing Center for granting access to the platform where the experiments were conducted.
http://www.tacc.utexas.edu/resources/visualization/#longhorn
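As a closing note on how GFLOPS figures like those in the experimental section are usually obtained: the convention is 2*m*n*k flops for GEMM and n^3/3 flops for the Cholesky factorization, divided by wall-clock time. A small sketch (the helper names are ours; the slides do not show this code):

```c
#include <assert.h>

/* GFLOPS rate for an m x n x k matrix-matrix multiplication. */
double gemm_gflops(double m, double n, double k, double seconds) {
    return 2.0 * m * n * k / seconds / 1e9;
}

/* GFLOPS rate for the Cholesky factorization of an n x n matrix. */
double chol_gflops(double n, double seconds) {
    return (n * n * n / 3.0) / seconds / 1e9;
}
```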