Retargeting PLAPACK to Clusters with Hardware Accelerators

Manuel Fogué (1), Francisco Igual (1), Enrique S. Quintana-Ortí (1), Robert van de Geijn (2)

(1) Departamento de Ingeniería y Ciencia de los Computadores, Universidad Jaume I, Castellón (Spain).
(2) Department of Computer Sciences and Texas Institute for Computational Engineering and Sciences, The University of Texas at Austin.
Motivation
Past, present and future of GPU computing

- Systems with one GPU. Software: CUDA, OpenCL, OpenGL, Cg, ATI Stream, ...
- Multi-GPU systems. Software: GPUSs, StarPU, FLAME SuperMatrix.
- Clusters of GPUs. Software: ?? Our approach: CUPLAPACK.

The future is today: China's new Nebulae supercomputer, powered by Tesla C2050 GPUs, is No. 2 in the Top500 list (June 2010).
Outline

1. Introduction
2. PLAPACK
3. Porting PLAPACK to clusters of GPUs
4. Experimental results
5. Conclusions and future work
Introduction

PLAPACK: parallel dense linear algebra package for message-passing architectures (i.e., clusters).
CUPLAPACK: retargeting of PLAPACK to clusters equipped with graphics processors.

Why a PLAPACK-based approach?
- Modularity: layered design.
- Programmability: FLAME methodology; object-based approach.

Goals:
1. Programmability: transparent port to clusters of GPUs.
2. Performance: clusters of GPUs with hundreds of nodes.
3. Scalability: sustain performance as the problem size grows.
PLAPACK
PLAPACK overview

PLAPACK: Parallel Linear Algebra Package.
"Library infrastructure for parallel implementation of linear algebra algorithms on distributed-memory machines."
- Natural transition from algorithms to distributed-memory codes.
- Focuses on programmability.
- Layered design; focuses on extensibility and flexibility.
- Physically Based Matrix Distribution (PBMD) and templates.
PLAPACK
Cholesky factorization. Right-looking variant (I)

Cholesky factorization of a positive definite matrix A:
\[
A = L L^T.
\]
Partition
\[
A = \begin{pmatrix} A_{11} & \star \\ A_{21} & A_{22} \end{pmatrix}, \qquad
L = \begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix},
\]
where A_{11} and L_{11} are b x b matrices. From A = L L^T, we have that
\[
\begin{pmatrix} A_{11} & \star \\ A_{21} & A_{22} \end{pmatrix}
= \begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix}
  \begin{pmatrix} L_{11}^T & L_{21}^T \\ 0 & L_{22}^T \end{pmatrix}
= \begin{pmatrix} L_{11} L_{11}^T & \star \\ L_{21} L_{11}^T & L_{21} L_{21}^T + L_{22} L_{22}^T \end{pmatrix}.
\]
PLAPACK
Cholesky factorization. Right-looking variant (II)

This yields the equations
\[
L_{11} L_{11}^T = A_{11}, \qquad
L_{21} = A_{21} L_{11}^{-T}, \qquad
A_{22} - L_{21} L_{21}^T = L_{22} L_{22}^T,
\]
which lead to the following algorithm:

1. Partition A = \begin{pmatrix} A_{11} & \star \\ A_{21} & A_{22} \end{pmatrix}, with A_{11} a b x b submatrix.
2. A_{11} := L_{11} such that A_{11} = L_{11} L_{11}^T (Cholesky factor of A_{11}).
3. A_{21} := L_{21} = A_{21} L_{11}^{-T}.
4. A_{22} := A_{22} - L_{21} L_{21}^T.
5. Continue recursively with A = A_{22} from step 1.

(A sequential sketch of these steps in terms of BLAS/LAPACK calls is given below.)
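The five steps map almost one-to-one onto standard routines. A minimal sequential sketch, assuming a column-major single-precision SPD matrix and the standard CBLAS/LAPACKE interfaces (the routine chol_right_looking itself is illustrative and not part of PLAPACK):

#include <cblas.h>
#include <lapacke.h>

/* Blocked right-looking Cholesky: overwrites the lower triangle of A with L. */
void chol_right_looking( int n, int b, float *A, int lda )
{
    for ( int k = 0; k < n; k += b ) {
        int nb = ( n - k < b ) ? n - k : b;           /* current block size        */
        float *A11 = &A[ k + k * lda ];               /* b x b diagonal block      */
        float *A21 = &A[ (k + nb) + k * lda ];        /* panel below A11           */
        float *A22 = &A[ (k + nb) + (k + nb) * lda ];
        int m = n - k - nb;                           /* order of the trailing A22 */

        /* Step 2: A11 := L11 (Cholesky factor of A11) */
        LAPACKE_spotrf( LAPACK_COL_MAJOR, 'L', nb, A11, lda );
        if ( m > 0 ) {
            /* Step 3: A21 := L21 = A21 * inv( L11^T ) */
            cblas_strsm( CblasColMajor, CblasRight, CblasLower, CblasTrans,
                         CblasNonUnit, m, nb, 1.0f, A11, lda, A21, lda );
            /* Step 4: A22 := A22 - L21 * L21^T (lower triangle only) */
            cblas_ssyrk( CblasColMajor, CblasLower, CblasNoTrans,
                         m, nb, -1.0f, A21, lda, 1.0f, A22, lda );
        }
        /* Step 5: the loop continues with the trailing submatrix A22 */
    }
}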
PLAPACK
Using PLAPACK as a library

From algorithms to code:

int PLA_Chol( int b, PLA_Obj A )
{
  PLA_Obj ABR = NULL, A11 = NULL, A21 = NULL;
  /* ... */

  /* View ABR = A */
  PLA_Obj_view_all( A, &ABR );

  while ( TRUE ) {
    /* Partition  ABR = / A11 || *   \
     *                  | =========== |
     *                  \ A21 || ABR  /
     * where A11 is b x b */
    PLA_Obj_split_4( ABR, b, b, &A11, PLA_DUMMY,
                                &A21, &ABR );

    /* A11 := L11 = Cholesky factor( A11 ) */
    PLA_Local_chol( PLA_LOWER_TRIANGULAR, A11 );

    /* Update A21 := L21 = A21 * inv( L11' ) */
    PLA_Trsm( PLA_SIDE_RIGHT, PLA_LOWER_TRIANGULAR,
              PLA_TRANSPOSE, PLA_NONUNIT_DIAG,
              one, A11, A21 );

    /* Update A22 := A22 - L21 * L21' */
    PLA_Syrk( PLA_LOWER_TRIANGULAR, PLA_NO_TRANS,
              minus_one, A21, one, ABR );
  }
  /* ... */
}
PLAPACK
Using PLAPACK as a library. Cholesky code
PLA_Chol (previous slide) and its GPU-accelerated counterpart CUPLA_Chol, shown side by side on the slide. The structure is identical; only the routines that operate on local data are replaced by their CUPLA_ versions:

int CUPLA_Chol( int b, PLA_Obj A )
{
  PLA_Obj ABR = NULL, A11 = NULL, A21 = NULL;
  /* ... */

  /* View ABR = A */
  PLA_Obj_view_all( A, &ABR );

  while ( TRUE ) {
    /* Partition  ABR = / A11 || *   \
     *                  | =========== |
     *                  \ A21 || ABR  /
     * where A11 is b x b */
    PLA_Obj_split_4( ABR, b, b, &A11, PLA_DUMMY,
                                &A21, &ABR );

    /* A11 := L11 = Cholesky factor( A11 ) */
    CUPLA_Local_chol( PLA_LOWER_TRIANGULAR, A11 );

    /* Update A21 := L21 = A21 * inv( L11' ) */
    CUPLA_Trsm( PLA_SIDE_RIGHT, PLA_LOWER_TRIANGULAR,
                PLA_TRANSPOSE, PLA_NONUNIT_DIAG,
                one, A11, A21 );

    /* Update A22 := A22 - L21 * L21' */
    CUPLA_Syrk( PLA_LOWER_TRIANGULAR, PLA_NO_TRANS,
                minus_one, A21, one, ABR );
  }
  /* ... */
}
Porting PLAPACK to clusters of GPUs
GPU acceleration of PLAPACK

How to accelerate PLAPACK?

Naive implementation:
- Objects created in RAM.
- GPU-CPU transfers bound to calculations: intercept calls to the local BLAS and transfer data to/from the GPU (see the sketch below).
- Straightforward implementation.

Tuned implementation:
- Objects created in GPU memory.
- GPU-CPU transfers bound to communications: data are moved to RAM only when necessary (i.e., for a communication).
- Creation of objects involves allocation on the GPUs.
- Requires modifications to the communication layer of PLAPACK (PLA_Copy, PLA_Reduce).
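A minimal sketch of what the naive interception of a local BLAS call could look like, using the legacy CUBLAS API available at the time (CUBLAS 2.x); the wrapper name naive_local_sgemm is illustrative and not part of PLAPACK or CUPLAPACK:

#include <cublas.h>   /* legacy CUBLAS API; assumes cublasInit() was called at startup */

/* Naive scheme: copy host operands to the GPU, compute there, copy the result back. */
void naive_local_sgemm( int m, int n, int k, float alpha,
                        const float *A, int lda, const float *B, int ldb,
                        float beta, float *C, int ldc )
{
    float *dA, *dB, *dC;
    cublasAlloc( m * k, sizeof(float), (void **)&dA );
    cublasAlloc( k * n, sizeof(float), (void **)&dB );
    cublasAlloc( m * n, sizeof(float), (void **)&dC );

    /* CPU -> GPU transfers bound to the calculation */
    cublasSetMatrix( m, k, sizeof(float), A, lda, dA, m );
    cublasSetMatrix( k, n, sizeof(float), B, ldb, dB, k );
    cublasSetMatrix( m, n, sizeof(float), C, ldc, dC, m );

    /* Local computation on the GPU */
    cublasSgemm( 'N', 'N', m, n, k, alpha, dA, m, dB, k, beta, dC, m );

    /* GPU -> CPU transfer of the result */
    cublasGetMatrix( m, n, sizeof(float), dC, m, C, ldc );

    cublasFree( dA ); cublasFree( dB ); cublasFree( dC );
}

Every intercepted call pays for the transfers of its operands, which is why the tuned implementation keeps the objects resident in GPU memory instead.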
Porting PLAPACK to clusters of GPUs
PLAPACK infrastructure

PLAPACK layered design. Necessary changes to use the GPUs:
1. PLA_Local_BLAS: NVIDIA CUBLAS.
2. LA object manipulation: management of objects in GPU memory (see the allocation sketch after this list).
3. Communication layer: transfers to RAM prior to communications.
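For illustration only, a sketch of what allocating the local buffer of an object directly in GPU memory could look like with the legacy CUBLAS API; the helper name and the idea of storing the returned device pointer in the object descriptor are assumptions, not the actual CUPLAPACK internals:

#include <cublas.h>   /* assumes cublasInit() was called at startup */

/* Hypothetical allocation of the local buffer of a distributed object on the GPU
   (tuned implementation: objects live in GPU memory, not in RAM). */
float *cupla_alloc_local_buffer( int local_m, int local_n )
{
    float *dev_buf = NULL;
    if ( cublasAlloc( local_m * local_n, sizeof(float), (void **)&dev_buf )
         != CUBLAS_STATUS_SUCCESS )
        return NULL;
    return dev_buf;   /* kept in the object descriptor instead of a host pointer */
}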
Porting PLAPACK to clusters of GPUs
LA object manipulation. Cholesky driver using GPU
int main( void ) {
  /* ... */
  /* Object creation on GPU */
  CUPLA_Matrix_create( MPI_FLOAT, size, size, templ,
                       PLA_ALIGN_FIRST, PLA_ALIGN_FIRST, &A_GPU );

  /* Object initialization on GPU */
  CUPLA_Obj_set_to_SPD_random( A_GPU );
  /* ... */

  /* Cholesky factorization on GPU */
  CUPLA_Chol( nb_alg, A_GPU );
  /* ... */

  /* Object destruction on GPU */
  CUPLA_Obj_free( &A_GPU );
}

CUPLA_Matrix_create creates a linear algebra object on the GPU.
CUPLA_Obj_free destroys an object on the GPU.
CUPLA_Chol operates with objects created on the GPU.
Porting PLAPACK to clusters of GPUs
Communication layer
Communication is restricted to the routines PLA_Copy and PLA_Reduce:

int PLA_Copy( PLA_Obj obj_from, PLA_Obj obj_to )
Purpose: copy the contents between linear algebra objects.
  IN      obj_from   Object to be copied
  IN/OUT  obj_to     Object into which to copy

int PLA_Reduce( PLA_Obj obj_from, MPI_Op op, PLA_Obj obj_to )
Purpose: reduce, using op, the contents of the duplicated object obj_from and overwrite obj_to with the result.
  IN      obj_from   Object to be reduced
  IN      op         Reduce operator to be used
  IN/OUT  obj_to     Object into which to reduce the result
Porting PLAPACK to clusters of GPUs
Modifications to PLA_Copy (I)
/* PLAPACK PLA_Copy between column-aligned matrices */
while ( TRUE ) {
  PLA_Obj_split_size( from_cur, PLA_SIDE_TOP, &size_from, &owner_from );
  PLA_Obj_split_size( to_cur,   PLA_SIDE_TOP, &size_to,   &owner_to );
  if ( 0 == ( size = min( size_from, size_to ) ) ) break;

  PLA_Obj_horz_split_2( from_cur, size, &from_1, &from_cur );
  PLA_Obj_horz_split_2( to_cur,   size, &to_1,   &to_cur );

  if ( myrow == owner_from && owner_from == owner_to ) {
    /* Source and destination are owned by the same process: local copy */
    PLA_Local_copy( from_1, to_1 );
  }
  else {
    if ( myrow == owner_from ) {
      /* Pack the local data and send it to the owner of the destination */
      PLA_Obj_get_local_contents( from_1, PLA_NO_TRANS, &dummy, &dummy,
                                  buffer_temp, size, 1 );
      MPI_Send( BF( buffer_temp ), size * local_width, datatype,
                owner_to, 0, comm_col );
    }
    if ( myrow == owner_to ) {
      /* Receive the data and unpack it into the destination object */
      MPI_Recv( BF( buffer_temp ), size * local_width, datatype,
                owner_from, MPI_ANY_TAG, comm_col, &status );
      PLA_Obj_set_local_contents( PLA_NO_TRANS, size, local_width,
                                  buffer_temp, size, 1, to_1 );
    }
  }
}
PLA_free( buffer_temp );
Porting PLAPACK to clusters of GPUs
Modifications to PLA_Copy (II)
/* PLAPACK PLA_Obj_get_local_contents */
int PLA_Obj_get_local_contents( PLA_Obj obj, int trans,
                                int *rows_in_buf, int *cols_in_buf,
                                void *buf, int ldim_buf, int stride_buf )
{
  /* ... */
  for ( j = 0; j < n; j++ ) {
    tmp_local = buf_local + j * ldim_buf * typesize;
    tmp_obj   = buf_obj   + j * ldim     * typesize;
    memcpy( tmp_local, tmp_obj, m * typesize );
  }
  /* ... */
}

/* CUPLAPACK CUPLA_Obj_get_local_contents */
int CUPLA_Obj_get_local_contents( PLA_Obj obj, int trans,
                                  int *rows_in_buf, int *cols_in_buf,
                                  void *buf, int ldim_buf, int stride_buf )
{
  /* ... */
  cublasGetMatrix( m, n, typesize, buf_obj, ldim, buf_local, ldim_buf );
  /* ... */
}

Necessary GPU-CPU transfers:
- Data is transferred to RAM prior to a communication.
- Data is transferred to the GPU after each communication.
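The opposite direction follows the same pattern. A plausible sketch of the receive side of the tuned copy, using the variable names of the PLA_Copy excerpt above (it mirrors what the slides describe; it is not the verbatim CUPLAPACK source):

#include <mpi.h>
#include <cublas.h>

/* Receive side of the tuned copy: the message still travels through a host
   buffer and is then uploaded into the GPU buffer of the local object. */
void recv_into_gpu_object( float *host_buf, float *gpu_buf,
                           int size, int local_width, int ldim,
                           int owner_from, MPI_Comm comm_col )
{
    MPI_Status status;

    /* Communication is performed from/to host memory */
    MPI_Recv( host_buf, size * local_width, MPI_FLOAT,
              owner_from, MPI_ANY_TAG, comm_col, &status );

    /* Data is transferred to the GPU after the communication */
    cublasSetMatrix( size, local_width, sizeof(float),
                     host_buf, size,     /* source: packed host buffer */
                     gpu_buf,  ldim );   /* destination: GPU buffer    */
}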
Experimental results
The LONGHORN visualization cluster

                           Per node                         Per system (256 nodes)
Number of cores            8                                2,048
CPU                        Intel Xeon Nehalem @ 2.53 GHz
Available memory           48 Gbytes                        13.5 TBytes
Interconnection network    QDR InfiniBand
Graphics system                                             128 NVIDIA Quadro Plex S4s
GPU                        2 x NVIDIA Quadro FX5800         512 x NVIDIA Quadro FX5800
Interconnection bus        PCIExpress 2.0 (8x)
Available video memory     8 Gbytes DDR3                    2 TBytes DDR3
Peak performance (SP)      161.6 GFLOPS                     41.40 TFLOPS
Peak performance (DP)      80.8 GFLOPS                      20.70 TFLOPS

Table: Detailed features of the LONGHORN cluster.

Software: PLAPACK version R32; BLAS: NVIDIA CUBLAS 2.2 and MKL 10.1; MPI: MVAPICH 1.4.
Experimental results
GEMM: performance
Figure: Performance (GFLOPS) of CUPLAPACK for the matrix-matrix multiplication on LONGHORN as a function of the matrix size (m = n = k), using 1, 2, 4, 8, 16 and 32 Quadro FX5800 GPUs.
Experimental results
Cholesky factorization: performance
Figure: Performance (GFLOPS) of CUPLAPACK for the Cholesky factorization on LONGHORN as a function of the matrix size, using 1, 2, 4, 8, 16 and 32 Quadro FX5800 GPUs.
Experimental results
GEMM: scalability
Figure: Scalability (speedup) of the PLAPACK-based codes for the matrix-matrix multiplication on LONGHORN as a function of the matrix size (m = n = k), using 1, 2, 4, 8, 16 and 32 Quadro FX5800 GPUs. The reference is the performance of CUBLAS SGEMM on one GPU.
Experimental results
GEMM: PLAPACK vs. CUPLAPACK
Figure: GEMM (GFLOPS) on 16 nodes of LONGHORN, PLAPACK vs. CUPLAPACK, as a function of the matrix size (m = n = k): CUPLAPACK on 32 Quadro FX5800, CUPLAPACK on 16 Quadro FX5800, and PLAPACK on 128 Intel Xeon Nehalem cores.
Experimental results
Cholesky factorization: PLAPACK vs. CUPLAPACK
Figure: Cholesky factorization (GFLOPS) on 16 nodes of LONGHORN, PLAPACK vs. CUPLAPACK, as a function of the matrix size: CUPLAPACK on 32 Quadro FX5800, CUPLAPACK on 16 Quadro FX5800, and PLAPACK on 128 Intel Xeon Nehalem cores.
Experimental results
Multiple GPUs per node vs. One GPU per node
Figure: Cholesky factorization (GFLOPS) on 32 GPUs of LONGHORN as a function of the matrix size, using one GPU per node (32 Quadro FX5800 on 32 nodes) or two GPUs per node (32 Quadro FX5800 on 16 nodes).
Conclusions and future work

Conclusions:
- Retargeting of PLAPACK to hybrid architectures.
- Transparent for programmers and library developers.
- Benefits from the layered design of PLAPACK.
- Remarkable performance and scalability on large GPU clusters.

Future work:
- Port the Elemental framework to GPUs.
- Out-of-core + GPU support to deal with even larger problems.
Conclusions and future work
Acknowledgements
Thank you... Questions?

We thank the Texas Advanced Computing Center for granting access to the platform where the experiments were conducted (http://www.tacc.utexas.edu/resources/visualization/#longhorn).
