

Page 1
SWARM: A Parallel Programming Framework for Multicore Processors
David A. Bader, Varun N. Kanade and Kamesh Madduri

Page 2
Our Contributions
• SWARM: SoftWare and Algorithms for Running on Multicore, a portable open-source parallel framework for multicore systems
  – http://multicore-swarm.sourceforge.net/
• A library of fundamental parallel primitives: prefix-sums, list ranking, sorting and selection, symmetry breaking, etc.
• Computational model for analyzing algorithms on multicore systems

Page 3
Talk Outline
• Introduction
• Model for algorithm design on multicore systems
  – Case studies
  – Performance
• SWARM
  – Programming framework
  – Algorithm design
  – Performance

Page 4
Multicore Computing
High-performance multicore programming requirements:
– Exploit concurrency at the algorithmic level, and design efficient parallel algorithms
– Judiciously utilize memory bandwidth
– Minimize inter-processor communication and synchronization

Page 5
Multicore Computing
Model for multicore algorithm design
SWARM: A Parallel Programming Framework

Page 6
Multicore Algorithm Design
• Architectural model: p homogeneous processing cores, each with a dedicated L1 cache, and a shared L2 cache
• Memory bandwidth: L1 cache > L2 cache > main memory

Page 7
Complexity Model
Algorithm complexity is given by:
• Computational complexity:
  – RAM model of computation; complexity as a function of input size
• Memory accesses:
  – B: number of blocks transferred from main memory to the shared L2 cache; σ: bandwidth parameter
  – Aggarwal-Vitter I/O model of computation
• Synchronization:
  – S(n): complexity of synchronization operations (barriers, locks); L: synchronization overhead parameter

  T_C(n, p) = \max_{i = 1, \ldots, p} T_{C,i}(n, p)
  T_M(n) = \sigma \cdot B(n)
  T_S(n) = L \cdot S(n)
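To make the three terms concrete, here is a worked example that is not from the original slides (an illustrative assumption): summing n elements on p cores, where a cache block holds b elements and a single barrier precedes the final combine:

  T_C(n, p) = \max_{i = 1, \ldots, p} O(n/p) = O(n/p)  % each core adds its n/p elements
  T_M(n) = \sigma \cdot B(n) = \sigma \cdot O(n/b)     % the array streams through the shared L2 once
  T_S(n) = L \cdot S(n) = L \cdot O(1)                 % a single barrier before combining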

Page 8
Multicore Algorithm Design
• Complexity is given by the tuple ⟨T_C, T_M, T_S⟩
• Cache-aware approaches
  – Data layout optimizations, blocking/tiling, padding (see the tiling sketch below)
  – Merge sort case study
• Cache-oblivious algorithms
• Minimize synchronization overhead
  – Lock-free algorithms, atomic operations
• All of these paradigms are considered in the design of SWARM's parallel primitives and library
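To make the blocking/tiling bullet concrete, here is a minimal sketch (illustrative C, not SWARM library code; the function name and tile size are assumptions) of a tiled matrix transpose, which touches the matrix one cache-resident block at a time instead of striding across whole rows:

#include <stddef.h>

#define TILE 32  /* tile edge; chosen so two tiles fit comfortably in cache */

/* Transpose an n x n matrix in TILE x TILE blocks (row-major storage). */
void transpose_tiled(const double *a, double *b, size_t n)
{
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t jj = 0; jj < n; jj += TILE)
            /* work on one cache-resident tile at a time */
            for (size_t i = ii; i < ii + TILE && i < n; i++)
                for (size_t j = jj; j < jj + TILE && j < n; j++)
                    b[j * n + i] = a[i * n + j];
}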

Page 9
Test Platform
• Sun Fire T2000 (UltraSPARC T1 "Niagara" processor)
  – 8 multithreaded cores
  – 4 threads per core
  – Shared 3 MB on-chip L2 cache
  – 1.0 GHz clock speed

Page 10
[Figure: Tiling optimization benchmark, performance on the Sun Fire T2000. X-axis: problem size (log2 scale, 15 to 28); y-axis: speedup of the cache-aware approach over the naive approach (0.96 to 1.14). Series: 2, 4, and 8 cores with 1 thread per core, and 8 cores with 2 threads per core.]

Page 11
SWARM
• Multicore programming framework and a collection of multicore-optimized libraries
• Portable, open-source
  – http://multicore-swarm.sourceforge.net/
  – Current version: 1.1
• Examples and efficient implementations of various parallel programming primitives
  – Prefix-sums, pointer-jumping, divide and conquer, pipelining, graph algorithms, symmetry breaking

Page 12
SWARM
• POSIX-threads-based framework
• Support for parallel primitives
  – Data parallel, control, memory management
  – barrier, replicate, scan, broadcast
• Incremental approach to parallelizing applications
  – Minimal modifications to existing code

Page 13
Typical SWARM Usage
• Original C code:

int main (int argc, char **argv) {
    /* sequential code */
    ....
    fun1(a, b);
    /* more sequential code */
    fun2(c, d);
    ....
}

• Identify the compute-intensive functions, then parallelize them with the SWARM library:

int main (int argc, char **argv) {
    SWARM_Init(&argc, &argv);
    /* sequential code */
    ....
    SWARM_Run((void *) fun1(a, b));
    /* more sequential code */
    fun2(c, d);
    ....
    SWARM_Finalize();
}

Page 14
Data parallelism: pardo directive
/* pardo example: partitioning a "for" loop among the cores */
pardo(i, start, end, incr) {
    A[i] = B[i] + C[i];
}

• pardo ("parallel do") implicitly partitions a loop among the cores without the need for coordination.
• SWARM provides both block and cyclic partitioning options; the sketch below illustrates the difference.
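As a hedged illustration of the two partitioning schemes (plain C showing the effect, not SWARM's pardo implementation; THREADS and MYTHREAD are the SWARM variables described on the control slide, and A, B, C, n are illustrative):

/* Block partitioning: each thread owns one contiguous chunk. */
int i;
for (i = (MYTHREAD * n) / THREADS; i < ((MYTHREAD + 1) * n) / THREADS; i++)
    A[i] = B[i] + C[i];

/* Cyclic partitioning: iterations are dealt out round-robin. */
for (i = MYTHREAD; i < n; i += THREADS)
    A[i] = B[i] + C[i];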

Page 15
Control
• SWARM control primitives restrict threads that can participate in a context.

// THREADS: total number of execution threads
// MYTHREAD: the rank of a thread, from 0 to THREADS-1

/* example: execute code on a specific thread */
on_thread (3) {
    ....
    ....
}

/* example: execute code on just one thread */
on_one_thread {
    ...
    ...
}

Page 16
Memory management
• SWARM provides two directives:
  – SWARM_malloc: dynamically allocate a shared structure
  – SWARM_free: release shared memory back to the heap

/* example: allocate a shared array of size n */
A = (int *)SWARM_malloc(n*sizeof(int), TH);

/* example: free the array A */
SWARM_free(A);

Page 17
Synchronization
• Barrier: SWARM_Barrier() (see the sketch below)
  – Two variants
• Locks
  – POSIX-threads mutex locks, user-level atomic locks
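A hedged sketch of typical barrier use, not from the slides: it assumes SWARM_Barrier() is called inside a function launched via SWARM_Run (as on the usage slide), and A, B, n, and f are illustrative names. The barrier separates a write phase from a read phase, so no thread reads A[] before every thread has finished writing it:

int i;
pardo(i, 0, n, 1) {
    A[i] = f(i);            /* phase 1: each thread fills its share of A */
}
SWARM_Barrier();            /* all threads wait here */
pardo(i, 0, n, 1) {
    B[i] = A[(i + 1) % n];  /* phase 2: safely read across thread boundaries */
}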

Page 18
Other communication primitives
• Broadcast: supplies each processing core with the address of the shared buffer by replicating the memory address.
• Reduce: performs a reduction operation with a binary associative operator, such as addition, multiplication, maximum, minimum, bitwise-AND, and bitwise-OR.

/* function signatures */
int   SWARM_Bcast_i (int myval, THREADED);
int  *SWARM_Bcast_ip(int *myval, THREADED);
char  SWARM_Bcast_c (char myval, THREADED);

int    SWARM_Reduce_i(int myval, reduce_t op, THREADED);
double SWARM_Reduce_d(double myval, reduce_t op, THREADED);
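Putting the primitives together, here is a hedged sketch of a parallel sum in the style of the slides. The reduce_t constant SUM is an assumption not shown on the slides; pardo, on_one_thread, and TH/THREADED are used as presented earlier in the deck:

#include <stdio.h>

/* Sketch: sum an array with SWARM primitives (launched via SWARM_Run). */
void parallel_sum(int *A, int n, THREADED)
{
    int i, my_total = 0, total;

    /* each thread sums the loop iterations that pardo assigns to it */
    pardo(i, 0, n, 1) {
        my_total += A[i];
    }

    /* combine per-thread partial sums with a SUM reduction (assumed constant) */
    total = SWARM_Reduce_i(my_total, SUM, TH);

    /* one thread reports the result */
    on_one_thread {
        printf("sum = %d\n", total);
    }
}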

Page 19
SWARM: Merge Sort Performance, Sun Fire T2000

Page 20
SWARM: List Ranking Performance, Sun Fire T2000

Page 21
SWARM: List Ranking Performance, Intel dual-core Xeon 5150

Page 22
Acknowledgment of Support
• National Science Foundation
  – CSR: A Framework for Optimizing Scientific Applications (06-14915)
  – CAREER: High-Performance Algorithms for Scientific Applications (06-11589; 00-93039)
  – ITR: Building the Tree of Life -- A National Resource for Phyloinformatics and Computational Phylogenetics (EF/BIO 03-31654)
  – ITR/AP: Reconstructing Complex Evolutionary Histories (01-21377)
  – DEB: Comparative Chloroplast Genomics: Integrating Computational Methods, Molecular Evolution, and Phylogeny (01-20709)
  – ITR/AP(DEB): Computing Optimal Phylogenetic Trees under Genome Rearrangement Metrics (01-13095)
  – DBI: Acquisition of a High Performance Shared-Memory Computer for Computational Science and Engineering (04-20513)
• IBM PERCS / DARPA High Productivity Computing Systems (HPCS)
  – DARPA Contract NBCH30390004
• IBM Shared University Research (SUR) Grant
• Sony-Toshiba-IBM (STI)
• Microsoft Research
• Sun Academic Excellence Grant

Page 23
Conclusions
• SWARM: SoftWare and Algorithms for Running on Multicore, a portable open-source parallel framework for multicore systems
  – http://multicore-swarm.sourceforge.net/
• We present a complexity model for algorithm design on multicore systems
  – It is critical to optimize memory access patterns and synchronization on multicore systems
• Future work: more SWARM libraries and primitives
