SWARM: A Parallel Programming Framework
for Multicore Processors
David A. Bader, Varun N. Kanade and Kamesh Madduri
Our Contributions
• SWARM: SoftWare and Algorithms for Running on Multicore, a portable open-source parallel framework for multicore systems
  – http://multicore-swarm.sourceforge.net/
• A library of fundamental parallel primitives: prefix-sums, list ranking, sorting and selection, symmetry breaking, etc.
• A computational model for analyzing algorithms on multicore systems
Talk Outline
• Introduction
• Model for algorithm design on multicore systems
  – Case studies
  – Performance
• SWARM
  – Programming framework
  – Algorithm design
  – Performance
Multicore Computing
High-performance multicore programming requirements:
– Exploit concurrency at the algorithmic level, and design efficient parallel algorithms
– Judiciously utilize memory bandwidth
– Minimize inter-processor communication and synchronization
• Model for multicore algorithm design
• SWARM: A Parallel Programming Framework
Multicore Algorithm Design
• Architectural model: p homogeneous processing cores, dedicated L1 cache, shared L2 cache
• Memory bandwidth: L1 cache > L2 cache > main memory
Complexity Model
Algorithm complexity is given by:
• Computational complexity:
  – RAM model of computation; complexity as a function of input size
• Memory accesses:
  – B: number of blocks transferred from main memory to the shared L2 cache; σ: bandwidth parameter
  – Aggarwal–Vitter I/O model of computation
• Synchronization:
  – S(n): complexity of synchronization operations (barriers, locks); L: synchronization overhead parameter
T_C(n,p) = max_{i=1,...,p} T_{C,i}(n,p)
T_M(n) = B(n) · σ^(-1)
T_S(n) = S(n) · L
Multicore Algorithm Design
• Complexity is given by the tuple <T_C, T_M, T_S>
• Cache-aware approaches
  – Data layout optimizations, blocking/tiling, padding
  – Merge sort case study
• Cache-oblivious algorithms
• Minimize synchronization overhead
  – Lock-free algorithms, atomic operations
• All of these paradigms are considered in the design of SWARM's parallel primitives and libraries
Test Platform
• Sun Fire T2000 (UltraSPARC T1 "Niagara" processor)
  – 8 multithreaded cores
  – 4 threads per core
  – Shared 3 MB on-chip L2 cache
  – 1.0 GHz clock speed
[Figure: Tiling optimization benchmark, performance on the Sun Fire T2000. Speedup of the cache-aware approach over the naive approach (roughly 0.96x to 1.14x) versus problem size (log scale, 2^15 to 2^28), for 2, 4, and 8 cores with 1 thread per core, and for 8 cores with 2 threads per core.]
SWARM
• Multicore programming framework and a collection of multicore-optimized libraries
• Portable, open-source
  – http://multicore-swarm.sourceforge.net/
  – Current release: version 1.1
• Examples and efficient implementations of various parallel programming primitives
  – Prefix-sums, pointer-jumping, divide and conquer, pipelining, graph algorithms, symmetry breaking
SWARM
• POSIX threads-based framework
• Support for parallel primitives
  – Data parallel, control, memory management
  – barrier, replicate, scan, broadcast
• Incremental approach to parallelizing applications
  – Minimal modifications to existing code
Typical SWARM Usage
Original C code:

int main (int argc, char **argv) {
  /* sequential code */
  ....
  fun1(a, b);
  /* more sequential code */
  fun2(c, d);
  ....
}

Identify the compute-intensive functions and parallelize them with the SWARM library:

int main (int argc, char **argv) {
  SWARM_Init(&argc, &argv);
  /* sequential code */
  ....
  SWARM_Run((void *) fun1(a, b));
  /* more sequential code */
  fun2(c, d);
  ....
  SWARM_Finalize();
}
Data parallelism: pardo directive
/* pardo example: partitioning a
"for" loop among the cores */
pardo(i, start, end, incr) {
A[i] = B[i] + C[i];
}
• pardo ("parallel do") implicitly partitions a loop among the cores without the need for explicit coordination.
• SWARM provides both block and cyclic partitioning options.
Control
• SWARM control primitives restrict the threads that can participate in a context.
// THREADS: total number of execution threads
// MYTHREAD: the rank of a thread, from 0 to THREADS-1
/* example: execute code on a specific thread */
on_thread (3) {
....
....
}
/* example: execute code on just one thread */
on_one_thread {
...
...
}
Memory management
• SWARM provides two directives
  – SWARM_malloc: dynamically allocate a shared structure
  – SWARM_free: release shared memory back to the heap
/* example: allocate a shared array of size n */
A = (int *) SWARM_malloc(n * sizeof(int), TH);
/* example: free the array A */
SWARM_free(A);
Synchronization
• Barrier: SWARM_Barrier()
  – Two variants
• Locks
  – POSIX threads mutex locks, user-level atomic locks
Other communication primitives
• Broadcast: supplies each processing core with the address of the shared buffer by replicating the memory address.
• Reduce: performs a reduction operation with a binary associative operator, such as addition, multiplication, maximum, minimum, bitwise AND, or bitwise OR.
/* broadcast function signatures */
int   SWARM_Bcast_i (int myval, THREADED);
int*  SWARM_Bcast_ip(int* myval, THREADED);
char  SWARM_Bcast_c (char myval, THREADED);
/* reduce function signatures */
int    SWARM_Reduce_i(int myval, reduce_t op, THREADED);
double SWARM_Reduce_d(double myval, reduce_t op, THREADED);
SWARM: Merge Sort Performance, Sun Fire T2000
SWARM: List Ranking Performance, Sun Fire T2000
SWARM: List Ranking Performance, Intel dual-core Xeon 5150
Acknowledgment of Support
• National Science Foundation
  – CSR: A Framework for Optimizing Scientific Applications (06-14915)
  – CAREER: High-Performance Algorithms for Scientific Applications (06-11589; 00-93039)
  – ITR: Building the Tree of Life -- A National Resource for Phyloinformatics and Computational Phylogenetics (EF/BIO 03-31654)
  – ITR/AP: Reconstructing Complex Evolutionary Histories (01-21377)
  – DEB: Comparative Chloroplast Genomics: Integrating Computational Methods, Molecular Evolution, and Phylogeny (01-20709)
  – ITR/AP(DEB): Computing Optimal Phylogenetic Trees under Genome Rearrangement Metrics (01-13095)
  – DBI: Acquisition of a High Performance Shared-Memory Computer for Computational Science and Engineering (04-20513)
• IBM PERCS / DARPA High Productivity Computing Systems (HPCS)
  – DARPA Contract NBCH30390004
• IBM Shared University Research (SUR) Grant
• Sony-Toshiba-IBM (STI)
• Microsoft Research
• Sun Academic Excellence Grant
Conclusions
• SWARM: SoftWare and Algorithms for Running on Multicore, a portable open-source parallel framework for multicore systems
  – http://multicore-swarm.sourceforge.net/
• We present a complexity model for algorithm design on multicore systems
  – It is critical to optimize memory access patterns and synchronization on multicore systems
• Future work: more SWARM libraries and primitives