Michael Rowan
Work with Kevin Gott, Axel Huebl, Jack Deslippe
See preprint here: https://arxiv.org/abs/2104.11385
PASC '21 — 07.05.2021
[Figure: bar charts of compute work vs. rank, illustrating load imbalance across ranks]
Basic load balance algorithm for distributed memory particle-mesh:
if (step % loadBalanceInterval == 0) {
    // Propose a new distribution mapping from the measured costs;
    // makeNewDM also fills in the current and proposed efficiencies.
    float currEff = 0.0, propEff = 0.0;
    DistMapping newDM = makeNewDM(costs, currEff, propEff);

    // Root decides whether the proposed mapping beats the current
    // one by the improvement threshold (here 10%) ...
    bool globUpdateDM = false;
    if (myRank == root) {
        globUpdateDM = (propEff > 1.1*currEff);
    }
    bcast(&globUpdateDM, 1, root);

    // ... and if so, all ranks adopt the new mapping.
    if (globUpdateDM) {
        bcast(&newDM[0], newDM.size(), root);
        updateDistributionMapping(newDM);
    }
}
How to measure the costs? Not with host-side timers around kernel launches: CPU and GPU execution are asynchronous, so a CPU timer returns before the kernel finishes.
Heuristic: the cost for rank $i$ is a linear combination of its particle and cell counts:
\[\begin{aligned} c_{i} = \alpha \cdot n_{\rm particles} + \beta \cdot n_{\rm cells} \end{aligned} \]
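As a minimal sketch (illustrative names, not WarpX's actual implementation), the heuristic cost of a rank can be accumulated over the boxes it owns:

#include <vector>

// Hypothetical per-box bookkeeping
struct BoxInfo { long nParticles; long nCells; };

// Heuristic cost of one rank: weighted sum over its boxes;
// alpha and beta are the tunable particle and cell weights
float rankCost (const std::vector<BoxInfo>& myBoxes, float alpha, float beta)
{
    float c = 0.0f;
    for (const BoxInfo& b : myBoxes) {
        c += alpha * b.nParticles + beta * b.nCells;
    }
    return c;
}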
CUDA Profiling Tools Interface (CUPTI): docs.nvidia.com/cuda/cupti
GPU activity triggers callback functions to return CUPTI buffers
Initialize the trace:
cuptiActivityEnable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);
cuptiActivityRegisterCallbacks(bfrRequest, bfrCompleted);
Trigger callback functions:
void CUPTIAPI bfrRequest (uint8_t **bfr, ...)
{
    // Called by CUPTI when it needs an empty buffer
    // from the client to fill with activity records
}
void CUPTIAPI bfrCompleted (uint8_t *bfr, ...)
{
    // Called by CUPTI to hand a buffer of completed
    // activity records back to the client
}
⋮
mykernel<<<...>>>(...);
cuptiActivityFlushAll(0); // wait for CUPTI to return completed
➞ bfrCompleted(...);      // activity records via the callback
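Filled out, these callbacks might look like the sketch below, based on the public CUPTI activity-buffer API; error handling is omitted, the accumulator name is illustrative, and the kernel record struct version (here CUpti_ActivityKernel4) depends on the CUPTI release:

#include <cupti.h>
#include <cstdlib>

static unsigned long long g_kernelTimeNs = 0; // this rank's kernel time

static void CUPTIAPI bfrRequest (uint8_t **bfr, size_t *size,
                                 size_t *maxNumRecords)
{
    // CUPTI asks the client for an empty buffer to fill with records
    *size = 16 * 1024;
    *bfr = (uint8_t *) malloc(*size);
    *maxNumRecords = 0; // 0: as many records as fit
}

static void CUPTIAPI bfrCompleted (CUcontext ctx, uint32_t streamId,
                                   uint8_t *bfr, size_t size,
                                   size_t validSize)
{
    // CUPTI hands back a buffer of completed activity records;
    // walk it and accumulate the kernel durations
    CUpti_Activity *rec = nullptr;
    while (cuptiActivityGetNextRecord(bfr, validSize, &rec) == CUPTI_SUCCESS) {
        if (rec->kind == CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL) {
            auto *k = (CUpti_ActivityKernel4 *) rec;
            g_kernelTimeNs += k->end - k->start;
        }
    }
    free(bfr);
}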
GPU clock: estimate relative compute work from thread-summed kernel cycles (a proxy for time). Each thread adds its cycle count to the rank's cost with atomicAdd for thread safety:
__global__ void mykernel (...) {
    // clock64() reads the cycle counter without the
    // wraparound of the 32-bit clock()
    long long start = clock64();
    ⋮
    // thread work
    ⋮
    long long cycles = clock64() - start;
    // cost_ptr points to this rank's accumulated cost
    atomicAdd(cost_ptr, (float) cycles);
}
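On the host, the accumulated cycles can then be read back as this rank's cost (a sketch; the launch configuration and other kernel arguments are placeholders):

float *cost_ptr;
cudaMallocManaged(&cost_ptr, sizeof(float)); // device-visible accumulator
*cost_ptr = 0.0f;
mykernel<<<nBlocks, nThreads>>>(/* ..., */ cost_ptr);
cudaDeviceSynchronize(); // ensure all atomicAdds have completed
float myCost = *cost_ptr; // this rank's relative compute cost
cudaFree(cost_ptr);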
[Software stack: WarpX (advanced PIC code, advanced physics) builds on AMReX (mesh infrastructure, algorithms), which runs on MPI plus CUDA, OpenMP, DPC++, or HIP]
Courtesy of Max Thevenet
Rapid changes in the particle and field spatial profiles $\rightarrow$ a challenging load-balancing problem
Numerical experiments: 6–6144 Nvidia V100 GPUs on OLCF Summit
Knapsack: distribute costs across ranks as equally as possible (sketched below)
Space-filling curve (SFC): enumerate boxes along a space-filling curve and partition the curve among ranks
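A minimal sketch of the knapsack idea, as the classic greedy heuristic (assign each box, heaviest first, to the currently least-loaded rank); names are illustrative, not AMReX's actual implementation:

#include <algorithm>
#include <functional>
#include <queue>
#include <vector>

// Returns rankOf[i] = rank assigned to box i, given per-box costs
std::vector<int> knapsackAssign (const std::vector<float>& costs, int nRanks)
{
    // Box indices sorted by descending cost
    std::vector<int> order(costs.size());
    for (int i = 0; i < (int) order.size(); ++i) order[i] = i;
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return costs[a] > costs[b]; });

    // Min-heap of (accumulated cost, rank)
    using Load = std::pair<float,int>;
    std::priority_queue<Load, std::vector<Load>, std::greater<Load>> ranks;
    for (int r = 0; r < nRanks; ++r) ranks.push({0.0f, r});

    std::vector<int> rankOf(costs.size());
    for (int i : order) {
        auto [load, r] = ranks.top();
        ranks.pop();
        rankOf[i] = r;
        ranks.push({load + costs[i], r});
    }
    return rankOf;
}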
Static load balancing is not enough!
Efficiency: average cost / maximum cost
\[\begin{aligned} E \equiv c_{\rm avg}/c_{\rm max} \end{aligned} \]
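Given each rank's local cost, the efficiency can be measured with two reductions (a sketch assuming MPI is initialized; myCost could come from the hypothetical rankCost helper earlier):

#include <mpi.h>

// Measured parallel efficiency E = c_avg / c_max
float measureEfficiency (float myCost)
{
    float cSum = 0.0f, cMax = 0.0f;
    MPI_Allreduce(&myCost, &cSum, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    MPI_Allreduce(&myCost, &cMax, 1, MPI_FLOAT, MPI_MAX, MPI_COMM_WORLD);
    int nRanks;
    MPI_Comm_size(MPI_COMM_WORLD, &nRanks);
    return (cSum / nRanks) / cMax;
}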
Optimal performance with dynamic load balancing:
1.2x speedup over static load balancing
3.8x speedup over no load balancing
Performance model with strong-scaling data as input:
\[\begin{aligned} S = \left(\frac{c_{\rm max0}}{c_{\rm avg0}}\right)^{x} = \left(\frac{1}{E_0}\right)^{x} \end{aligned} \]
Estimated speedup $S$ grows as a power $x$ of the initial load imbalance $1/E_0$
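For illustration with hypothetical numbers: if the fitted exponent were $x = 1$ and the initial efficiency were $E_0 = 0.5$, the model would predict
\[\begin{aligned} S = \left(\frac{1}{E_0}\right)^{x} = \left(\frac{1}{0.5}\right)^{1} = 2 \end{aligned} \]
i.e., a 2x speedup from rebalancing.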
Avoid out-of-memory on GPUs with load balancing!
Work is open source:
Code, environment, tests all available at:
See preprint here: https://arxiv.org/abs/2104.11385
Personal github:
Thank you! I am happy to answer any questions.
Cost collection method : heuristic, GPU clock, or CUPTI
Distribution mapping update algorithm : knapsack or SFC
Boxes per GPU : controls granularity of the domain decomposition
Load balance interval : how often to try rebalancing
Improvement threshold : required improvement to rebalance