In-situ Assessment of Device-side Compute Work for Dynamic Load Balancing in a GPU-accelerated PIC Code

Michael Rowan
Work with Kevin Gott, Axel Huebl, Jack Deslippe
See preprint here: https://arxiv.org/abs/2104.11385
PASC '21 — 07.05.2021

Outline:

  1. Load balancing intro
  2. Dynamic load balancing in a PIC code run on GPUs

GPU-accelerated machines entered the TOP500 rankings just over a decade ago.

[Figure: TOP500 list snapshots, Nov. 2008 and Nov. 2020, showing the rise of GPU-accelerated systems.]

How do we get optimal performance from these supercomputers?

  • Compilers
  • Algorithms/data structures
  • Load balancing

Particle-mesh codes parallelize via domain decomposition.

[Figure: the simulation domain is decomposed into boxes, each assigned to an MPI rank (0 or 1); a bar chart shows the compute work per rank.]

Particle-mesh simulations can suffer from load imbalance.

[Figure: the same decomposition, but the bar chart of compute work per rank is now unequal, i.e. the ranks are load imbalanced.]

Load imbalance can be corrected at run time.

Basic load-balancing algorithm for a distributed-memory particle-mesh code:


              // Every loadBalanceInterval steps, build a proposed
              // distribution mapping and adopt it only if it is
              // sufficiently better than the current one.
              if (step % loadBalanceInterval == 0) {
                  float currEff = 0.0, propEff = 0.0;
                  // makeNewDM fills currEff and propEff (by reference)
                  DistMapping newDM = makeNewDM(costs,
                                                currEff, propEff);
                  bool globUpdateDM = false;
                  if (myRank == root) {
                      // require >10% efficiency gain to rebalance
                      globUpdateDM = (propEff > 1.1*currEff);
                  }
                  bcast(&globUpdateDM, 1, root);  // broadcast the decision
                  if (globUpdateDM) {
                      bcast(&newDM[0], newDM.size(), root);
                      updateDistributionMapping(newDM);  // redistribute boxes
                  }
              }
              

How should costs be measured when running on a GPU-accelerated machine?

  1. Start timer
  2. Launch kernel
  3. Stop timer

Not like this! Kernel launches are asynchronous, so a CPU-side timer around the launch measures only the launch overhead, not the kernel's execution time.
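To make the pitfall concrete, here is a hedged sketch (hypothetical kernel, following the slide's launch notation):

        #include <chrono>

        auto t0 = std::chrono::steady_clock::now();
        mykernel<<<...>>>(...);  // asynchronous: returns immediately
        auto t1 = std::chrono::steady_clock::now();
        // t1 - t0 is only the launch overhead (microseconds),
        // independent of how long the kernel runs on the GPU.
        // A cudaDeviceSynchronize() before stopping the timer would
        // fix the number, but it stalls the CPU and defeats overlap.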

These are a few strategies appropriate for cost assessment on GPU machines:

  • Heuristic: number of particles and cells as proxy for compute work
  • CUPTI: use CUDA Profiling Tools Interface to access kernel times
  • GPU clock: use thread-summed kernel times as a relative measure of compute work

How to measure costs with heuristic?

Cost for rank $i$ is a linear combination of its numbers of particles and cells (a code sketch follows the list below):

\[\begin{aligned} c_{i} = \alpha \cdot n_{\rm particles} + \beta \cdot n_{\rm cells} \end{aligned} \]

  • $\alpha$ and $\beta$ are parameters representing relative computational cost of single particle vs. single cell
  • $\alpha$ and $\beta$ change depending on algorithm, hardware
  • In general, $\alpha$ and $\beta$ should be measured
  • Pros: vendor agnostic, low overhead
  • Cons: cumbersome tuning of parameters
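A minimal sketch of the accumulation (the Box type, its accessors, and myBoxes are hypothetical; $\alpha$ and $\beta$ are assumed to have been measured beforehand):

        // Sum the heuristic cost over the boxes owned by this rank;
        // alpha and beta are measured per algorithm and hardware.
        float rankCost = 0.0f;
        for (const Box& b : myBoxes) {
            rankCost += alpha*b.numParticles() + beta*b.numCells();
        }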

How to measure costs with CUPTI?

CUDA Profiling Tools Interface (CUPTI): docs.nvidia.com/cuda/cupti
GPU activity triggers callback functions that return buffers of CUPTI activity records

  • Pros: API enables powerful profiling capabilities
  • Cons: overhead, vendor specific

How to measure costs with CUPTI?

Initialize the trace:


                  cuptiActivityEnable(CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL);
                  cuptiActivityRegisterCallbacks(bfrRequest, bfrCompleted);
              

Trigger callback functions:


                  void CUPTIAPI bfrRequest (uint8_t **bfr, ...)
                  {
                      // Called by CUPTI when it needs an empty
                      // buffer from the client
                  }
                  void CUPTIAPI bfrCompleted (uint8_t *bfr, ...)
                  {
                      // Returns a buffer of completed activity
                      // records to the client
                  }
              

How to measure costs with CUPTI?

After launching kernels, flush the activity buffers; CUPTI returns the completed records through the bfrCompleted callback:


                  mykernel<<<...>>>(...);
                  cuptiActivityFlushAll(0); // triggers bfrCompleted(...) with
                                            // the completed activity records
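A hedged sketch of what the bfrCompleted body might do with the returned records (cuptiActivityGetNextRecord and CUpti_ActivityKernel4 are CUPTI's; the rankCost accumulator and the buffer handling are illustrative assumptions):

        #include <cupti.h>
        #include <cstdlib>

        extern float rankCost;  // hypothetical accumulator for this rank's cost

        void CUPTIAPI bfrCompleted (CUcontext ctx, uint32_t streamId,
                                    uint8_t *bfr, size_t size, size_t validSize)
        {
            CUpti_Activity *record = nullptr;
            // Walk the returned buffer, summing each kernel's duration
            while (cuptiActivityGetNextRecord(bfr, validSize, &record)
                   == CUPTI_SUCCESS) {
                if (record->kind == CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL) {
                    auto *k = (CUpti_ActivityKernel4 *) record;
                    rankCost += float(k->end - k->start);  // nanoseconds
                }
            }
            free(bfr);  // assumes the buffer was malloc'd in bfrRequest
        }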

How to measure costs with GPU clock?

Estimate relative compute work from thread-summed kernel time

  • Pros: vendor agnostic, no hyperparameter tuning
  • Cons: requires some data movement

How to measure costs with GPU clock?

Add the thread cycles, using atomicAdd for thread safety:


                  __global__ void mykernel (...) {
                      // clock64() avoids the wraparound of 32-bit clock()
                      long long start = clock64();
                      ⋮
                      // thread work
                      ⋮
                      float cycles = float(clock64() - start);

                      // cost_ptr is the pointer to this rank's cost
                      atomicAdd(cost_ptr, cycles);
                  }
              
  • Reduced overhead using pinned host memory (sketched below)
  • To use this, instrument the most expensive kernels
  • Overcomes a weakness of the heuristic, which is insensitive to how far particles move
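A hedged host-side sketch of the pinned-memory variant (hypothetical launch configuration; assumes the device supports atomic updates to mapped host memory, as the V100 does):

        // Put the cost accumulator in pinned, mapped host memory so
        // device-side atomicAdds land in it directly, with no explicit
        // device-to-host copy afterwards. With unified virtual
        // addressing, the same pointer is usable on the device.
        float *cost_ptr = nullptr;
        cudaHostAlloc((void**)&cost_ptr, sizeof(float), cudaHostAllocMapped);
        *cost_ptr = 0.0f;

        mykernel<<<...>>>(/* ..., */ cost_ptr);
        cudaDeviceSynchronize();  // ensure all atomicAdds have completed

        float rankCost = *cost_ptr;  // this rank's relative compute work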

Outline:

  1. Load balancing intro
  2. Dynamic load balancing in a PIC code run on GPUs

We studied these strategies in the particle-in-cell code WarpX.

Software stack, top to bottom:

  • WarpX: advanced PIC code, advanced physics
  • AMReX: mesh infrastructure, algorithms
  • MPI
  • CUDA, OpenMP, DPC++, HIP

[Figure courtesy of Max Thevenet.]

We choose laser-ion acceleration as a challenging test problem.

Rapid changes in the particle and field spatial profiles make this a challenging test problem.

Numerical experiments: 6–6144 Nvidia V100 GPUs on OLCF Summit

The inhomogeneity translates to different computational costs.

Computational costs are used to compute an optimal mapping from MPI ranks to sub-domains.

Knapsack: distribute costs across ranks as equally as possible (sketched below)
Space-filling curve (SFC): order boxes along the curve and partition them among ranks
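A hedged sketch of the knapsack assignment (a greedy heaviest-first heuristic; the actual AMReX implementation may differ in details):

        #include <algorithm>
        #include <functional>
        #include <queue>
        #include <utility>
        #include <vector>

        // Assign each box, heaviest first, to the currently
        // least-loaded rank; returns the box-to-rank mapping.
        std::vector<int> knapsack (const std::vector<float>& costs, int nRanks)
        {
            // Sort box indices by decreasing cost
            std::vector<int> order(costs.size());
            for (int i = 0; i < (int) order.size(); ++i) order[i] = i;
            std::sort(order.begin(), order.end(),
                      [&](int a, int b) { return costs[a] > costs[b]; });

            // Min-heap of (accumulated cost, rank)
            using Load = std::pair<float,int>;
            std::priority_queue<Load, std::vector<Load>,
                                std::greater<Load>> ranks;
            for (int r = 0; r < nRanks; ++r) ranks.push({0.0f, r});

            std::vector<int> boxToRank(costs.size());
            for (int i : order) {
                auto [load, r] = ranks.top();
                ranks.pop();
                boxToRank[i] = r;
                ranks.push({load + costs[i], r});
            }
            return boxToRank;
        }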

Dynamic load balancing is crucial to performance.

Static load balancing is not enough!

Efficiency: average cost divided by maximum cost

\[\begin{aligned} E \equiv c_{\rm avg}/c_{\rm max} \end{aligned} \]
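For example, with made-up costs across four ranks:

\[\begin{aligned} c = \{1, 2, 3, 6\} \;\Rightarrow\; c_{\rm avg} = 3, \; c_{\rm max} = 6, \; E = 0.5 \end{aligned} \]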

With optimal selection of parameters, we achieve around 3x–4x speedup.

Optimal performance with:

  • GPU clock cost collection
  • Knapsack algorithm
  • 9 boxes per GPU
  • Rebalance check every 10 steps
  • 10% improvement threshold

1.2x speedup over static load balancing
3.8x speedup over no load balancing

How much improvement expected from load balancing?

Performance model with strong-scaling measurements as input:

\[\begin{aligned} S = \left(\frac{c_{\rm max0}}{c_{\rm avg0}}\right)^{x} = \left(\frac{1}{E_0}\right)^{x} \end{aligned} \]

Estimate the speedup $S$ from the initial load imbalance $1/E_0$; the exponent $x$ comes from the strong-scaling measurements
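As an illustrative calculation (made-up numbers, not from the paper): an initial efficiency $E_0 = 0.3$ with exponent $x = 1$ would predict

\[\begin{aligned} S = (1/0.3)^{1} \approx 3.3 \end{aligned} \]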

The load balancing scheme achieves 62%–74% of this theoretical maximum.

Avoid out-of-memory on GPUs with load balancing!

What is new/innovative about this work?

  • Introduced GPU-applicable strategies for measuring the relative computational costs of sub-domains
  • Implemented a potentially vendor-neutral, in-situ, in-kernel cost measurement strategy based on the GPU clock
  • Implemented Nvidia CUPTI cost measurement, which incurs some overhead
  • Demonstrated effective dynamic GPU load balancing on a challenging WarpX use case at scale (6–6144 GPUs) on Summit
  • Introduced a strong-scaling-based performance model

With new strategies for GPU cost assessment, we achieved 3x–4x speedup on challenging plasma physics problem.

Thank you! I am happy to answer any questions.

Performance is tuned with additional algorithm-specific parameters.

Heuristic, GPU clock, CUPTI : cost collection method
Knapsack, SFC               : algorithm to update the distribution mapping
Boxes per GPU               : granularity of the domain decomposition
Load balance interval       : how often rebalancing is attempted
Improvement threshold       : efficiency gain required to rebalance