Acceleration and parallelization
Here we give an overview of the parallelization and acceleration schemes employed by GROMACS. The aim is, on the one hand, to provide an understanding of the underlying mechanisms that make GROMACS one of the fastest molecular dynamics packages. On the other hand, the information presented should help in choosing the appropriate parallelization options, run configuration, and acceleration options in order to achieve optimal simulation performance.
Terms and definitions
Acceleration
SSE, AVX, etc
TBA
GPU acceleration
GROMACS 4.5 introduced the first version of GPU acceleration, based on the OpenMM library. This version executes the entire simulation on the GPU and does not use the CPU for anything but input-output. While this approach avoids the CPU-GPU communication bottleneck, it supports only a small subset of GROMACS features and delivers a substantial speedup over CPU runs only for implicit solvent simulations.
With GROMACS 4.6, native GPU acceleration support is introduced. The most compute-intensive part of a simulation, the non-bonded force calculation, can be offloaded to a GPU and carried out simultaneously with the CPU calculation of bonded forces and PME electrostatics. Native GPU acceleration is supported with the verlet cut-off scheme (not with the group scheme) with PME, reaction-field, and plain cut-off electrostatics.
The non-bonded GPU kernels are implemented in CUDA and run on NVIDIA hardware with compute capability 2.0 (“Fermi”) and newer. Although low-end GPUs will work, typically at least a mid-class consumer GPU is needed to achieve a speedup compared to CPU-only runs on a recent processor. For optimal performance with multiple GPUs, especially in multi-node runs, it is advised that identical hardware is used.
Native GPU acceleration can be turned on or off using the GMX_GPU CMake variable.
Parallelization schemes
Being performance oriented, GROMACS has a strong focus on efficient parallelization. As of v4.6 there are multiple parallelization schemes available, so a simulation can be run on given hardware with different run configurations. Here we describe the different schemes employed in GROMACS 4.6, highlighting the differences and providing a guide for running efficient simulations.
MPI
Parallelization based on MPI has been part of GROMACS since the early versions and hence is compatible with the majority of MD algorithms. At the heart of the MPI parallelization is the neutral-territory domain decomposition, which supports fully automatic dynamic load balancing. To parallelize simulations across multiple machines (e.g. nodes of a cluster), mdrun needs to be compiled with MPI, which is controlled by the GMX_MPI CMake variable.
Multithreading with thread-MPI
The thread-MPI library provides a multithreaded implementation of the subset of the MPI 1.1 specification required by GROMACS. Implementations based on both POSIX pthreads and Windows threads provide good portability across most UNIX/Linux and Windows operating systems. Acting as a drop-in replacement for MPI, thread-MPI makes it possible to compile and run mdrun on a single machine without needing MPI. Besides providing a convenient way to use computers with multicore CPU(s), the thread-MPI enabled mdrun also runs slightly faster than with MPI.
Thread-MPI is included in the GROMACS source and has been the default parallelization since v4.5, practically rendering the serial mdrun deprecated. Compilation with thread-MPI is controlled by the GMX_THREAD_MPI CMake variable.
By default the thread-MPI multithreaded mdrun will use all available cores in the machine by starting as many threads as the number of cores. The number of threads can be controlled using the -nt option. Note that in v4.5.x, if the number of threads mdrun uses is equal to the total number of cores, each thread gets locked to "its" core.
Multi-level parallelization: MPI and OpenMP
The multi-core trend in CPU development substantiates the need for multi-level parallelization. Current multiprocessor machines can have 2-4 CPUs with a core count as high as 64. As the memory and cache subsystem is lagging more and more behind the multicore evolution, non-uniform memory access (NUMA) effects gain importance and can become a limiting factor to performance. At the same time, all cores share a network interface. In a purely MPI-parallel scheme all MPI processes use the same network interface, and although MPI intra-node communication is generally efficient, communication between nodes can become a limiting factor to parallelization. This is especially pronounced in the case of highly parallel simulations with PME (which is very communication intensive) and with "fat" nodes connected by a slow network. Multi-level parallelism aims to address the NUMA and communication related issues by employing efficient intra-node parallelism, typically multithreading.
With GROMACS 4.6, OpenMP multithreading is supported in mdrun and, combined with MPI (or thread-MPI), it enables multi-level and heterogeneous parallelization. While full OpenMP multithreading is implemented with the verlet cut-off scheme, the “group” scheme only supports OpenMP threading for PME.
OpenMP is enabled by default in GROMACS 4.6 and can be turned on/off with the GMX_OPENMP CMake variable.
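The CMake variables mentioned so far (GMX_MPI, GMX_THREAD_MPI, GMX_OPENMP, GMX_GPU) are combined on the cmake command line at configure time. The following is a minimal sketch that only prints such a command line; the helper name gmx_cmake_line and the ".." source directory (a build directory inside the source tree) are assumptions made for illustration, not part of GROMACS.

```shell
# Hypothetical helper (not part of GROMACS): print a cmake command line that
# switches on the requested features via the variables named in the text.
gmx_cmake_line() {
    local flags="" feature
    for feature in "$@"; do
        case $feature in
            mpi)        flags="$flags -DGMX_MPI=ON" ;;
            thread-mpi) flags="$flags -DGMX_THREAD_MPI=ON" ;;
            openmp)     flags="$flags -DGMX_OPENMP=ON" ;;
            gpu)        flags="$flags -DGMX_GPU=ON" ;;
            *) echo "unknown feature: $feature" >&2; return 1 ;;
        esac
    done
    # ".." assumes the command is run from a build directory in the source tree.
    echo "cmake ..$flags"
}

# Command line for an MPI build with OpenMP and GPU acceleration.
gmx_cmake_line mpi openmp gpu
```

The printed line can be inspected and then run by hand; setting a variable to OFF instead of ON disables the respective feature.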
While the OpenMP implementation itself is quite efficient (up to 12-16 threads), combining it with MPI adds overhead, especially when running separate multi-threaded PME nodes. Depending on the architecture, input system size, and other factors, MPI+OpenMP runs can be as fast or faster already at a small number of processes (e.g. multi-processor Intel Westmere) but can also be considerably slower (e.g. multi-processor AMD Interlagos machines). However, the benefit of multi-level parallelization is more pronounced in highly parallel runs.
Hybrid/heterogeneous parallelization
GROMACS 4.6 introduces hybrid parallelization by making use of GPUs to accelerate the non-bonded force calculation. Along with the verlet cut-off scheme, new non-bonded algorithms have been developed with the aim of efficient acceleration on both CPUs and GPUs.
To efficiently use all compute resources available, CPU and GPU computation is done simultaneously. Non-bonded forces are calculated on the GPU, overlapping with the OpenMP multithreaded bonded force and PME long-range electrostatics calculations on the CPU. Multiple GPUs, both in a single node and across multiple nodes, are supported using domain decomposition. A single GPU is assigned to the non-bonded workload of a domain; therefore, the number of GPUs used has to match the number of MPI processes (or thread-MPI threads) the simulation is started with. That is, the available CPU cores are partitioned among the processes (or thread-MPI threads), and a set of cores together with a GPU does the calculations on the respective domain.
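The partitioning of cores among the per-GPU processes comes down to simple arithmetic. A minimal sketch, assuming an even split and a hypothetical helper threads_per_gpu_rank (not part of GROMACS; mdrun does this partitioning on its own):

```shell
# Hypothetical helper (not part of GROMACS): given the number of CPU cores and
# GPUs in a node, print how many cores (OpenMP threads) each process gets when
# one MPI process or thread-MPI thread is started per GPU.
threads_per_gpu_rank() {
    local cores=$1 gpus=$2
    if [ "$gpus" -lt 1 ] || [ "$cores" -lt "$gpus" ]; then
        echo "need at least one GPU and at least one core per GPU" >&2
        return 1
    fi
    # Integer division: leftover cores stay unused in this simple sketch.
    echo $(( cores / gpus ))
}

# An 8-core machine with two GPUs: two processes with 4 cores each.
threads_per_gpu_rank 8 2
```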
With PME simulations, mdrun supports automated CPU-GPU load balancing by shifting workload between the real- and reciprocal-space parts of electrostatics. At startup a few hundred iterations of tuning are executed, which involves scaling the electrostatics cut-off to determine the value that gives optimal CPU-GPU load balance. The cut-off value provided using the rcoulomb mdp option represents the minimum electrostatics cut-off the tuning starts with and therefore should be chosen as small as possible (but still reasonable for the physics simulated).
While the automated CPU-GPU load balancing always attempts to find the optimal cut-off setting, it might not always be possible to balance CPU and GPU workload. This happens when the CPU threads finish calculating the bonded forces and PME faster than the GPU finishes the non-bonded force calculation, even with the shortest possible cut-off. In such cases the CPU will wait for the GPU, and this time will show up as "Wait GPU local" in the cycle and timing summary table at the end of the log file, as shown below.
R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

Computing:           Nodes   Th.   Count   Seconds   G-Cycles      %
-----------------------------------------------------------------------------
Neighbor search          1     4      26     0.145      1.866    5.2
Launch GPU ops.          1     4     501     0.035      0.448    1.2
Force                    1     4     501     0.338      4.349   12.0
PME mesh                 1     4     501     1.365     17.547   48.5
Wait GPU local           1     4     501     0.162      2.083    5.8
NB X/F buffer ops.       1     4    1002     0.128      1.645    4.6
Write traj.              1     4       1     0.180      2.309    6.4
Update                   1     4     501     0.072      0.924    2.6
Constraints              1     4     501     0.322      4.147   11.5
Rest                     1                   0.065      0.833    2.3
-----------------------------------------------------------------------------
Total                    1                   2.811     36.152  100.0
-----------------------------------------------------------------------------
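To check quickly how much time the CPU spent waiting for the GPU, the percentage column of the "Wait GPU local" row can be extracted from the log. A minimal sketch using awk; the helper name wait_gpu_percent is hypothetical, and it assumes the percentage is the last field of the row, as in the table above.

```shell
# Hypothetical helper: read an mdrun log on stdin and print the percentage of
# run time reported in the "Wait GPU local" row (assumed to be the last field).
wait_gpu_percent() {
    awk '/Wait GPU local/ { print $NF }'
}

# Demonstration on two rows from the table above; with a real run the log file
# would be fed in instead, e.g.  wait_gpu_percent < md.log
wait_gpu_percent <<'EOF'
PME mesh                 1     4     501     1.365     17.547   48.5
Wait GPU local           1     4     501     0.162      2.083    5.8
EOF
```

A large value here suggests the GPU is the bottleneck even at the shortest cut-off, which is the situation described above.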
Separate PME nodes
By default, particle-particle (PP) and PME calculations are done in the same process one after another. As PME requires heavy global communication, it is most of the time the limiting factor to scaling on a large number of cores. By designating a subset of nodes for PME only, the performance of parallel runs can improve greatly.
Using separate PME nodes has been possible since version 4.0, and with GROMACS 4.6 using OpenMP multithreading in PME nodes is also possible.
Running simulations
Below are examples that aim to show how the different parallelization schemes can be used. We assume default mdrun options wherever explicit values are not specified. Additionally, the examples use mdrun_mpi (the default suffix) for the MPI-enabled binary and mdrun otherwise. Note that all features available with MPI are also supported with thread-MPI, so whenever "process" or "MPI process" is mentioned, the statement applies to thread-MPI threads as well.
Note that the information below will change as we refine the new features and make them easier to use during the 4.6 pre-beta and beta phase. If something does not work as advertised, feel free to write to the gmx-users mailing list.
MPI, Thread-MPI
Assuming a standard MPI installation with the mpirun tool, a simulation with N processes can be launched with:
    mpirun -np N mdrun_mpi
The equivalent command using thread-MPI is:
    mdrun -nt N
In these cases, when N>=12 mdrun will automatically use separate PME nodes, both with MPI and thread-MPI. To prevent this, use the -npme 0 option.
OpenMP
OpenMP multithreading alone, without domain decomposition, is nearly as efficient as thread-MPI with up to 8-12 threads. To choose the number of threads to use, set the OMP_NUM_THREADS environment variable. Alternatively, the GMX_MAX_THREADS environment variable sets the maximum number of allowed threads per node. So, assuming that there are N cores available, the following two commands are equivalent:
    OMP_NUM_THREADS=N mdrun -nt 1
    GMX_MAX_THREADS=N mdrun -nt 1
Note that the current hardware locality detection and automation of process/thread affinity are implemented in a quite simple and somewhat naive way. With OpenMP, threads get pinned to cores in a machine in a sequential fashion by default. Therefore, if one wants to run multiple processes per node, with the default settings the processes will clash. To get around this issue, one can disable thread pinning by setting the GMX_NO_THREAD_PINNING environment variable and set process affinities manually (e.g. with taskset). The final 4.6 release will have automated hardware locality detection.
Multi-level parallelization: MPI/thread-MPI + OpenMP
Combining MPI/thread-MPI with OpenMP has a considerable overhead. Therefore, at the moment, multi-level parallelization will surpass MPI/thread-MPI-only parallelization only in the case of highly parallel runs and/or a slow network. We refer to the verlet scheme unless explicitly stated otherwise, as only this scheme has full OpenMP support.
Launching M MPI processes with N OpenMP threads each:
    OMP_NUM_THREADS=N mdrun -nt M
    OMP_NUM_THREADS=N mpirun -np M mdrun_mpi
Note that for good performance on multi-socket servers, groups of OpenMP threads belonging to an MPI process/thread-MPI thread should run on the same CPU/socket. This requires that the number of processes is a multiple of the number of CPUs/sockets in the respective machine and that the number of cores per CPU is divisible by the number of threads per process. E.g. on a dual 6-core machine, N=6, M=2 or N=3, M=4 should run more efficiently than N=4, M=3.
Using the GMX_MAX_THREADS=MAXC environment variable it is possible to set the maximum number of threads (and consequently cores) to be used. The partitioning of cores between the processes running on the respective node is done automatically.
    GMX_MAX_THREADS=MAXC mdrun -nt M
    GMX_MAX_THREADS=MAXC mpirun -np M mdrun_mpi
If there are fewer than MAXC cores available per node, the number of available cores detected overrides MAXC.
Note that GMX_MAX_THREADS has the lowest priority for setting the number of OpenMP threads and OMP_NUM_THREADS or GMX_XXX_NUM_THREADS will always override it.
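The socket-alignment rule for M processes with N threads each, given earlier in this section, can be checked before launching a run. A minimal sketch, assuming a hypothetical helper layout_fits_sockets (not part of GROMACS) that only tests the two divisibility conditions stated in the text:

```shell
# Hypothetical helper (not part of GROMACS): succeed if the number of
# processes is a multiple of the number of sockets and the number of cores
# per socket is divisible by the number of threads per process.
layout_fits_sockets() {
    local sockets=$1 cores_per_socket=$2 procs=$3 threads=$4
    [ $(( procs % sockets )) -eq 0 ] && [ $(( cores_per_socket % threads )) -eq 0 ]
}

# Report on a layout for the dual 6-core machine used as the example above.
report_layout() {
    local procs=$1 threads=$2
    if layout_fits_sockets 2 6 "$procs" "$threads"; then
        echo "M=$procs N=$threads: thread groups stay within a socket"
    else
        echo "M=$procs N=$threads: thread groups may span sockets"
    fi
}

report_layout 2 6
report_layout 4 3
report_layout 3 4
```

The check says nothing about oversubscription; it only mirrors the divisibility rule.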
Running separate, multi-threaded PME nodes is supported in both cut-off schemes. To set the number of threads for PME independently from the number of threads in the rest of the code, use the GMX_PME_NUM_THREADS environment variable. As opposed to OMP_NUM_THREADS, which sets the global/default number of threads throughout the simulation, GMX_PME_NUM_THREADS makes it possible to override the default value for PME. While with the verlet scheme it is mandatory to always set the global number of threads if the number of PME threads is set, with the group scheme it is enough to set either OMP_NUM_THREADS or GMX_PME_NUM_THREADS (both affect only PME).
Examples:
    OMP_NUM_THREADS=NT mpirun -np NP_tot mdrun_mpi -npme NP_pme
will run NP_tot processes, of which NP_pme are dedicated to PME, using NT threads for both PP and PME, while
    OMP_NUM_THREADS=NT GMX_PME_NUM_THREADS=NT_pme mpirun -np NP_tot mdrun_mpi -npme NP_pme
will use NT threads in PP nodes and NT_pme threads in PME nodes.
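When choosing NT, NT_pme, NP_tot and NP_pme it helps to know how many threads, and hence cores, the run will occupy in total. A minimal sketch of that arithmetic, assuming a hypothetical helper total_threads (not part of GROMACS) in which PME nodes fall back to NT threads when NT_pme is not given:

```shell
# Hypothetical helper (not part of GROMACS): total number of threads started
# by a run with NP_tot processes, NP_pme of them PME-only, NT threads per PP
# process and NT_pme threads per PME process (defaults to NT if omitted).
total_threads() {
    local nt=$1 np_tot=$2 np_pme=$3
    local nt_pme=${4:-$nt}
    echo $(( (np_tot - np_pme) * nt + np_pme * nt_pme ))
}

total_threads 4 8 2      # 8 processes x 4 threads            -> 32
total_threads 4 8 2 2    # 6 PP x 4 threads + 2 PME x 2 threads -> 28
```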
The above comments on OpenMP and thread pinning apply to multi-level parallelization as well. As in the single-process case, with multiple processes per node threads still get locked to cores in a sequential fashion: the N threads of the first process to the first N cores, the N threads of the second process to cores N+1 to 2N, and so on. As a consequence, using Intel Hyperthreading is not possible unless thread pinning is turned off and the right per-process affinities are set at launch. The final 4.6 release will have automated hardware locality detection.
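The sequential pinning pattern can be written down as a small calculation. A minimal sketch, assuming a hypothetical helper pinned_cores that uses the same 1-based core numbering as the text (tools such as taskset number CPUs from 0, so the values would need to be shifted by one there):

```shell
# Hypothetical helper: print the range of cores (1-based, as in the text)
# that the threads of the given process get locked to when every process
# runs the same number of threads.
pinned_cores() {
    local proc=$1 threads=$2
    echo "$(( (proc - 1) * threads + 1 ))-$(( proc * threads ))"
}

# Two processes with 6 threads each.
pinned_cores 1 6
pinned_cores 2 6
```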
Heterogeneous parallelization: using GPUs
Using GPU acceleration is essentially as simple as compiling mdrun with the CMake variable GMX_GPU=ON and using a tpr file with the verlet scheme on a machine with supported GPU(s). Therefore, all the above instructions regarding OpenMP and MPI/thread-MPI + OpenMP runs also apply to GPU accelerated runs. The restriction is that the number of nodes (PP+PME, or PP-only with separate PME) has to be equal to the number of available GPUs. Consequently, you need to make sure to start as many MPI processes or thread-MPI threads as the number of GPUs intended to be used. For instance, with an 8-core machine and two GPUs the launch command will be:
    mdrun -nt 2
Setting OMP_NUM_THREADS or GMX_MAX_THREADS is optional in this case. However, if one wants to save 2 cores for other purposes, the launch command is prefixed with either of the two settings:
    [OMP_NUM_THREADS=3|GMX_MAX_THREADS=6] mdrun -nt 2
In order to manually specify which GPU(s) mdrun should use, the respective device ID(s) can be passed in the GMX_GPU_ID environment variable as a sequence of digits.
    GMX_GPU_ID=1 mdrun -nt 1 # use the second device
    GMX_GPU_ID=02 mdrun -nt 2 # use the first and third devices, skipping the second
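Since GMX_GPU_ID is a sequence of single-digit device IDs, it can be read one character at a time. The sketch below prints which device each process would use; the helper name list_gpu_ids is hypothetical, and the mapping of the i-th process within the node to the i-th digit is an assumption based on the sequential assignment described in this section.

```shell
# Hypothetical helper: split a GMX_GPU_ID-style digit string into device IDs
# and print one "process -> GPU" line per digit (assumed sequential mapping).
list_gpu_ids() {
    local ids=$1 proc=0 digit
    while [ -n "$ids" ]; do
        digit=${ids%"${ids#?}"}   # first character of the remaining string
        ids=${ids#?}              # drop that character
        echo "process $proc -> GPU $digit"
        proc=$(( proc + 1 ))
    done
}

# Two processes using devices 0 and 2.
list_gpu_ids 02
```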
Currently the automation of GPU to process assignment is fairly primitive: processes pick the GPUs sequentially, meaning that the process ID within the node will match the GPU ID. Although this scheme works well in most cases, it doesn't allow for a flexible choice and assignment of GPUs based on their hardware locality, and performance losses might occur. The final 4.6 release will have automated hardware locality detection.