This article recently appeared in Issue 29 of Parallel Universe Magazine
Once the mainstay of high-performance computing, vectorization has reemerged with new importance given the increasing width of SIMD (vector) registers on each processor core: four doubles with AVX2, eight with AXV-512. Increased vector lengths create opportunities to boost performance in certain classes of software (e.g., large-scale finite element analysis to simulate the nonlinear behavior of mechanical systems). The open source research code, WARP3D, also on GitHub, focuses on such simulations to support the development of safety and life prediction methodologies for critical components found typically in energy production and related industries. The discipline emphasis (3-D nonlinear solids), open source, and manageable code size (1,400 routines) appeal to researchers in industry and academia.
WARP3D reflects the current hierarchy of parallel computing: MPI* to access multiple nodes in a cluster, threads via OpenMP* on each node, and vectorization within each thread. Realistic simulations to predict complex behavior (e.g., the internal stresses created in welding a large component) require many hours of parallel computation on current computers. The pervasive matrix operations to achieve solutions map very well onto this hierarchy of parallel computing. Increased use of vectorization at the core level provides a direct multiplying effect on performance across concurrently executing threads and MPI ranks.
This article provides a brief view into vectorization as implemented in WARP3D. Developed continuously over the last 20 years, the code adopts many of the evolving features in Fortran. We employ components of Intel® Parallel Studio XE (Intel® Fortran Compiler, Intel® Math Kernel Library, and Intel® MPI Library ) to build executables included with the open source distribution for Linux*, Windows*, and MacOS*. WARP3D follows an implicit formulation of the nonlinear finite element method that necessitates solutions for very large, evolving sets of linear equations—the PARDISO* and CPARDISO* packages in Intel Math Kernel Library provide exceptional performance. Examples described here also illustrate the new Intel® Advisor 2017 with roofline analysis that enables more convenient and in-depth exploration of loop-level performance.
Figure 1 illustrates the decomposition of a 3-D finite element model for a flange into domains of elements (one domain/MPI rank) with elements in a domain assigned to blocks. Each discrete element in the model has an associated number of generally small permanent and temporary arrays (e.g., 6x6, 24x24, 60x60, 6x60), leading to millions of such arrays present during the solution of moderate- to large-sized models.
Figure 2A shows a conventional, multilevel array of structures (AoS) for storage of element data, with the various arrays for a single element likely contiguous in memory. While conceptually convenient, the comparatively small, element-level matrices limit inner-loop vector lengths and provide no mechanism to tune for cache size effects while processing the same array type across all elements.
The alternative, blocked data structure, as shown in Figure 2B for the element stiffness matrices Ke and Be (as examples), defines a structure of arrays (SoA) scheme using dense 2-D and 3-D arrays with the leading dimension set by the number of elements assigned to the block, termed span in WARP3D. All the Ke matrices for elements in a block, and separately the Be matrices, are thus contiguous in memory.
With a master list setting block ownership by domains, the following sequence illustrates the concept of combined MPI plus thread parallelism in WARP3D:
c$OMP PARALLEL DO PRIVATE( blk )
do blk = 1, number_ele_blks
if( blk_to_domain_map(blk) .ne. myrank ) cycle
call process_a_blk( blk )
c$OMP END PARALLEL DO
WARP3D employs a simple manager-worker design for MPI to synchronize tasks across ranks. In the above, the manager (on rank 0) ensures that all workers execute the OpenMP parallel loop to process the blocks they own. The process_a_blk routine dynamically allocates working versions of blocked arrays as needed in a derived type, declared local to the routine and thus local in each thread. The generally complex, multiple levels of computational routines to build/update element matrices use vector-style coding—meaning they are unaware of the OpenMP and MPI runtime environments. The data independence of element-based computations implied here represents a key feature of the finite element method. Global arrays are accessed read-only by the threads; the few block→gather operations on one vector specify $OMP ATOMIC UPDATE for synchronization. All threading operations in WARP3D are accomplished with only a dozen loops/directives of the type shown above.
The introduction of Cray* vector-based computers motivated the development of the blocking approach described above in finite element codes first created by the U.S. national laboratories. As these vector-based computers became extinct, the codes evolved to process element blocks in parallel using threads and then across MPI ranks, with blocking retained as a means to tune for effects of various cache architectures.
An example here illustrates essential features and performance of the conventional AoS and blocking SoA approaches. A sequence of matrix multiplications updates an array for each element in the model. For a single element, matrices Be, De, and Ke in this illustration have sizes 6x60, 6x6, 60x60 (all doubles and dense). Figure 2B shows the blocked storage scheme where the large arrays are allocatable components of a Fortran derived type. In models having different element types, and thus matrix sizes, WARP3D creates homogeneous blocks with all elements of the same type. Updates to Ke have the form: Ke ← Ke + transpose(Be)DeBe. This update occurs 300M to 1B times during various simulations of the flange (De Be also change for every update). In this example, De and Ke are symmetric, thus reducing computation to only lower-triangle terms.
Figure 3A shows essential code for the conventional AoS scheme. The computational outer loop cycles sequentially over all elements in the model with different versions of the called subroutine for symmetry, asymmetry, etc., while inner loops in the subroutine cycle over the much smaller dimensions of a single element matrix. The Fortran intrinsic matmul at line 19 performs the first matrix product, with loops over the comparatively small element arrays to compute final updates for only the lower-triangle terms of Ke. The compiler unrolls the k=1,6 inner loop at lines 23 to 25 and appears to replace matmul with equivalent source code which is then compiled—based on the extensive messages about loop interchanges in the annotated optimization report issued at line 19. [Editor’s note: See “Vectorization Opportunities for Improved Performance with Intel® AVX-512” in The Parallel Universe issue 27 for advice on generating and interpreting compiler optimization reports.]
Code for the blocked SoA scheme in Figure 3B runs computational loops in this order:
The first set of loops at lines 10 to 28 computes the product db = DeBe in a provided workspace for all elements in the block. A straightforward implementation would specify a four-level loop structure. Here, the two innermost loops have manual unrolling that essentially forces the compiler to vectorize the generally longer i=1, span loop. Other arrangements explored for these loops increase runtimes; the compiler does not know the iteration count for each loop, which limits the effectiveness of loop reordering. The second set of loops at lines 30 to 42 performs the update Ke ← Ke + transpose(Be)db. Here, utsz=1820 with index vector icp to map lower-triangle terms for an element onto columns of the blocked Ke. Again, the innermost loop has been manually unrolled for vectorization i=1,span.
Both sets of loops in this routine, lines 10 to 28 and 30 to 42, have only unit stride array references and long (span) lengths set up for vectorization, a characteristic feature throughout the WARP3D code. These are generally the most time-consuming loops for element-level processing in WARP3D; multiple redesigns of the loops and array organizations have occurred over the years with the evolution of compiler optimization and processor capabilities.
Figure 4 summarizes the relative execution performance for both approaches running on a single thread, with the AoS scheme assigned a unit value. Numbers in parentheses indicate the size for arrays in the compute routines. Relevant compiler options are: ifort (17.0.2) with -O3 –xHOST –align array64byte –ftz. The computer has dual Intel® Xeon® E5-2698 v3 processors (with AVX2) with no other user software running during the tests. This processor has cache sizes L1 = 32 KB/core, L2 = 256 KB/core, and L3 = 40 MB shared. Driver code and compute routines reside in separate files to prevent inlining effects. The runtimes include efforts to call the computation routines but not driver setup times. The driver program allocates thousands of array sets for both AoS and SoA testing. On each call, the driver passes a randomly selected set to the compute routine, in all cases processing 32 M elements. Multiple runs, each completing in 2 to 4 minutes, reveal a 1 to 2 percent variability in measured CPU.
Only a single result is needed for the AoS scheme; there are no adjustable parameters. The number of elements per block (span) varies in the SoA scheme with a compensating number of blocks to maintain the identical number of elements processed in each case. The SoA results show an expected strong effect of block size driven by the interaction of vector processing lengths and L1, L2, and L3 cache utilization. Runtimes reach a minimum of 37 percent of the AoS time for block sizes of 64 and 32 elements. At 16 elements/block, runtimes begin to increase. For all block sizes, the arrays fit within the L3 cache; the much smaller, single-element arrays of AoS almost fit within the L2 data cache.
Figure 5 shows output from the roofline analysis performed by Intel Advisor 2017 with results from the key cases manually superposed onto a single plot for comparison. Note the log scale on both axes. The roofline analysis with the graphical tools as implemented in Intel Advisor provides a convenient mechanism to screen the performance of key loops across large codes (note the hamburger menu in upper-right corner). Pointer movement over the symbols creates pop-ups with loop location in source code, GFLOPS, and floating-point operations per byte of L1 cache. Clicking on the symbol brings up source code for the loop annotated with additional performance information (e.g., if peel and remainder loops were executed for the loop).
During the inner loop of the Ke update at lines 33 to 41, the best performance of 8.08 GFLOPS for block size (span) = 64 resides effectively on the roofline for the L2 cache but well below the peak rate for Fused-Multiply-Add (FMA) vector instructions, which are the kernel for this loop. The inner loop of the DeBe computation at lines 11 to 27 gains a surprising 3.7x performance boost when the block size decreases from 1024→128 but there is no further gain for smaller blocks. The longest running loop at lines 33 to 41 gains a 2x performance increase for block size reductions 1024→64.
As expected from the timing results (Figure 4), GFLOPS and cache utilization for AoS shown by the blue filled and outlined squares are smaller than those for the blocked SoA. Replacement of code in the routine with the direct ek = ek + matmul(transpose(b),matmul(d,b)) increases runtime by 30 percent above the version in Figure 3A.
The best performance of 8.08 GFLOPS relative to the double-precision FMA peak of 51.4 GFLOPS shown on the roofline derives from the small number of floating point operations per memory access. This is a well-known characteristic of element-level computations inherent in the finite element method. Consider the key inner loop at lines 33 to 41, which has 12 double operations (6 + and 6 *) and 13 references to double-precision array terms. The compute intensity per byte is then, 12/(13*8) = 0.11, the same value reported in the roofline analysis as FLOP/L1-byte. This value remains unchanged with block size. For reference, the figure also indicates the best performing routine in MKL BLAS as displayed by the roofline analysis during equation solving in WARP3D executions (single thread).
A few final points from the benchmarks:
option –align array64byteneeded for best performance on processors with AXV-512 instructions. Thirty-two-byte alignment suffices for the AVX2 capable processor used here. Nevertheless, with the separate compilation of the driver.f and routines.f files representative of build processes, the compiler is unaware of the actual alignment of the formal array arguments to the subroutine bdbtSoA. These adjustable arrays in the Fortran context guarantee to the compiler that array elements are contiguous in memory, but not necessarily with a specific alignment. The annotated source listing shows unaligned access to all arrays as expected.
Experiments with source code directives of the type !DIR$ ASSUME_ALIGNED b(1,1,1):64, d(1,1,1):64, ... or, separately, !DIR$ VECTOR ALIGNED before the two inner loops yield no measurable improvement in performance. The compiler listing indicates aligned access for all arrays.
The explanation may lie in the Intel Advisor messages that so-called peel and remainder loops, used to execute separately the initial loop iterations until access becomes aligned, are not executed here even without source code directives. At runtime, the alignment of arrays before entering the loop is easily determined to be 64 bytes in all cases here from the –align array64byte option in compiling driver.f.
Unit stride. A simple experiment demonstrates the critical significance of stride-1 array references on performance (at least on this processor). A switch in design/use of array ek_symm(span,utsz) to ek_symm(utsz,span) with ek_symm(i,j) changed to ek_symm(j,i) at line 34 reduces performance from 8.08 GFLOPS to 2.39 GFLOPS.
Vectorization offers potential speedups in codes with significant array-based computations—speedups that amplify the improved performance obtained through higher-level, parallel computations using threads and distributed execution on clusters. Key features for vectorization include tunable array sizes to reflect various processor cache and instruction capabilities and stride-1 accesses within inner loops.
The importance of vectorization to increase performance will continue to grow as hardware designers extend the number of vector registers to eight doubles (and hopefully more) on emerging processors to overcome plateauing clock rates and thread scalability.
Intel Advisor, especially the new roofline analysis capability, provides relevant performance information and recommendations implemented in a convenient GUI at a level suitable for code developers with differing levels of expertise.
The author gratefully acknowledges the many, key contributions of his former graduate students, postdocs, and other collaborators worldwide toward making WARP3D a useful code for researchers. See warp3d.net for a complete list and sponsors. These latest efforts toward improving code performance are in collaboration with Drs. Kevin Manalo and Karen Tomko, supported by the Intel® Parallel Computing Center at the Ohio Supercomputer Center.
Robert H. Dodds Jr., PhD, NAE, is the M.T. Geoffrey Chair (Emeritus) of Civil Engineering at the University of Illinois at Urbana–Champaign. He currently is a research professor in the Department of Civil Engineering, University of Tennessee, Knoxville.