This article recently appeared in Issue 31 of The Parallel Universe magazine.
Automated driving workloads include several matrix operations at their core. Sensor fusion and localization algorithms, such as different versions of the Kalman* filter, are critical components in the automated driving software pipeline. The Intel® Math Kernel Library (Intel® MKL) is a powerhouse of tuned subprograms for numerous math operations, including a fast DGEMM. The automated driving developer community typically uses Eigen* [1], a C++ math library, for matrix operations. In addition to Intel MKL, LIBXSMM* [2, 3], a highly tuned library for high-performance matrix-matrix multiplication, shows potential to speed up matrix operations. In this article, we investigate and improve the performance of native Eigen on matrix multiplication benchmarks and the extended Kalman filter (EKF) [5] by using Intel MKL and LIBXSMM with GNU* and Intel® compilers on the Intel® Xeon® processor [4, 5].
The automated driving pipeline consists of a series of computational blocks: perception, which acquires information on the driving environment from sensors such as cameras, RADARs, and LIDARs; sensor fusion and localization; path planning; and finally actuation of vehicle controls such as steering angle and throttle. Performance optimizations across the entire software pipeline are crucial for meeting strict end-to-end latency requirements. Each component of the pipeline is typically assigned a tight latency budget that needs to be met almost 100 percent of the time. In this study, we focus on speeding up the EKF, an important component of sensor fusion and localization.
The EKF is a simple yet extremely powerful algorithm that makes predictions about the state of the vehicle (e.g., Cartesian position coordinates, velocities, yaw angle). Over several iterations, the EKF repeats two consecutive steps: a prediction step, which propagates the state and covariance estimates forward in time, and an update step, which corrects them with a new sensor measurement.
In particular, this algorithm predicts the position of the vehicle (px, py) and its velocity (vx, vy) from noisy LIDAR and RADAR sensor measurements. The coupled estimate of the vehicle's position obtained by fusing RADAR and LIDAR is more accurate than an estimate from either noisy sensor alone. LIDAR measurements that localize an object are defined in Cartesian coordinate form, (px, py). RADAR measurements are typically in polar coordinate form and can be converted to Cartesian coordinates, forming measurements that are at a lower resolution than those from LIDAR [6].
$$
\begin{aligned}
x' &= F x + u && \text{(predicted state estimate)} \\
P' &= F P F^{T} + Q && \text{(predicted covariance estimate)} \\
y &= z - H x' && \text{(innovation or measurement residual)} \\
S &= H P' H^{T} + R && \text{(innovation, or residual, covariance)} \\
K &= P' H^{T} S^{-1} && \text{(near-optimal Kalman gain)} \\
x &= x' + K y && \text{(updated state estimate)} \\
P &= (I - K H) P' && \text{(updated covariance estimate)}
\end{aligned}
$$
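The predict and update equations above map almost one-to-one onto small matrix products. The following is a minimal sketch in standard C++, not the code used in this study: the `Mat` type, the helper functions, and `FilterState` are hypothetical stand-ins for the corresponding Eigen types, and the update step assumes a linear LIDAR-style measurement of (px, py) so that S is 2x2 and can be inverted in closed form. A RADAR update in a full EKF would instead use a nonlinear measurement function and its Jacobian.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Minimal fixed-size, row-major matrix (hypothetical stand-in for a library type).
template <std::size_t R, std::size_t C>
struct Mat {
    std::array<double, R * C> v{};  // zero-initialized
    double& operator()(std::size_t r, std::size_t c) { return v[r * C + c]; }
    double operator()(std::size_t r, std::size_t c) const { return v[r * C + c]; }
};

template <std::size_t R, std::size_t K, std::size_t C>
Mat<R, C> operator*(const Mat<R, K>& a, const Mat<K, C>& b) {
    Mat<R, C> out;
    for (std::size_t i = 0; i < R; ++i)
        for (std::size_t p = 0; p < K; ++p)
            for (std::size_t j = 0; j < C; ++j) out(i, j) += a(i, p) * b(p, j);
    return out;
}

template <std::size_t R, std::size_t C>
Mat<R, C> operator+(Mat<R, C> a, const Mat<R, C>& b) {
    for (std::size_t i = 0; i < R * C; ++i) a.v[i] += b.v[i];
    return a;
}

template <std::size_t R, std::size_t C>
Mat<R, C> operator-(Mat<R, C> a, const Mat<R, C>& b) {
    for (std::size_t i = 0; i < R * C; ++i) a.v[i] -= b.v[i];
    return a;
}

template <std::size_t R, std::size_t C>
Mat<C, R> transpose(const Mat<R, C>& a) {
    Mat<C, R> t;
    for (std::size_t i = 0; i < R; ++i)
        for (std::size_t j = 0; j < C; ++j) t(j, i) = a(i, j);
    return t;
}

template <std::size_t N>
Mat<N, N> identity() {
    Mat<N, N> m;
    for (std::size_t i = 0; i < N; ++i) m(i, i) = 1.0;
    return m;
}

// Closed-form inverse of a 2x2 matrix; asserts that it is not singular.
inline Mat<2, 2> inverse2x2(const Mat<2, 2>& s) {
    const double det = s(0, 0) * s(1, 1) - s(0, 1) * s(1, 0);
    assert(std::fabs(det) > 1e-12);
    Mat<2, 2> inv;
    inv(0, 0) = s(1, 1) / det;
    inv(0, 1) = -s(0, 1) / det;
    inv(1, 0) = -s(1, 0) / det;
    inv(1, 1) = s(0, 0) / det;
    return inv;
}

// State is (px, py, vx, vy) with its 4x4 covariance.
struct FilterState {
    Mat<4, 1> x;
    Mat<4, 4> P;
};

// Prediction step: x' = F x + u, P' = F P F^T + Q.
inline void predict(FilterState& s, const Mat<4, 4>& F, const Mat<4, 1>& u,
                    const Mat<4, 4>& Q) {
    s.x = F * s.x + u;
    s.P = F * s.P * transpose(F) + Q;
}

// Update step for a linear 2-D position measurement z = (px, py).
inline void update(FilterState& s, const Mat<2, 1>& z, const Mat<2, 4>& H,
                   const Mat<2, 2>& R) {
    const Mat<2, 1> y = z - H * s.x;                         // innovation
    const Mat<2, 2> S = H * s.P * transpose(H) + R;          // innovation covariance
    const Mat<4, 2> K = s.P * transpose(H) * inverse2x2(S);  // Kalman gain
    s.x = s.x + K * y;
    s.P = (identity<4>() - K * H) * s.P;
}
```

Every line of `predict` and `update` is a chain of products on matrices no larger than 4x4, which is why the speed of small matrix-matrix multiplication dominates the cost of the filter.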
Intel MKL provides highly optimized, threaded, and vectorized math functions that maximize performance on Intel® processor architectures. It is compatible with many different compilers, languages, operating systems, and linking and threading models. Important for our purposes, it provides a highly tuned DGEMM function for matrix-matrix multiplication. To eliminate overhead from additional error checking for DGEMM on small matrices, Intel MKL provides the -DMKL_DIRECT_CALL compiler flag to guarantee that the fastest code path is used at runtime [7].
Eigen is an open-source, easy-to-use C++ library that provides operations ranging from matrix math to geometry algorithms. It enables vectorization across different levels of SSE and AVX. Eigen can take advantage of Intel MKL through the -DEIGEN_USE_MKL_ALL compiler flag.
LIBXSMM is an open-source, high-performance library tuned for fast matrix-matrix multiplication on very small matrix sizes. LIBXSMM generates just-in-time (JIT) code for small matrix-matrix multiplication kernels for various instruction sets including SSE, AVX, AVX2, and AVX-512. LIBXSMM is best suited for matrices where (M*N*K)^(1/3) is less than 80. LIBXSMM achieves high performance through its modular design, specifically a separate frontend (high-level language and routine selection) and a backend for xGEMM code generation [2]. LIBXSMM provides a simple interface for calling S/DGEMM, so it can be integrated into an application with very little effort.
Figures 1, 2, and 3 show three modes in which LIBXSMM can be used for matrix multiplication.
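For readers less familiar with BLAS, DGEMM computes C = alpha * A * B + beta * C on double-precision matrices. The sketch below is a plain, unoptimized reference version in standard C++ that only illustrates the operation. The function name `reference_dgemm` is hypothetical, and the sketch assumes column-major storage with no transposition and leading dimensions equal to the row counts, which is a simplification of the full BLAS interface.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference DGEMM: C = alpha * A * B + beta * C.
// A is m x k, B is k x n, C is m x n, all stored column-major.
inline void reference_dgemm(std::size_t m, std::size_t n, std::size_t k,
                            double alpha, const std::vector<double>& a,
                            const std::vector<double>& b, double beta,
                            std::vector<double>& c) {
    assert(a.size() == m * k && b.size() == k * n && c.size() == m * n);
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t i = 0; i < m; ++i) {
            double acc = 0.0;
            // Dot product of row i of A with column j of B.
            for (std::size_t p = 0; p < k; ++p) acc += a[i + p * m] * b[p + j * k];
            c[i + j * m] = alpha * acc + beta * c[i + j * m];
        }
    }
}
```

A tuned library computes the same result but blocks, vectorizes, and (for larger sizes) threads these loops. For very small matrices the fixed cost of argument checking and dispatch becomes a noticeable fraction of the call, which is the overhead that the direct-call path is meant to avoid.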
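The size guideline above can be turned into a quick check before choosing a multiplication backend. This is a minimal sketch: the helper `is_small_gemm` is a hypothetical name and is not part of the LIBXSMM API. It tests M*N*K < 80^3, which is equivalent to the cube root of M*N*K being below 80 but avoids floating-point rounding at the boundary.

```cpp
#include <cassert>
#include <cstdint>

// Returns true when (m*n*k)^(1/3) < limit, evaluated as m*n*k < limit^3.
inline bool is_small_gemm(std::int64_t m, std::int64_t n, std::int64_t k,
                          std::int64_t limit = 80) {
    assert(m > 0 && n > 0 && k > 0 && limit > 0);
    return m * n * k < limit * limit * limit;
}
```

For example, the 4x4 and 2x4 products that appear in a four-state Kalman filter fall far below this bound, while a 128x128x128 product does not.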
During installation, LIBXSMM can be built explicitly for:
void libxsmm_smm(int m, int n, int k, const float* a, const float* b, float* c);
void libxsmm_dmm(int m, int n, int k, const double* a, const double* b, double* c);
Figure 1. Automatically dispatched matrix multiplication API in LIBXSMM
void libxsmm_simm(int m, int n, int k, const float* a, const float* b, float* c);
void libxsmm_dimm(int m, int n, int k, const double* a, const double* b, double* c);
Figure 2. Non-dispatched matrix multiplication API in LIBXSMM
void libxsmm_sblasmm(int m, int n, int k, const float* a, const float* b, float* c);
void libxsmm_dblasmm(int m, int n, int k, const double* a, const double* b, double* c);
Figure 3. LIBXSMM API for matrix multiplication using BLAS [8]
In its original form, Eigen does not use Intel MKL for small matrix multiplication (specifically, when M+N+K is less than 20). To allow Eigen to call the DGEMM function in Intel MKL, we modify the Eigen source code to eliminate the M+N+K<20 heuristic and permit calls to Intel MKL DGEMM for all matrix sizes.
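The effect of this change can be pictured as a size-based dispatch. The sketch below is a hypothetical model of the heuristic described above, not Eigen's actual source: the names `ProductPath` and `choose_product_path` are assumptions introduced for illustration. Setting the threshold to 0 models the modified library, in which every product is routed to the GEMM path.

```cpp
#include <cassert>

enum class ProductPath { CoefficientBased, Gemm };

// Small products (m + n + k below the threshold) stay on the built-in
// coefficient-based path; everything else goes to the GEMM path, which can
// be backed by an external DGEMM.
inline ProductPath choose_product_path(int m, int n, int k, int threshold = 20) {
    assert(m >= 0 && n >= 0 && k >= 0);
    return (m + n + k < threshold) ? ProductPath::CoefficientBased
                                   : ProductPath::Gemm;
}
```

With the default threshold, a 4x4 by 4x4 product (M+N+K = 12) never reaches the external DGEMM, which is exactly the size range that matters for the EKF.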
To enable LIBXSMM in Eigen, we replace Eigen’s native matrix-matrix multiplication implementation with a call to libxsmm_dgemm.
We examine the performance of two workloads that use Eigen: a DGEMM matrix multiplication benchmark and the EKF.
We use native Eigen, Eigen with Intel MKL, and Eigen with LIBXSMM in these experiments. All benchmarks are executed in serial.
Table 2 details our library and compiler versions and hardware specifications.
In this DGEMM benchmark, our figure of merit is the improvement in performance (measured in gigaflops) over native Eigen with g++. With the exception of matrix sizes 2 and 4, both Eigen with Intel MKL and Eigen with LIBXSMM provide a speedup over native Eigen across all classes of matrices. It is interesting to note that native Eigen has the lowest performance, regardless of whether it is compiled with GNU or Intel® compilers (Figures 4 and 5). In terms of performance improvement, the overall trend is that:
We evaluate EKF by using native Eigen, Eigen with Intel MKL, and Eigen with LIBXSMM. From our earlier DGEMM benchmarking, we see that g++ provides higher performance for matrix sizes less than 13. Since EKF works on smaller matrices, we evaluate speedup in EKF using g++. Our baseline for evaluating speedup is EKF that uses native Eigen. Our figure of merit is the median time to predict and update each sensor measurement (a total of 10,000 sensor measurements were processed). As shown in Figure 6, incorporating Intel MKL or LIBXSMM can produce a speedup of approximately 1.2X in EKF.
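The two figures of merit used in these experiments are simple to compute from raw timings. The following is a minimal sketch with hypothetical helper names (`median`, `speedup`, `gemm_gflops`), not the benchmark harness used in this study. It assumes per-measurement times have already been collected in seconds and uses the conventional count of 2*M*N*K floating-point operations for one matrix-matrix multiplication.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median of a set of timing samples (takes a copy so the caller's data is untouched).
inline double median(std::vector<double> samples) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    const std::size_t mid = samples.size() / 2;
    return samples.size() % 2 == 1 ? samples[mid]
                                   : 0.5 * (samples[mid - 1] + samples[mid]);
}

// Speedup of a candidate over a baseline, based on median time per measurement.
inline double speedup(const std::vector<double>& baseline,
                      const std::vector<double>& candidate) {
    return median(baseline) / median(candidate);
}

// Achieved gigaflops for one m x n x k multiplication that took `seconds`.
inline double gemm_gflops(std::size_t m, std::size_t n, std::size_t k,
                          double seconds) {
    assert(seconds > 0.0);
    return 2.0 * static_cast<double>(m) * static_cast<double>(n) *
           static_cast<double>(k) / seconds * 1e-9;
}
```

The median is a reasonable choice for per-measurement latency because it is less sensitive than the mean to occasional outliers such as the first, cold-cache iteration.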
In this article, we concentrated on speeding up the EKF, a common automated driving workload used for sensor fusion and localization. We investigated this performance improvement on the Intel Xeon processor in two ways:
We showed a maximum speedup of 3.1X over native Eigen from using Eigen+LIBXSMM with the Intel C++ compiler. We also improved EKF performance by using Intel MKL and LIBXSMM, producing a speedup of approximately 1.2X.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance.
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
For more information regarding performance and optimization choices in Intel® Software Development Products, see our Optimization Notice.
Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
*Other names and brands may be claimed as the property of others.