This article recently appeared in Issue 31 of The Parallel Universe magazine.

Automated driving workloads have several matrix operations at their core. Sensor fusion and
localization algorithms, such as different versions of the Kalman* filter, are critical components
in the automated driving software pipeline. The Intel® Math Kernel Library (Intel® MKL) is a
powerhouse of tuned subprograms for numerous math operations, including a fast DGEMM. The
automated driving developer community typically uses Eigen*^{1}, a C++ math library, for matrix
operations. In addition to Intel MKL, LIBXSMM*^{2,3}, a highly tuned library for high-performance
matrix-matrix multiplications, shows potential to speed up matrix operations. In this article, we
investigate and improve the performance of native Eigen on matrix multiplication benchmarks and
the extended Kalman filter (EKF)^{5} by using Intel MKL and LIBXSMM with GNU* and Intel® compilers
on the Intel® Xeon® processor^{4,5}.

The automated driving pipeline is a series of computational blocks: perception, which acquires information about the driving environment from sensors such as cameras, RADAR, and LIDAR; sensor fusion and localization; path planning; and finally actuation of vehicle controls such as steering angle and throttle. Performance optimization across the entire software pipeline is crucial for meeting strict end-to-end latency requirements. Each component of the pipeline is typically assigned a tight latency budget that must be met almost 100 percent of the time. In this study, we focus on speeding up the EKF, an important component of sensor fusion and localization.

The EKF is a simple yet extremely powerful algorithm that makes predictions about the state of the vehicle (e.g., Cartesian position coordinates, velocities, yaw angle). The EKF repeats two consecutive steps over several iterations:

- The prediction step estimates values of current variables and their uncertainties based on motion models, including changes in values over time.
- The update step occurs when the next set of measurements is received from the sensors. This phase updates the predicted estimates based on one important factor: the weighted average of the predicted estimate and the estimate from the current measurement. Higher weights imply lower uncertainty^{6}.

In particular, this algorithm predicts the position of the vehicle (px, py) and its velocity (vx, vy) from
noisy LIDAR and RADAR sensor measurements. The coupled estimate of the vehicle's position obtained by
fusing RADAR and LIDAR is more accurate than an estimate from either noisy sensor alone. LIDAR
measurements that localize an object are given in Cartesian coordinate form, (px, py). RADAR
measurements are typically in polar coordinate form and can be converted to Cartesian coordinates, forming
measurements that are at a lower resolution than those from LIDAR^{6}.
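The polar-to-Cartesian conversion mentioned above is a one-line trigonometric mapping. The helper below is a hypothetical illustration (the function name and signature are our own, not part of the benchmarked EKF code):

```cpp
#include <cmath>

// Hypothetical helper: map a RADAR measurement in polar form
// (range rho, bearing phi) to the Cartesian position (px, py)
// that the filter's state uses.
void polar_to_cartesian(double rho, double phi, double& px, double& py) {
    px = rho * std::cos(phi);
    py = rho * std::sin(phi);
}
```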

**Table 1** shows the vectors and matrices the EKF uses to represent different states and estimates^{4,5}.

**Predict**

x' = F * x + u (Predicted state estimate)

P' = F * P * F^T + Q (Predicted covariance estimate)

**Measurement Update**

y = z - H * x' (Innovation, or measurement residual)

S = H * P' * H^T + R (Innovation, or residual, covariance)

K = P' * H^T * S^{-1} (Near-optimal Kalman gain)

x = x' + K * y (Updated state estimate)

P = (I - K * H) * P' (Updated covariance estimate)
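To make the flow of these equations concrete, here is a minimal scalar sketch: every matrix collapses to a scalar, so transposes and inverses become plain multiplication and division. The struct name and noise values are illustrative assumptions, not the EKF implementation benchmarked below.

```cpp
// Minimal scalar Kalman filter sketch. With scalars, F*P*F^T + Q
// becomes F*P*F + Q and the gain K = P'*H / (H*P'*H + R).
struct ScalarKF {
    double x;  // state estimate
    double P;  // covariance estimate

    void predict(double F, double u, double Q) {
        x = F * x + u;      // x' = F*x + u
        P = F * P * F + Q;  // P' = F*P*F^T + Q
    }

    void update(double z, double H, double R) {
        double y = z - H * x;      // innovation (measurement residual)
        double S = H * P * H + R;  // innovation covariance
        double K = P * H / S;      // near-optimal Kalman gain
        x = x + K * y;             // updated state estimate
        P = (1.0 - K * H) * P;     // updated covariance estimate
    }
};
```

Each update shrinks the covariance P, reflecting the reduced uncertainty after weighting the prediction against the new measurement.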

Intel MKL provides highly optimized, threaded, and vectorized math functions that maximize performance on
Intel® processor architectures. It is compatible across many different compilers, languages, operating systems,
linking, and threading models. Important for our purposes, it provides a highly tuned DGEMM function for
matrix-matrix multiplication. To eliminate overhead from additional error checking for DGEMM on small
matrices, Intel MKL provides the -DMKL_DIRECT_CALL compiler flag to guarantee that the fastest code path
is used at runtime^{7}.
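As a sketch, enabling the direct-call path is a matter of defining the macro at compile time and linking against Intel MKL. The file name and the single-dynamic-library linking model below are assumptions; consult the Intel MKL Link Line Advisor for a configuration-specific link line.

```shell
# Hypothetical build line: -DMKL_DIRECT_CALL skips extra error checking
# for small GEMMs; -lmkl_rt links MKL's single dynamic library.
g++ -O2 -DMKL_DIRECT_CALL dgemm_bench.cpp -lmkl_rt -o dgemm_bench
```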

Eigen is an open-source, easy-to-use C++ library that provides operations ranging from matrix math to geometry algorithms. It enables vectorization across different levels of SSE and AVX. Eigen can take advantage of Intel MKL through the -DEIGEN_USE_MKL_ALL flag.

LIBXSMM is an open-source, high-performance library tuned for fast matrix-matrix multiplication on very
small matrix sizes. LIBXSMM generates just-in-time (JIT) code for small matrix-matrix multiplication kernels
for various instruction sets including SSE, AVX, AVX2, and AVX-512. LIBXSMM is best suited for matrices where
(M*N*K)^{1/3} is less than 80. LIBXSMM provides high performance through its modular design: specifically,
a separate frontend (high-level language and routine selection) and backend for xGEMM code generation^{2}.
LIBXSMM provides a simple S/DGEMM interface that integrates into an application with very little effort.

**Figures 1, 2,** and **3** show three modes in which LIBXSMM can be used for matrix multiplications.

During installation, LIBXSMM can be built explicitly for:

- Particular M, N, and K values
- Leading dimension values that differ from M, N, and K values
- Specific values of α and β

void libxsmm_smm(int m, int n, int k, const float* a, const float* b, float* c);

void libxsmm_dmm(int m, int n, int k, const double* a, const double* b, double* c);

**1. Automatically dispatched matrix multiplication API in LIBXSMM**

void libxsmm_simm(int m, int n, int k, const float* a, const float* b, float* c);

void libxsmm_dimm(int m, int n, int k, const double* a, const double* b, double* c);

**2. Non-dispatched matrix multiplication API in LIBXSMM**

void libxsmm_sblasmm(int m, int n, int k, const float* a, const float* b, float* c);

void libxsmm_dblasmm(int m, int n, int k, const double* a, const double* b, double* c);

**3. LIBXSMM API for matrix multiplication using BLAS^{8}**
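The semantics of these calls can be pictured with a plain C++ stand-in that mirrors the `libxsmm_dmm` prototype above: C += A * B for an M×K matrix A and a K×N matrix B. This is only a reference sketch (the real library JIT-compiles a tuned kernel, and row-major storage is assumed here purely for illustration):

```cpp
// Reference kernel with the same shape as the libxsmm_dmm prototype:
// accumulates A (m x k) times B (k x n) into C (m x n), row-major.
void ref_dmm(int m, int n, int k, const double* a, const double* b, double* c) {
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            double acc = c[i * n + j];       // C += A*B (accumulate)
            for (int p = 0; p < k; ++p)
                acc += a[i * k + p] * b[p * n + j];
            c[i * n + j] = acc;
        }
    }
}
```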

In its original form, Eigen does not use Intel MKL for small matrix multiplication (specifically, when M+N+K is less than 20). To allow Eigen to call the DGEMM function in Intel MKL, we modify the Eigen source code to eliminate the M+N+K<20 heuristic and permit calls to Intel MKL DGEMM for all matrix sizes.
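The heuristic can be pictured as a small predicate (a hypothetical helper name; the real logic lives inside Eigen's product implementation, and our modification effectively makes it return true for every size):

```cpp
// Sketch of the size heuristic described above: in unmodified Eigen,
// products with M+N+K < 20 stay on the native path instead of being
// dispatched to a BLAS backend such as Intel MKL.
bool dispatch_to_blas(int m, int n, int k) {
    return m + n + k >= 20;
}
```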

To enable LIBXSMM in Eigen, we replace Eigen’s native matrix-matrix multiplication implementation with a call to libxsmm_dgemm.

**Experiment Setup**

We examine the performance of two workloads that use Eigen:

- **A simple DGEMM benchmark** that implements DGEMM on a set of square, double-precision matrices
- **An implementation of EKF** that works on synthetically generated RADAR and LIDAR data

We use native Eigen, Eigen with Intel MKL, and Eigen with LIBXSMM in these experiments. All benchmarks are executed in serial.

**Table 2** details our library and compiler versions and hardware specifications.

In this DGEMM benchmark, our figure of merit is the improvement in performance (gigaflops/second) over
native Eigen with g++. With the exception of matrix sizes 2 and 4, both Eigen with Intel MKL and Eigen with
LIBXSMM provide a speedup over native Eigen across all classes of matrices. It is interesting to note that native
Eigen has the lowest performance, regardless of whether it is compiled with GNU or Intel® compilers (**Figures 4**
and **5**). In terms of performance improvement, the overall trend is that:

- **Eigen+LIBXSMM** produces the highest performance across all matrices (excluding matrix sizes 2 and 4).
- **Eigen+LIBXSMM with g++** produces the highest speedup for matrices smaller than size 13.
- **Eigen+LIBXSMM with ICPC** produces the highest speedup across all g++ and ICPC variants for matrix sizes greater than 13.

We evaluate EKF by using native Eigen, Eigen with Intel MKL, and Eigen with LIBXSMM. From our earlier
DGEMM benchmarking, we see that g++ provides higher performance for matrix sizes less than 13. Since EKF
works on smaller matrices, we evaluate speedup in EKF using g++. Our baseline for evaluating speedup is EKF
that uses native Eigen. Our figure of merit is the median time to predict and update each sensor measurement
(a total of 10,000 sensor measurements were processed). As shown in **Figure 6**, incorporating Intel MKL or
LIBXSMM can produce a speedup of approximately 1.2X in EKF.
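The median-time figure of merit can be computed with a small helper like the one below (a sketch; timing each predict/update call, e.g. with std::chrono, to fill the sample vector is assumed):

```cpp
#include <algorithm>
#include <vector>

// Report the median of a set of per-measurement latencies, matching
// the figure of merit described above. Sorts a copy of the samples and
// averages the two middle values when the count is even.
double median_seconds(std::vector<double> samples) {
    std::sort(samples.begin(), samples.end());
    const std::size_t n = samples.size();
    return (n % 2 != 0) ? samples[n / 2]
                        : 0.5 * (samples[n / 2 - 1] + samples[n / 2]);
}
```

The median is preferred over the mean here because a few outlier iterations (e.g., cold caches) would otherwise skew the result.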

In this article, we concentrated on speeding up the EKF, a common automated driving workload used for sensor fusion and localization. We investigated this performance improvement on the Intel Xeon processor in two ways:

- **Speeding up the matrix-matrix multiplication kernel in native Eigen** by using Intel MKL and LIBXSMM
- **Improving performance** of the EKF workload

We show a maximum speedup of 3.1X over native Eigen by using Eigen+LIBXSMM with the Intel C++ compiler. We improved EKF performance by using Intel MKL and LIBXSMM, producing a speedup of 1.2X.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance.

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

For more information regarding performance and optimization choices in Intel® Software Development Products, see our Optimization Notice.

Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.