Intel Software

This article recently appeared in Issue 31 of The Parallel Universe magazine.

Good design decisions are based on good data:

  • What loops should be threaded and vectorized first?
  • Is the performance gain worth the effort?
  • Will the threading performance scale with higher core counts?
  • Does this loop have a dependency that prevents vectorization?
  • What are the trip counts and memory access patterns?
  • Have you vectorized efficiently with the latest Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instructions? Or are you using older SIMD instructions?

Intel® Advisor is a dynamic analysis tool that’s part of Intel® Parallel Studio XE, Intel’s comprehensive tool suite for building and modernizing code. Intel Advisor answers these questions, and many more. You can collect insightful program metrics on the vectorization and memory profile of your application. And, besides providing tailored reports using the GUI and command line, Intel Advisor now gives you the added flexibility to mine a collected database and create powerful new reports using Python*.

When you run Intel Advisor, it stores all the data it collects in a proprietary database that you can now access using a Python API. This provides a flexible way to generate customized reports on program metrics. This article will describe how to use this new functionality.

Getting Started

To get started, you need to set up the Intel Advisor environment. (For this article, we ran all the scripts on Linux*, but the Intel Advisor Python API also supports Windows*.)

source advixe-vars.sh

Next, to set up the Intel Advisor data, you need to run some collections. Some of the program metrics require additional analysis such as tripcounts, memory access patterns, and dependencies.

advixe-cl --collect survey --project-dir ./your_project -- <your-executable-with-parameters> 

advixe-cl --collect tripcounts -flops-and-masks -callstack-flops --project-dir ./your_project -- <your-executable-with-parameters>

To run a map or dependencies collection, you need to specify the loops that you want to analyze. You can find this information using the Intel Advisor GUI or by generating a command-line report.

advixe-cl --collect map --mark-up-list=1,2,3,4 --project-dir ./your_project -- <your-executable-with-parameters>

advixe-cl --collect dependencies --mark-up-list=1,2,3,4 --project-dir ./your_project -- <your-executable-with-parameters>
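If you run these collections often, it can help to generate the command lines from a script. The following is a minimal sketch that only builds the argument lists shown above and does not execute anything. The helper names advixe_collect_cmd and collection_plan, and the ./your_app target, are assumptions made for illustration and are not part of Intel Advisor.

```python
def advixe_collect_cmd(analysis, project_dir, target, extra_opts=()):
    """Build the advixe-cl argument list for one collection (nothing is run)."""
    return (["advixe-cl", "--collect", analysis]
            + list(extra_opts)
            + ["--project-dir", project_dir, "--"]
            + list(target))


def collection_plan(project_dir, target, loop_ids):
    """Return the four collections from this article, in order."""
    marks = "--mark-up-list=" + ",".join(str(i) for i in loop_ids)
    return [
        advixe_collect_cmd("survey", project_dir, target),
        advixe_collect_cmd("tripcounts", project_dir, target,
                           ["-flops-and-masks", "-callstack-flops"]),
        advixe_collect_cmd("map", project_dir, target, [marks]),
        advixe_collect_cmd("dependencies", project_dir, target, [marks]),
    ]


# Hypothetical target executable and loop IDs, used only to show the output.
for cmd in collection_plan("./your_project", ["./your_app"], [1, 2, 3, 4]):
    print(" ".join(cmd))
```

Each list could then be passed to subprocess.run once the Intel Advisor environment has been sourced.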

Finally, you will need to copy the Intel Advisor reference examples to a test area.

cp -r /opt/intel/advisor_2018/pythonapi/examples .

Note that all the scripts we ran for this article use the Python that currently ships with Intel Advisor on Linux. The standard distributions of Python should work just as well.

Using the Intel Advisor Python API

The reference examples we’ve provided are just a small set of the reports that are possible using this flexible way to access your program data. You can use the columns.py example to get a list of available data fields. For example, you would see the metrics in Table 1 after running a basic survey collection.



Table 1. Sample survey metrics



Intel Advisor Python API in Action

Let’s walk through a simple example that shows how to collect some powerful metrics using the Intel Advisor Python API. The first step is to import the Intel Advisor library package.

import advisor

You then need to open the Intel Advisor project that contains the result you’ve collected.

project = advisor.open_project(sys.argv[1])

You also have the option of creating a project and running collections. (In the example below, we’re just doing an open_project.) In this example, we access data from the memory access pattern (MAP) collection. We do this using the following line of code:

data = project.load(advisor.MAP)

Once we’ve loaded this data, we can loop through the table and gather cache utilization statistics. We then print out the data we’ve collected:

import sys

try:
    # First import the Advisor library
    import advisor
except ImportError:
    sys.exit(1)

# Formatting helpers for the report line (values chosen for illustration)
indent = '  '
width = 24

# Open your Advisor project
project = advisor.open_project(sys.argv[1])

# Load the Memory Access Pattern (MAP) data
data = project.load(advisor.MAP)

# Loop through the MAP data and gather information about cache utilization
for site in data.map:
    site_id = site['site_id']
    cachesim = data.get_cachesim_info(site_id)
    print(indent * 2 + 'Average utilization'.ljust(width) +
          ' = {:.2f}%'.format(cachesim.utilization))
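Once you have one utilization figure per site, a natural next step is to rank the sites. The sketch below shows one way to do that on plain (site_id, utilization) pairs, so it runs without an Intel Advisor project. The helper names and the sample values are placeholders for illustration, not measured results.

```python
def format_utilization(site_id, utilization, indent="  ", width=30):
    """Format one report line in the same style as the loop above."""
    label = "Site {} average utilization".format(site_id)
    return indent + label.ljust(width) + " = {:.2f}%".format(utilization)


def least_utilized(samples):
    """Return the (site_id, utilization) pair with the lowest utilization."""
    return min(samples, key=lambda pair: pair[1])


# Placeholder values for illustration only; in a real script these would be
# gathered from cachesim.utilization for each site.
samples = [(1, 100.0), (2, 25.0), (3, 50.0)]
for site_id, utilization in samples:
    print(format_utilization(site_id, utilization))
print("Lowest utilization at site {}".format(least_utilized(samples)[0]))
```

Sites with low utilization are usually the first candidates for a closer look at their memory access patterns.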

Intel Advisor Python API Advanced Topics

The examples provided as part of the Intel Advisor Python API give you a blueprint for writing your own scripts. Table 2 shows some of these advanced capabilities.



Table 2. Intel Advisor Python API advanced capabilities



Here are some highlights of our various examples. We are constantly adding to the list of examples.

Generate a combined report showing all data collected:

project = advisor.create_project(project_dir)

Generate an HTML report:

advixe-python to_html.py ./your_project

You can generate a roofline HTML chart (Figure 1) with this command:

python roofline.py ./your_project

You must run the roofline.py script with an external Python command and not advixe-python. It currently only runs on Linux. It also requires the additional libraries numpy, pandas, and matplotlib to be installed.

Use this command to generate cache simulation statistics:

advixe-python cache.py ./your_project
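Because roofline.py depends on numpy, pandas, and matplotlib, it can be worth checking that your external Python can find them before you run the script. This is a minimal sketch that assumes Python 3.4 or later; the missing_modules helper is our own name and is not part of the Intel Advisor examples.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names
            if importlib.util.find_spec(name) is None]


required = ["numpy", "pandas", "matplotlib"]
missing = missing_modules(required)
if missing:
    print("Install before running roofline.py: " + ", ".join(missing))
else:
    print("All roofline.py dependencies are importable.")
```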



Figure 1 - Roofline HTML chart



You can see the results we obtained from the cache model in Table 3.



Table 3. Cache model results



Case Study: Vectorization Comparison

In this case study, we create a Python script that can compare the vectorization of a given loop when compiled with different compiler options.

Step 1: Compile Code with Different Optimization Flags

First, compile the app with different options. In this example, we use the Intel® C++ Compiler (but Intel Advisor works at the binary level, so any compiler should work). In the first case, we are compiling without optimization using the compiler option -O0. The second case uses full optimization -O3.

icc loops1.cpp -O0 -g -debug inline-debug-info -qopt-report=5 -ipo- -o loops1-no-opt
icc loops1.cpp -O3 -g -debug inline-debug-info -qopt-report=5 -ipo- -o loops1


Step 2: The Python Code

The script is simple. First, it reads its arguments from the command line. If it is given only an Intel Advisor project, it uses the data already contained in the project. Otherwise, it runs an Intel Advisor survey collection on each executable first. Once the survey runs complete, it decodes the assembly for the loops and prints the instructions of the two loops side by side. The main function in our Python code is named get_formatted_asm. This function accesses the Intel Advisor database and decodes the assembly for our loops. It also marks which instructions are vector instructions and reports the self time of the loop.

import sys
import advisor

try:
    from itertools import zip_longest  # Python 3
except ImportError:
    from itertools import izip_longest as zip_longest  # Python 2

# first form just analyses already collected data
# second form allows collecting data before analysis
if len(sys.argv) < 3 or len(sys.argv) > 6:
    print('''
Usage: advixe-python {} path_to_project_dir loop1 [loop2]
Or: advixe-python {} path_to_project_dir loop1 [loop2] executable1 executable2
Where loop1 and loop2 are in the form source:line
'''.format(__file__, __file__))
    sys.exit(1)

project_dir = sys.argv[1]
project_dir1 = project_dir + ".1"
project_dir2 = project_dir + ".2"

loop1 = sys.argv[2]
# if we have an odd number of arguments (including script name)
# then loop2 is the same as loop1
loop2 = sys.argv[3] if len(sys.argv) % 2 == 0 else loop1

binary1 = ''
binary2 = ''
# in the second form two last args are executables to run
if 4 < len(sys.argv) < 7:
    binary1 = sys.argv[-2]
    binary2 = sys.argv[-1]

# returns formatted asm listing with vectorized instructions
# marked with "VEC " for given loop
def get_formatted_asm(project_dir, binary, loop):
    asm = []

    # try open or create project, run collection if needed
    try:
        project = advisor.open_project(project_dir)
    except:
        project = advisor.create_project(project_dir)

    if binary:
        project.collect(advisor.SURVEY, binary)

    data = project.load(advisor.SURVEY)
    for entry in data.bottomup:
        if loop in entry['function_call_sites_and_loops']:
            asm += ["{:54.54} ".format(entry['function_call_sites_and_loops']),
                    "{:54.54} ".format("Self time: " + entry['self_time']),
                    " " * 54]
            for instruction in entry.assembly:
                isVectorized = "VEC" if "VECTORIZED" in instruction['instruction_type'] else ""
                asm.append("{:4.4}{:50.50} ".format(isVectorized, instruction['asm']))
            asm.append("")
    return asm

asm1 = get_formatted_asm(project_dir1, binary1, loop1)
asm2 = get_formatted_asm(project_dir2, binary2, loop2)

# print alongside asm listings for comparison
for (a1, a2) in zip_longest(asm1, asm2, fillvalue=' ' * 40):
    print("{}{}".format(a1, a2))
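The marking and side-by-side printing used by the script can be tried without an Intel Advisor project. The sketch below applies the same formatting to plain dictionaries. The instruction_type values "VECTORIZED" and "SCALAR" are stand-ins we chose for illustration (the script only relies on the substring "VECTORIZED"), and the helper names are our own.

```python
try:
    from itertools import zip_longest  # Python 3
except ImportError:
    from itertools import izip_longest as zip_longest  # Python 2


def mark_vectorized(instructions):
    """Prefix each instruction with 'VEC ' when its type contains VECTORIZED."""
    lines = []
    for ins in instructions:
        tag = "VEC" if "VECTORIZED" in ins["instruction_type"] else ""
        # 4-character tag column, 30-character asm column, one trailing space
        lines.append("{:4.4}{:30.30} ".format(tag, ins["asm"]))
    return lines


def side_by_side(left, right, width=35):
    """Pair two listings line by line, padding the shorter one with blanks."""
    return ["{}{}".format(a, b)
            for a, b in zip_longest(left, right, fillvalue=" " * width)]


# Stand-in instruction records, not data read from a real project.
scalar = mark_vectorized([
    {"instruction_type": "SCALAR", "asm": "cdq"},
    {"instruction_type": "SCALAR", "asm": "idiv %ecx"},
    {"instruction_type": "SCALAR", "asm": "addsd %xmm1, %xmm0"},
])
vector = mark_vectorized([
    {"instruction_type": "VECTORIZED", "asm": "addpd %xmm14, %xmm2"},
    {"instruction_type": "VECTORIZED", "asm": "subpd %xmm2, %xmm12"},
])
for row in side_by_side(scalar, vector):
    print(row)
```

Fixed-width columns keep the two listings aligned even when one loop has more instructions than the other.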


Step 3: Run the Python Script

            
advixe-python compare_asm.py /home/work/projects/loops-compare loops1.cpp:34 \
    /home/work/tests/loops/loops1-no-opt /home/work/tests/loops/loops1
[loop in main at loops1.cpp:34]                       
Self time: 45.5463      

 Block 1                                                                        
 movl  -0xbc(%rbp), %eax
 movsxd %eax, %rax
 imul $0x8, %rax, %rax
 addq  -0x88(%rbp), %rax
 movl  -0xac(%rbp), %edx
 imull  -0xac(%rbp), %edx
 mov $0x1, %ecx
 addl  -0xac(%rbp), %ecx
 movq  %rax, -0x28(%rbp)
 mov %edx, %eax
 cdq
 idiv %ecx
 cvtsi2sd %eax, %xmm0
 movl  -0xc0(%rbp), %eax
 cvtsi2sd %eax, %xmm1
 movsdq  0x555(%rip), %xmm2
 divsd %xmm2, %xmm1
 addsd %xmm1, %xmm0
 movq  -0x28(%rbp), %rax
 movsdq  (%rax), %xmm1
 subsd %xmm0, %xmm1
 movl  -0xbc(%rbp), %eax
 movsxd %eax, %rax
 imul $0x8, %rax, %rax
 addq  -0x88(%rbp), %rax
 movsdq  %xmm1, (%rax)
 mov $0x1, %eax
 addl  -0xac(%rbp), %eax
 movl  %eax, -0xac(%rbp)
 movl  -0xac(%rbp), %eax
 cmp $0x64, %eax
 jl 0x401020 <Block 1>
[loop in main at loops1.cpp:34]    
Self time: 4.62404 

  	Block 1                                  
  VEC movdqa %xmm11, %xmm2                     
  VEC movdqa %xmm11, %xmm0                     
  VEC psrlq $0x20, %xmm2                       
  VEC movdqa %xmm10, %xmm1                     
  VEC pmuludq %xmm2, %xmm2                     
  VEC pmuludq %xmm11, %xmm0                    
  VEC psllq $0x20, %xmm2                       
  VEC pand %xmm13, %xmm0                       
  VEC por %xmm2, %xmm0                         
  callq  0x401770 <__svml_idiv4>          
  	Block 2                                  
  VEC cvtdq2pd %xmm0, %xmm2                    
  VEC punpckhqdq %xmm0, %xmm0                  
  	add $0x4, %r15b                          
  VEC cvtdq2pd %xmm0, %xmm3                    
  VEC addpd %xmm14, %xmm2                      
  VEC addpd %xmm14, %xmm3                      
  VEC subpd %xmm2, %xmm12                      
  VEC subpd %xmm3, %xmm8                       
  VEC paddd %xmm15, %xmm11                     
  VEC paddd %xmm15, %xmm10                     
  	cmp $0x64, %r15b                         
  	jb 0x401397 <Block 1> 
    

Step 4: Recompile with AVX2 Vectorization

Now let’s try a further optimization. Since our processor supports the AVX2 instruction set, we are going to tell the compiler to generate AVX2 code. (Note that this is generally not what the compiler will generate by default.)

icc loops1.cpp -O3 -xCORE-AVX2 -g -debug inline-debug-info -qopt-report=5 -ipo- -o loops1-avx2

Step 5: Rerun the Comparison

            
advixe-python compare_asm.py /home/work/projects/loops-compare-opt loops1.cpp:34 \
    /home/work/tests/loops/loops1 /home/work/tests/loops/loops1-avx2
[loop in main at loops1.cpp:34] 
Self time: 4.81401

 Block 1
VEC movdqa %xmm11, %xmm2 
VEC movdqa %xmm11, %xmm0 
VEC psrlq $0x20, %xmm2 
VEC movdqa %xmm10, %xmm1 
VEC pmuludq %xmm2, %xmm2 
VEC pmuludq %xmm11, %xmm0 
VEC psllq $0x20, %xmm2 
VEC pand %xmm13, %xmm0 
VEC por %xmm2, %xmm0 
 callq 0x401770 <__svml_idiv4> 
 Block 2 
VEC cvtdq2pd %xmm0, %xmm2
VEC punpckhqdq %xmm0, %xmm0 
 add $0x4, %r15b 
VEC cvtdq2pd %xmm0, %xmm3 
VEC addpd %xmm14, %xmm2 
VEC addpd %xmm14, %xmm3
VEC subpd %xmm2, %xmm12
VEC subpd %xmm3, %xmm8
VEC paddd %xmm15, %xmm11
VEC paddd %xmm15, %xmm10
 cmp $0x64, %r15b
 jb 0x401397 <Block 1> 
[loop in main at loops1.cpp:34]
Self time: 1.97998

	Block 1 
VEC vpmulld %ymm15, %ymm15, %ymm0
  VEC vmovdqa %ymm14, %ymm1
  callq 0x4018f0 <__svml_idiv8>
	Block 2
	add $0x8, %r15b
VEC vextracti128 $0x1, %ymm0, %xmm2
	VEC vcvtdq2pd %xmm0, %ymm3
VEC vpaddd %ymm13, %ymm15, %ymm15
	VEC vcvtdq2pd %xmm2, %ymm5
VEC vpaddd %ymm13, %ymm14, %ymm14
VEC vaddpd %ymm3, %ymm10, %ymm4
VEC vaddpd %ymm5, %ymm10, %ymm6
VEC vsubpd %ymm4, %ymm8, %ymm8
VEC vsubpd %ymm6, %ymm9, %ymm9
	cmp $0x60, %r15b
	jb 0x4015a9 <Block 1> 

You can see that the assembly code now uses YMM registers instead of XMM, doubling the vector length. The self time of the loop drops from about 4.8 seconds to about 2.0 seconds, a speedup of roughly 2.4X.
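If you want a script to report the register width for you, one simple approach is to scan the listing text for register names. This is a rough sketch under that assumption. The widest_vector_bits helper is our own, and the sample lines are copied from the listings above.

```python
import re

# Width in bits of each SIMD register family
REGISTER_BITS = {"xmm": 128, "ymm": 256, "zmm": 512}


def widest_vector_bits(asm_lines):
    """Return the widest SIMD register width referenced, or 0 if none."""
    widest = 0
    for line in asm_lines:
        for family in re.findall(r"%([xyz]mm)\d+", line):
            widest = max(widest, REGISTER_BITS[family])
    return widest


sse_lines = ["VEC addpd %xmm14, %xmm2", "cmp $0x64, %r15b"]
avx2_lines = ["VEC vcvtdq2pd %xmm0, %ymm3", "VEC vaddpd %ymm3, %ymm10, %ymm4"]
print(widest_vector_bits(sse_lines))
print(widest_vector_bits(avx2_lines))
```

Register width alone does not prove that a loop is vectorized, because scalar floating-point instructions such as cvtsi2sd in the -O0 listing also use XMM registers. Treat it as a complement to the VEC marking, not a replacement.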

Results

The gains we made by optimizing and by using the latest vectorization instruction set were significant:

  • No optimization (-O0): 45.148 seconds
  • Optimization (-O3): 4.403 seconds
  • Optimization and AVX2 (-O3 -xCORE-AVX2): 2.056 seconds
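To put these timings in relative terms, the short sketch below computes the speedup of each build over the unoptimized baseline and over the previous build. The three timings are the ones listed above; the speedups helper is our own.

```python
# Timings in seconds, as reported in the list above
timings = [("-O0", 45.148), ("-O3", 4.403), ("-O3 -xCORE-AVX2", 2.056)]


def speedups(timings):
    """Return (label, speedup vs first entry, speedup vs previous entry)."""
    baseline = timings[0][1]
    previous = None
    rows = []
    for label, seconds in timings:
        vs_base = baseline / seconds
        vs_prev = previous / seconds if previous is not None else 1.0
        rows.append((label, vs_base, vs_prev))
        previous = seconds
    return rows


for label, vs_base, vs_prev in speedups(timings):
    print("{:<18} {:6.2f}x vs -O0 {:6.2f}x vs previous".format(
        label, vs_base, vs_prev))
```

With these numbers, -O3 is about 10X faster than -O0, and adding AVX2 gives roughly another 2.1X, for about 22X overall.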


Maximizing System Performance

On modern processors, it’s crucial to both vectorize and thread software to realize the full performance potential of the processor. The new Intel Advisor Python API in Intel Parallel Studio XE provides a powerful way to generate program statistics and reports that can help you get the most performance out of your system. The examples we outlined in this article illustrate the power of this new interface. Based on your specific needs, you can tailor and extend these examples. Intel is actively gathering feedback on the Intel Advisor Python API. If you've tried it and found it useful, or would like to provide feedback, send an email to mvector_advisor@intel.com.




 

 

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance.

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

For more information regarding performance and optimization choices in Intel® Software Development Products, see our Optimization Notice.

Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.