The Histogram Analysis Tool - Mix
Running a profiling tool such as Intel® VTune™ or Linux perf profiles Intel® SDE itself, not the emulated application. Therefore, Intel® SDE provides instruction mix histograms for profiling the application. The mix analysis tool generates several types of analysis information, which are written to the output file in a format that is both machine and human readable. The file is divided into sections, each containing specific data.
The output file includes a header, a list of loaded images, per-thread data, and a global summary. Each per-thread section contains the top basic-block section, the instruction histogram, the function table, and a histogram per function. The histogram can show instructions by opcode (XED iclass, the default) or by instruction form (XED iform). Instruction groups are also shown in the histogram. The groups are marked with a star (‘*’) and include the instruction category, the instruction ISA set, the instruction length, and more. Information about the various groups can be found below. Optionally, the mix analysis tool can collect dynamic control-flow information; when requested, this data is emitted in the per-thread sections, in their own subsections.
Intel® SDE activates the mix analysis tool when the -mix or -omix knob is specified on the command line. Many options control which data is collected and what is written to the output file.
Mix Knobs
- -d
Only collect dynamic profile [default 1]
- -demangle
Control for symbol demangling [default 1]
- -dynamic_stats_per_block
Print dynamic stats per block [default 0]
- -dynamic_stats_per_loop
Print dynamic stats per loop [default 0]
- -function_call_counts
Collect number of times each function is called [default 1]
- -global_functions
Print global functions report [default 1]
- -global_hot_blocks
Print global hot blocks [default 1]
- -hottest_threads_order
Print per-thread stats ordered by instruction count [default 0]
- -iform
Compute ISA histogram per XED IFORM [default 0]
- -line_info
Add line info to the top hot blocks [default 1]
- -map_all_blocks
Map all the blocks instead of only top blocks [default 0]
- -mapaddr
Emit mappings: Address -> Source file and Line [default 0]
- -mapaddr_top_blocks
Emit mappings for top blocks: Address -> Source file and Line [default 0]
- -mix
Enable the mix histogram analysis tool [default 0]
- -mix_concat_bbls
Concatenate consecutive blocks statistics [default 1]
- -mix_count_rep_iterations
Count each iteration of a repeat string instruction as a separate instruction [default 1]
- -mix_filter_no_shared_libs
Do not instrument shared libraries [default ]
- -mix_loops
Supply loops statistics [default 0]
- -mix_loops_threads
Supply loops statistics per thread [default 1]
- -mix_max_cumulative
Specify the maximum cumulative percentage; printing of functions stops once it is reached [default 97]
- -mix_disable_per_function_stats
Omit the per-function histograms, reduces the output file size [default 0]
- -mix_disable_per_thread_stats
Omit per-thread stats, reduces output file size [default 0]
- -mix_opt_report
Add optimization report messages to mix file [default 0]
- -mix_top_functions
Specify the maximum number of top functions to print (0 means unlimited) [default 10]
- -mix_top_loops
Specify maximum number of top loops for which statistics are printed, sorted by iteration count [default 10]
- -mix_vpconflict
Add VPCONFLICT stats to the mix file [default 0]
- -no_shared_libs
Do not instrument shared libraries [default 0]
- -omix
Specify profile output file name [default sde-mix-out.txt]
- -s
Terminate after collection of the static profile for the main image [default 0]
- -top_blocks
Specify the maximum number of top blocks for which instruction counts are printed [default 20]
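The knobs above are passed on the SDE command line. The helper below is a minimal sketch that only assembles the argument list and does not run anything. build_mix_command is a hypothetical helper, not part of the kit, and it assumes that the launcher is named sde64, that "--" separates SDE knobs from the application command, and that boolean knobs accept an explicit 1 or 0 value.

```python
# Minimal sketch (hypothetical helper): assemble an argv list that runs an
# application under Intel SDE with the mix tool enabled. It does not run SDE.
# Assumptions: the launcher is "sde64", "--" separates SDE knobs from the
# application, and boolean knobs accept an explicit 1/0 value.
from typing import Dict, List, Optional, Sequence, Union

KnobValue = Union[bool, int, str]


def build_mix_command(app: Sequence[str],
                      output: str = "sde-mix-out.txt",
                      knobs: Optional[Dict[str, KnobValue]] = None,
                      sde: str = "sde64") -> List[str]:
    argv = [sde, "-mix", "-omix", output]
    for name, value in (knobs or {}).items():
        if isinstance(value, bool):
            # Pass booleans as 1/0 so enabling and disabling look the same.
            argv += [f"-{name}", "1" if value else "0"]
        else:
            argv += [f"-{name}", str(value)]
    # Everything after "--" is the application command line.
    return argv + ["--", *app]


if __name__ == "__main__":
    cmd = build_mix_command(["./loops"], knobs={"iform": True, "top_blocks": 50})
    print(" ".join(cmd))
```

Building a list rather than a single string keeps arguments with spaces intact if the list is later handed to something like subprocess.run.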
Mix Output File format
The Mix output file includes several sections. The first section is the header, which consists of multiple rows that start with a # sign.
# Mix output version 10
# Intel(R) SDE version: 9.15.0 external
# Starting tid 0, OS-TID 15891
# FINI: end of program
This section has Intel® SDE version number, the kit type and the list of threads.
The next section is the images section, which lists all the images and their address ranges in memory. Tags mark the start and the end of this section.
# EMIT_IMAGE_ADDRESSES
#
# IMAGE NAME LOW ADDRESS HIGH ADDRESS
#
loops 000000400000 00000040a147
/lib64/ld-linux-x86-64.so.2 2aaaaaaab000 2aaaaaacbfeb
[vdso] 2aaaaaacf000 2aaaaaacfdba
/lib64/libm.so.6 2aaaf1c5c000 2aaaf1f58147
/lib64/libgcc_s.so.1 2aaaf1fdc000 2aaaf21f344f
/lib64/libc.so.6 2aaaf21f5000 2aaaf25999df
/lib64/libdl.so.2 2aaaf25a2000 2aaaf27a50ef
# END_IMAGE_ADDRESSES
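Because the image list gives each image's address range, a PC from elsewhere in the file can be mapped back to its image. The sketch below parses the section between the two tags and looks an address up with a binary search. It assumes that the high address is inclusive and that the last two columns of each row are always the hexadecimal low and high addresses; parse_images and image_for are hypothetical names.

```python
# Minimal sketch: parse the EMIT_IMAGE_ADDRESSES section of a mix file and map
# an address back to the image that contains it.
# Assumption: the HIGH ADDRESS column is inclusive.
from bisect import bisect_right
from typing import List, Optional, Tuple

Image = Tuple[int, int, str]  # (low, high, name)


def parse_images(lines: List[str]) -> List[Image]:
    images: List[Image] = []
    inside = False
    for raw in lines:
        line = raw.strip()
        if line.startswith("# EMIT_IMAGE_ADDRESSES"):
            inside = True
        elif line.startswith("# END_IMAGE_ADDRESSES"):
            break
        elif inside and line and not line.startswith("#"):
            # Split from the right so any spaces inside the image name survive.
            name, low, high = line.rsplit(None, 2)
            images.append((int(low, 16), int(high, 16), name))
    return sorted(images)


def image_for(images: List[Image], addr: int) -> Optional[str]:
    lows = [low for low, _, _ in images]
    i = bisect_right(lows, addr) - 1  # last image starting at or below addr
    if i >= 0 and images[i][0] <= addr <= images[i][1]:
        return images[i][2]
    return None
```

For example, the random_r address 2aaaf222d480 from the function table falls inside the /lib64/libc.so.6 range listed above.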
The next sections contain the per-thread data: the top-block stats, the dynamic stats histogram, the per-function stats, and the per-function histograms.
# ==============================================
# STATS FOR TID 0 OS-TID 12005 EMIT# 1
# ==============================================
# EMIT_TOP_BLOCK_STATS FOR TID 0 OS-TID 12005 EMIT # 1 EVENT=ICOUNT
BLOCK: 0 PC: 0000000000401288 ICOUNT: 3500000 EXECUTIONS: 250000 #BYTES: 50 %: 21.7
cumltv%: 21.7 FN: main IMG: loops OFFSET: 1288 Source: loops.c 39,38
XDIS 0000000000401288: BASE 89C7 mov edi, eax
XDIS 000000000040128a: BASE B8D34D6210 mov eax, 0x10624dd3
XDIS 000000000040128f: BASE 89F9 mov ecx, edi
XDIS 0000000000401291: BASE F7EF imul edi
XDIS 0000000000401293: BASE C1F91F sar ecx, 0x1f
XDIS 0000000000401296: BASE 41FFC7 inc r15d
XDIS 0000000000401299: BASE C1FA06 sar edx, 0x6
XDIS 000000000040129c: BASE 2BD1 sub edx, ecx
XDIS 000000000040129e: BASE 69F218FCFFFF imul esi, edx, 0xfffffc18
XDIS 00000000004012a4: BASE 03FE add edi, esi
XDIS 00000000004012a6: BASE 4189BD00C94000 mov dword ptr [r13+0x40c900], edi
XDIS 00000000004012ad: BASE 4983C504 add r13, 0x4
XDIS 00000000004012b1: BASE 4181FFF4010000 cmp r15d, 0x1f4
XDIS 00000000004012b8: BASE 72C9 jb 0x401283
...
# END_TOP_BLOCK_STATS
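The fields of a BLOCK record are related: ICOUNT is the dynamic instruction count, so ICOUNT / EXECUTIONS gives the number of instructions in the block (3500000 / 250000 = 14, matching the 14 XDIS lines shown). Using the thread's *total from the dynamic stats section below (16097397) as the denominator reproduces the reported 21.7%. The sketch below performs these checks; block_summary and the regular expression are assumptions based only on the sample line above.

```python
# Minimal sketch: parse the first line of a BLOCK record and derive
# per-execution numbers from it. The regex is fitted to the sample output.
import re
from typing import Dict

BLOCK_RE = re.compile(
    r"BLOCK:\s*(?P<rank>\d+)\s+PC:\s*(?P<pc>[0-9a-fA-F]+)\s+"
    r"ICOUNT:\s*(?P<icount>\d+)\s+EXECUTIONS:\s*(?P<execs>\d+)\s+"
    r"#BYTES:\s*(?P<nbytes>\d+)\s+%:\s*(?P<pct>[\d.]+)")


def block_summary(line: str, total_icount: int) -> Dict[str, float]:
    m = BLOCK_RE.search(line)
    if m is None:
        raise ValueError(f"not a BLOCK line: {line!r}")
    icount, execs = int(m["icount"]), int(m["execs"])
    return {
        "pc": int(m["pc"], 16),
        # Dynamic count divided by executions = static block length.
        "insts_per_exec": icount / execs,
        # Share of the thread's total dynamic instruction count.
        "pct_of_total": 100.0 * icount / total_icount,
        "reported_pct": float(m["pct"]),
    }
```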
# EMIT_DYNAMIC_STATS FOR TID 0 OS-TID 12005 EMIT #1
#
# $dynamic-counts
#
# TID 0
# opcode count
#
*stack-read 1012276
*stack-write 762328
*iprel-read 1251592
*iprel-write 500617
*mem-read-1 10875
*mem-read-2 820
*mem-read-4 2006936
*mem-read-8 2024321
*mem-read-16 62675
...
SUB 759384
SYSCALL 52
TEST 763928
XGETBV 3
XOR 502198
*total 16097397
# END_DYNAMIC_STATS
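The histogram mixes two kinds of rows: plain opcode rows and starred group rows. Because one instruction can belong to several groups, the group rows overlap and should not be added together. The sketch below, with the hypothetical name parse_dynamic_stats, splits the two kinds so that, for example, the share of an opcode can be computed against *total.

```python
# Minimal sketch: split the $dynamic-counts histogram into opcode rows and
# starred group rows (including *total).
from typing import Dict, List, Tuple


def parse_dynamic_stats(lines: List[str]) -> Tuple[Dict[str, int], Dict[str, int]]:
    opcodes: Dict[str, int] = {}
    groups: Dict[str, int] = {}
    inside = False
    for raw in lines:
        line = raw.strip()
        if line.startswith("# EMIT_DYNAMIC_STATS"):
            inside = True
            continue
        if line.startswith("# END_DYNAMIC_STATS"):
            break
        if not inside or not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # skip anything that is not "name count", e.g. "..."
        name, count = parts[0], int(parts[1])
        (groups if name.startswith("*") else opcodes)[name] = count
    return opcodes, groups
```

With the sample above, the share of SUB would be 759384 / 16097397, roughly 4.7% of the thread's instructions.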
# FUNCTION TOTALS FOR TID 0 OS-TID 12005
#rank total-icount % cumulative% #times-called address function-name image-name
0: 6491936 40.329 40.329 250000 2aaaf222d480 random_r IMG: /lib64/libc.so.6
1: 4250000 26.402 66.731 250000 2aaaf222d2f0 random IMG: /lib64/libc.so.6
2: 3754701 23.325 90.056 1 401230 main IMG: loops
3: 1000000 6.212 96.268 250000 2aaaf222d760 rand IMG: /lib64/libc.so.6
4: 250002 1.553 97.821 250000 401130 rand IMG: loops
5: 187533 1.165 98.986 1 4019d0 __intel_ssse3_rep_memcpy IMG: loops
6: 36379 0.226 99.212 122 2aaaaaab3f10 do_lookup_x IMG: /lib64/ld-linux-x86-64.so.2
7: 29750 0.185 99.397 6 2aaaaaab6390 _dl_relocate_object IMG: /lib64/ld-linux-x86-64.so.2
8: 24866 0.154 99.551 122 2aaaaaab4b70 _dl_lookup_symbol_x IMG: /lib64/ld-linux-x86-64.so.2
9: 18481 0.115 99.666 2595 2aaaaaac3cb0 strcmp IMG: /lib64/ld-linux-x86-64.so.2
10: 5490 0.034 99.700 104 2aaaaaab3d70 check_match IMG: /lib64/ld-linux-x86-64.so.2
16097397 TOTAL
# END FUNCTION TOTALS
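The % and cumulative% columns follow directly from total-icount and the TOTAL row: % is a function's share of TOTAL, and cumulative% is the running sum of those shares in rank order. For instance, 6491936 / 16097397 is 40.329% and adding random's 26.402% gives 66.731%. The hypothetical helper below recomputes both columns.

```python
# Minimal sketch: recompute the % and cumulative% columns of the
# FUNCTION TOTALS table from the total-icount column and the TOTAL row.
from typing import List, Tuple


def function_shares(icounts: List[int], total: int) -> List[Tuple[float, float]]:
    shares = []
    running = 0
    for icount in icounts:  # icounts must be in rank order
        running += icount
        shares.append((100.0 * icount / total, 100.0 * running / total))
    return shares
```

Dividing total-icount by #times-called in the same table also gives the average instructions per call, e.g. about 26 for random_r.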
The end of the file includes global stats, such as the global top-block stats (accumulated over all threads). Additional information can also be attached to the mix output.
Mix Histogram Accounting
The Intel® SDE mix tool counts how many times each basic block is executed and, at the end of the run, generates a detailed report. The histogram has counts per instruction opcode (XED ICLASS) or per instruction opcode and its specific operands (XED IFORM). In addition to the per-instruction accounting, the mix tool assigns instructions to various groups and emits each group's total count in a row marked with a star (‘*’).
Note
The same instruction might be counted multiple times in different groups.
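Since counting happens per basic block, each opcode count is the sum, over all blocks, of the block's execution count times the number of occurrences of that opcode in the block. The sketch below illustrates this with the 14-instruction block at 0x401288 from the top-block example (executed 250000 times). mix_from_blocks is a hypothetical helper, and assigning every instruction to *isa-ext-BASE is an illustrative assumption based on the BASE extension shown in its XDIS lines; it shows how a group row counts the same instructions a second time.

```python
# Minimal sketch of block-based accounting: each block's static instruction
# list is multiplied by the block's execution count.
from collections import Counter
from typing import Dict, List, Optional


def mix_from_blocks(blocks: Dict[int, List[str]],
                    executions: Dict[int, int],
                    groups: Optional[Dict[str, List[str]]] = None) -> Counter:
    hist: Counter = Counter()
    for pc, iclasses in blocks.items():
        n = executions.get(pc, 0)
        for iclass in iclasses:
            hist[iclass] += n
            # Group rows are counted on top of the opcode rows, so one
            # instruction can contribute to several starred entries.
            for group in (groups or {}).get(iclass, []):
                hist[group] += n
    return hist
```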
Partial list of groups reported by the Mix tool
| Group | Description |
|---|---|
| *sse-scalar | Scalar SSE instructions (operating on the lowest element of an XMM register), e.g. ADDSD |
| *avx-scalar | Scalar AVX instructions, e.g. VADDSD |
| *scalar-simd | Scalar instructions of any kind (the sum of all SSE, AVX and AVX-512 scalar instructions) |
| *avx128 | AVX instructions that are neither scalar nor AVX-256 |
| *avx256 | AVX instructions with a vector length of 32 bytes |
| *avx512 | AVX-512 instructions with a vector length of 64 bytes |
| *isa-ext-<E> | Instructions with XED ISA extension E |
| *isa-set-<S> | Instructions with XED ISA set S |
| *category-<C> | Instructions with XED category C |
| *elements_fp_single_N | Vector instructions (SSE, AVX or AVX-512) with single-precision operands, where the operand length is N elements |
| *elements_fp_double_N | Vector instructions with double-precision operands, where the operand length is N elements |
| *elements_fp_single_N_masked | Same as *elements_fp_single_N, for masked vector instructions (AVX-512) |
| *elements_fp_double_N_masked | Same as *elements_fp_double_N, for masked vector instructions (AVX-512) |
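For the *elements_fp_* groups, N can be estimated from the vector width and the element size: a 256-bit AVX operation on single-precision (32-bit) values has 8 elements, and a 512-bit operation on double-precision (64-bit) values also has 8. The sketch below assumes that N equals the full vector width divided by the element width; elements_group is a hypothetical helper, and it has not been verified against every instruction form SDE reports.

```python
# Minimal sketch: derive the *elements_fp_* group name for a vector FP
# instruction. Assumption: N = vector width / element width.
def elements_group(vector_bits: int, precision: str, masked: bool = False) -> str:
    element_bits = {"single": 32, "double": 64}[precision]
    if vector_bits % element_bits:
        raise ValueError("vector width must be a multiple of the element width")
    name = f"*elements_fp_{precision}_{vector_bits // element_bits}"
    # Masked AVX-512 forms are reported in a separate *_masked group.
    return name + "_masked" if masked else name
```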
The Intel® SDE kit includes a text file called idata.txt that lists all the instructions supported by the kit. Each row in the file represents a single XED IFORM, which corresponds to an instruction together with a specific combination of operands; a single instruction can have multiple flavors (e.g. MOV has many flavors with GPRs, immediates, memory and more). This file is helpful for understanding which instructions belong to a specific XED extension, XED category, or XED ISA set.
If you want to know how many AVX instructions were executed, the best way is to use the *isa-ext-AVX or the *isa-set-AVX row. These two counts should be the same.
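The check can be automated once the starred rows of a thread's histogram are in a dictionary. The sketch below, with the hypothetical name avx_share, returns the AVX share of *total and warns if the two AVX counters disagree. The numbers in the usage line are made up for illustration.

```python
# Minimal sketch: report the share of AVX instructions from the starred
# group rows of one $dynamic-counts histogram.
import warnings
from typing import Dict


def avx_share(groups: Dict[str, int]) -> float:
    ext = groups.get("*isa-ext-AVX", 0)
    isa_set = groups.get("*isa-set-AVX", 0)
    if ext != isa_set:
        # The two counters are expected to match; flag it if they do not.
        warnings.warn(f"*isa-ext-AVX ({ext}) != *isa-set-AVX ({isa_set})")
    total = groups.get("*total", 0)
    return 100.0 * ext / total if total else 0.0


# Made-up counts, for illustration only.
print(avx_share({"*isa-ext-AVX": 250, "*isa-set-AVX": 250, "*total": 1000}))
```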
Mix additional information
Intel® SDE provides additional analysis tools and adds their output to the mix output file.
Loop Analysis
Intel® SDE can use the dynamic control-flow graph (DCFG) analysis tool to provide additional information about loops. To activate loop analysis, use the -mix_loops knob. Loop information can be emitted per thread or globally. The loop output has its own section, which is marked with tags and looks like this:
# ==============================================
# LOOPS_STATS_FROM_DCFG_FOR_TID 0
# ==============================================
LOOP: 0 NUM BLOCKS: 4 HEAD BLOCK_ID: 28 ENTRIES: 1 EXECUTIONS: 500
EDGE FROM BLOCK_ID: 30 TO BLOCK_ID: 29 EXECUTIONS: 249500
EDGE FROM BLOCK_ID: 28 TO BLOCK_ID: 29 EXECUTIONS: 500
EDGE FROM BLOCK_ID: 30 TO BLOCK_ID: 31 EXECUTIONS: 500
EDGE FROM BLOCK_ID: 31 TO BLOCK_ID: 28 EXECUTIONS: 499
BLOCK_ID: 28 PC: 0040127d ICOUNT: 1000 EXECUTIONS: 500 #BYTES: 6 FN: main IMG: loops
XDIS 000000000040127d: BASE 4533FF xor r15d, r15d
XDIS 0000000000401280: BASE 4989DD mov r13, rbx
BLOCK_ID: 29 PC: 00401283 ICOUNT: 250000 EXECUTIONS: 250000 #BYTES: 5 FN: main IMG: loops
XDIS 0000000000401283: BASE E8A8FEFFFF call 0x401130
BLOCK_ID: 30 PC: 00401288 ICOUNT: 3500000 EXECUTIONS: 250000 #BYTES: 50 FN: main IMG: loops
XDIS 0000000000401288: BASE 89C7 mov edi, eax
XDIS 000000000040128a: BASE B8D34D6210 mov eax, 0x10624dd3
...
END_LOOP: 0
# ==============================================
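The edge and block counts in a loop record can be cross-checked against each other. In the sample above, the edges into block 29 (249500 + 500) add up to its 250000 executions, and head block 28 runs 500 times: 1 entry plus 499 traversals of the edge from block 31. Block 29 therefore runs 500 times per execution of block 28 on average, consistent with the cmp r15d, 0x1f4 (500) in the hot block. The sketch below collects these counts; loop_flow and its regular expressions are assumptions fitted to the sample output only.

```python
# Minimal sketch: collect edge and block execution counts from one
# LOOPS_STATS_FROM_DCFG section so that flow can be cross-checked.
import re
from collections import defaultdict
from typing import Dict, Iterable, Tuple

EDGE_RE = re.compile(
    r"EDGE FROM BLOCK_ID:\s*(\d+)\s+TO BLOCK_ID:\s*(\d+)\s+EXECUTIONS:\s*(\d+)")
BLOCK_RE = re.compile(r"BLOCK_ID:\s*(\d+)\s+PC:.*?EXECUTIONS:\s*(\d+)")


def loop_flow(lines: Iterable[str]) -> Tuple[Dict[int, int], Dict[int, int]]:
    incoming: Dict[int, int] = defaultdict(int)  # block -> sum of in-edge counts
    executions: Dict[int, int] = {}              # block -> EXECUTIONS field
    for raw in lines:
        line = raw.strip()
        edge = EDGE_RE.match(line)
        if edge:
            _src, dst, count = map(int, edge.groups())
            incoming[dst] += count
            continue
        block = BLOCK_RE.match(line)  # LOOP/XDIS lines do not start with BLOCK_ID
        if block:
            executions[int(block.group(1))] = int(block.group(2))
    return dict(incoming), executions
```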