The Histogram Analysis Tool - Mix

Running a profiling tool such as Intel® VTune™ or Linux perf on an emulated run profiles Intel® SDE itself, not the emulated application. Intel® SDE therefore provides its own instruction mix histogram tool for profiling the application. The tool generates several types of analysis information and writes them to an output file in a format that is both machine and human readable. The file is divided into sections, each containing a specific kind of data.

The output file includes a header, a list of loaded images, per-thread data and a global summary. Each per-thread section contains the top basic blocks section, the instruction histogram, the function table and a histogram per function. The histogram can show instructions by opcode (XED iclass, the default) or by instruction form (XED iform). The histogram also shows instruction groups. Group rows are marked with a star (‘*’) and cover the instruction category, instruction ISA set, instruction length and more. The various groups are described below. Optionally, the mix analysis tool can collect dynamic control flow information. When requested, this data is emitted in its own sub-sections within the per-thread sections.

Intel® SDE activates the mix analysis tool when the -mix or -omix knob is specified on the command line. Many additional knobs control which data is collected and how it is written to the output file.
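As an illustration of how such a command line is put together, the following Python sketch builds the argument list for a mix run. It is a minimal sketch under stated assumptions: the launcher name `sde64`, the `--` separator between the knobs and the application, and the application `./loops` are illustrative choices, not taken from this page. Only the -mix, -omix and -top_blocks knobs come from the knob list.

```python
def build_mix_command(app_argv, out_file=None, top_blocks=None, sde="sde64"):
    """Return an argv list that runs app_argv under Intel SDE with the mix tool.

    Assumptions (not stated on this page): the launcher is called "sde64",
    knob values follow the knob name as a separate argument, and "--"
    separates the SDE knobs from the application command line.
    """
    cmd = [sde, "-mix"]
    if out_file is not None:
        cmd += ["-omix", out_file]            # profile output file name
    if top_blocks is not None:
        cmd += ["-top_blocks", str(top_blocks)]  # number of top blocks to print
    return cmd + ["--"] + list(app_argv)


if __name__ == "__main__":
    # Hypothetical application name; pass the list to subprocess.run() to execute.
    print(" ".join(build_mix_command(["./loops"], out_file="loops-mix.txt", top_blocks=50)))
```

The list form can be handed directly to `subprocess.run`, which avoids shell quoting issues in the application arguments.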

Mix Knobs

-d

Only collect dynamic profile [default 1]

-demangle

Control for symbol demangling [default 1]

-dynamic_stats_per_block

Print dynamic stats per block [default 0]

-dynamic_stats_per_loop

Print dynamic stats per loop [default 0]

-function_call_counts

Collect number of times each function is called [default 1]

-global_functions

Print global functions report [default 1]

-global_hot_blocks

Print global hot blocks [default 1]

-hottest_threads_order

Order the per-thread stats by instruction count [default 0]

-iform

Compute ISA histogram per XED IFORM [default 0]

-line_info

Add line info to the top hot blocks [default 1]

-map_all_blocks

Map all the blocks instead of only top blocks [default 0]

-mapaddr

Emit mappings: Address -> Source file and Line [default 0]

-mapaddr_top_blocks

Emit mappings for top blocks: Address -> Source file and Line [default 0]

-mix

Compute mix histogram analysis tool [default 0]

-mix_concat_bbls

Concatenate consecutive blocks statistics [default 1]

-mix_count_rep_iterations

Count each iteration of repeat string instructions as a separate instruction [default 1]

-mix_filter_no_shared_libs

Do not instrument shared libraries [default ]

-mix_loops

Supply loops statistics [default 0]

-mix_loops_threads

Supply loops statistics per thread [default 1]

-mix_max_cumulative

Specify maximum cumulative, stops printing functions when reached max cumulative [default 97]

-mix_disable_per_function_stats

Omit the per-function histograms, reduces the output file size [default 0]

-mix_disable_per_thread_stats

Omit per-thread stats, reduces output file size [default 0]

-mix_opt_report

Add optimization report messages to mix file [default 0]

-mix_top_functions

Specify maximum number of top function to be printed (0 unlimited) [default 10]

-mix_top_loops

Specify maximum number of top loops for which statistics are printed, sorted by iteration count [default 10]

-mix_vpconflict

Add VPCONFLICT stats to mix file [default 0]

-no_shared_libs

Do not instrument shared libraries [default 0]

-omix

Specify profile output file name [default sde-mix-out.txt]

-s

Terminate after collection of the static profile for the main image [default 0]

-top_blocks

Specify maximum number of top blocks for which instruction counts are printed [default 20]

Mix Output File format

The Mix output file includes several sections. The first section is the header, which consists of multiple rows that start with a # sign.

# Mix output version 10
# Intel(R) SDE version: 9.15.0 external
# Starting tid 0,  OS-TID 15891
# FINI: end of program

This section contains the Intel® SDE version number, the kit type and the list of threads.
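The header rows can be picked apart with a few regular expressions. The sketch below is an assumption-laden example rather than a reference parser: the function name `parse_header` and the returned dictionary layout are made up for illustration, and the patterns are derived only from the four sample rows shown above.

```python
import re

HEADER_SAMPLE = """\
# Mix output version 10
# Intel(R) SDE version: 9.15.0 external
# Starting tid 0,  OS-TID 15891
# FINI: end of program
"""


def parse_header(text):
    """Extract the mix format version, SDE version, kit type and thread list."""
    info = {"mix_version": None, "sde_version": None, "kit": None, "threads": []}
    for line in text.splitlines():
        if m := re.match(r"#\s*Mix output version\s+(\d+)", line):
            info["mix_version"] = int(m.group(1))
        elif m := re.match(r"#\s*Intel\(R\) SDE version:\s*(\S+)\s*(\S*)", line):
            info["sde_version"], info["kit"] = m.group(1), m.group(2)
        elif m := re.match(r"#\s*Starting tid\s+(\d+),\s+OS-TID\s+(\d+)", line):
            # One (SDE thread id, OS thread id) pair per started thread.
            info["threads"].append((int(m.group(1)), int(m.group(2))))
    return info


if __name__ == "__main__":
    print(parse_header(HEADER_SAMPLE))
```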

The next section is the images section, a list of all loaded images and their addresses in memory. Tags mark the start and the end of this section.

# EMIT_IMAGE_ADDRESSES
#
#    IMAGE NAME                 LOW ADDRESS   HIGH ADDRESS
#
loops                           000000400000  00000040a147
/lib64/ld-linux-x86-64.so.2     2aaaaaaab000  2aaaaaacbfeb
[vdso]                          2aaaaaacf000  2aaaaaacfdba
/lib64/libm.so.6                2aaaf1c5c000  2aaaf1f58147
/lib64/libgcc_s.so.1            2aaaf1fdc000  2aaaf21f344f
/lib64/libc.so.6                2aaaf21f5000  2aaaf25999df
/lib64/libdl.so.2               2aaaf25a2000  2aaaf27a50ef
# END_IMAGE_ADDRESSES
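Because the section is delimited by tags, it is straightforward to load into a lookup table and map an address back to its image. The following sketch uses an excerpt of the sample above. The helper names `parse_images` and `image_for` are hypothetical, and treating the HIGH ADDRESS as inclusive is an assumption.

```python
IMAGES_SAMPLE = """\
# EMIT_IMAGE_ADDRESSES
#
#    IMAGE NAME                 LOW ADDRESS   HIGH ADDRESS
#
loops                           000000400000  00000040a147
/lib64/ld-linux-x86-64.so.2     2aaaaaaab000  2aaaaaacbfeb
[vdso]                          2aaaaaacf000  2aaaaaacfdba
/lib64/libc.so.6                2aaaf21f5000  2aaaf25999df
# END_IMAGE_ADDRESSES
"""


def parse_images(text):
    """Return a list of (image name, low address, high address) tuples."""
    images, inside = [], False
    for line in text.splitlines():
        if line.startswith("# EMIT_IMAGE_ADDRESSES"):
            inside = True
        elif line.startswith("# END_IMAGE_ADDRESSES"):
            break
        elif inside and line.strip() and not line.startswith("#"):
            # Split from the right so that image names containing spaces survive.
            name, low, high = line.rsplit(None, 2)
            images.append((name.strip(), int(low, 16), int(high, 16)))
    return images


def image_for(images, addr):
    """Return the image whose range contains addr (high bound assumed inclusive)."""
    for name, low, high in images:
        if low <= addr <= high:
            return name
    return None


if __name__ == "__main__":
    table = parse_images(IMAGES_SAMPLE)
    print(image_for(table, 0x401288))
```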

The following sections hold the per-thread data: the top blocks stats, the dynamic stats histogram, the per-function stats and the per-function histogram.

# ==============================================
# STATS FOR TID 0  OS-TID  12005 EMIT# 1
# ==============================================
# EMIT_TOP_BLOCK_STATS FOR TID 0  OS-TID 12005 EMIT # 1 EVENT=ICOUNT
BLOCK:     0   PC: 0000000000401288   ICOUNT:   3500000   EXECUTIONS: 250000 #BYTES: 50  %: 21.7
               cumltv%:  21.7  FN: main  IMG: loops  OFFSET: 1288 Source: loops.c 39,38
XDIS 0000000000401288: BASE 89C7                     mov edi, eax
XDIS 000000000040128a: BASE B8D34D6210               mov eax, 0x10624dd3
XDIS 000000000040128f: BASE 89F9                     mov ecx, edi
XDIS 0000000000401291: BASE F7EF                     imul edi
XDIS 0000000000401293: BASE C1F91F                   sar ecx, 0x1f
XDIS 0000000000401296: BASE 41FFC7                   inc r15d
XDIS 0000000000401299: BASE C1FA06                   sar edx, 0x6
XDIS 000000000040129c: BASE 2BD1                     sub edx, ecx
XDIS 000000000040129e: BASE 69F218FCFFFF             imul esi, edx, 0xfffffc18
XDIS 00000000004012a4: BASE 03FE                     add edi, esi
XDIS 00000000004012a6: BASE 4189BD00C94000           mov dword ptr [r13+0x40c900], edi
XDIS 00000000004012ad: BASE 4983C504                 add r13, 0x4
XDIS 00000000004012b1: BASE 4181FFF4010000           cmp r15d, 0x1f4
XDIS 00000000004012b8: BASE 72C9                     jb 0x401283
...
# END_TOP_BLOCK_STATS
# EMIT_DYNAMIC_STATS FOR TID 0  OS-TID 12005 EMIT #1
#
# $dynamic-counts
#
# TID 0
#       opcode                 count
#
*stack-read                    1012276
*stack-write                    762328
*iprel-read                    1251592
*iprel-write                    500617
*mem-read-1                      10875
*mem-read-2                        820
*mem-read-4                    2006936
*mem-read-8                    2024321
*mem-read-16                     62675
...
SUB                             759384
SYSCALL                             52
TEST                            763928
XGETBV                               3
XOR                             502198
*total                        16097397
# END_DYNAMIC_STATS
# FUNCTION TOTALS FOR TID 0  OS-TID 12005
#rank total-icount    %  cumulative%   #times-called    address function-name    image-name
0:      6491936  40.329  40.329        250000      2aaaf222d480 random_r         IMG: /lib64/libc.so.6
1:      4250000  26.402  66.731        250000      2aaaf222d2f0 random           IMG: /lib64/libc.so.6
2:      3754701  23.325  90.056             1            401230 main             IMG: loops
3:      1000000   6.212  96.268        250000      2aaaf222d760 rand             IMG: /lib64/libc.so.6
4:       250002   1.553  97.821        250000            401130 rand             IMG: loops
5:       187533   1.165  98.986             1            4019d0 __intel_ssse3_rep_memcpy  IMG: loops
6:        36379   0.226  99.212           122      2aaaaaab3f10 do_lookup_x      IMG: /lib64/ld-linux-x86-64.so.2
7:        29750   0.185  99.397             6      2aaaaaab6390 _dl_relocate_object  IMG: /lib64/ld-linux-x86-64.so.2
8:        24866   0.154  99.551           122      2aaaaaab4b70 _dl_lookup_symbol_x  IMG: /lib64/ld-linux-x86-64.so.2
9:        18481   0.115  99.666          2595      2aaaaaac3cb0 strcmp           IMG: /lib64/ld-linux-x86-64.so.2
10:        5490   0.034  99.700           104      2aaaaaab3d70 check_match      IMG: /lib64/ld-linux-x86-64.so.2
        16097397 TOTAL
# END FUNCTION TOTALS
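The function totals table in the sample above has a regular row layout, so it can be read into records for further processing. The sketch below is one possible way to do that, using an excerpt of the sample. The names `FunctionRow` and `parse_function_totals` are hypothetical, and the row pattern is inferred from the sample rows only.

```python
import re
from typing import NamedTuple

FUNCTIONS_SAMPLE = """\
# FUNCTION TOTALS FOR TID 0  OS-TID 12005
#rank total-icount    %  cumulative%   #times-called    address function-name    image-name
0:      6491936  40.329  40.329        250000      2aaaf222d480 random_r         IMG: /lib64/libc.so.6
1:      4250000  26.402  66.731        250000      2aaaf222d2f0 random           IMG: /lib64/libc.so.6
2:      3754701  23.325  90.056             1            401230 main             IMG: loops
        16097397 TOTAL
# END FUNCTION TOTALS
"""


class FunctionRow(NamedTuple):
    rank: int
    icount: int
    percent: float
    cumulative: float
    calls: int
    address: int
    name: str
    image: str


# rank: icount  %  cumulative%  #times-called  address  function-name  IMG: image
ROW_RE = re.compile(
    r"\s*(\d+):\s+(\d+)\s+([\d.]+)\s+([\d.]+)\s+(\d+)\s+([0-9a-fA-F]+)\s+(.*?)\s+IMG:\s+(.*)$"
)
TOTAL_RE = re.compile(r"\s*(\d+)\s+TOTAL\s*$")


def parse_function_totals(text):
    """Return (list of FunctionRow, total instruction count or None)."""
    rows, total = [], None
    for line in text.splitlines():
        if m := ROW_RE.match(line):
            rank, icount, pct, cum, calls, addr, name, image = m.groups()
            rows.append(FunctionRow(int(rank), int(icount), float(pct), float(cum),
                                    int(calls), int(addr, 16), name, image.strip()))
        elif m := TOTAL_RE.match(line):
            total = int(m.group(1))
    return rows, total


if __name__ == "__main__":
    rows, total = parse_function_totals(FUNCTIONS_SAMPLE)
    for row in rows:
        # The % column is the function's share of the thread's total icount.
        print(row.name, row.icount, round(100.0 * row.icount / total, 3))
```

The percentage column can be cross-checked against the TOTAL row: for random_r, 6491936 out of 16097397 instructions is about 40.329%, which matches the table.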

The end of the file includes the global stats, among them the global top blocks stats (accumulated over all the threads). Additional information can be attached to the mix output.

Mix Histogram Accounting

The Intel® SDE mix tool counts how many times each basic block is executed and generates the detailed report at the end of the run. The histogram has counts per instruction opcode (XED ICLASS) or per instruction opcode and its specific operands (XED IFORM). In addition to counting individual instructions, the mix tool assigns instructions to various groups and emits each group's total count in a row marked with a star (‘*’).

Note

The same instruction might be counted multiple times in different groups.
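Since the tool counts basic-block executions, the ICOUNT of a block can be cross-checked from its listing. The sketch below counts the XDIS rows of the top block from the sample output and sums the encoding lengths. It assumes the third field of each XDIS row is a single-token ISA tag (BASE here) followed by the hex encoding, and that the block contains no repeat string instructions, whose iterations are counted separately by default. The helper name `summarize_block` is hypothetical.

```python
BLOCK_SAMPLE = """\
BLOCK:     0   PC: 0000000000401288   ICOUNT:   3500000   EXECUTIONS: 250000 #BYTES: 50  %: 21.7
XDIS 0000000000401288: BASE 89C7                     mov edi, eax
XDIS 000000000040128a: BASE B8D34D6210               mov eax, 0x10624dd3
XDIS 000000000040128f: BASE 89F9                     mov ecx, edi
XDIS 0000000000401291: BASE F7EF                     imul edi
XDIS 0000000000401293: BASE C1F91F                   sar ecx, 0x1f
XDIS 0000000000401296: BASE 41FFC7                   inc r15d
XDIS 0000000000401299: BASE C1FA06                   sar edx, 0x6
XDIS 000000000040129c: BASE 2BD1                     sub edx, ecx
XDIS 000000000040129e: BASE 69F218FCFFFF             imul esi, edx, 0xfffffc18
XDIS 00000000004012a4: BASE 03FE                     add edi, esi
XDIS 00000000004012a6: BASE 4189BD00C94000           mov dword ptr [r13+0x40c900], edi
XDIS 00000000004012ad: BASE 4983C504                 add r13, 0x4
XDIS 00000000004012b1: BASE 4181FFF4010000           cmp r15d, 0x1f4
XDIS 00000000004012b8: BASE 72C9                     jb 0x401283
"""


def summarize_block(text):
    """Return (instruction count, encoded bytes) for the XDIS rows of one block."""
    n_instr = n_bytes = 0
    for line in text.splitlines():
        parts = line.split()
        if parts and parts[0] == "XDIS":
            # parts: "XDIS", "<address>:", "<ISA tag>", "<hex encoding>", mnemonic...
            n_instr += 1
            n_bytes += len(parts[3]) // 2   # two hex digits per byte
    return n_instr, n_bytes


if __name__ == "__main__":
    instructions, size = summarize_block(BLOCK_SAMPLE)
    executions = 250000          # EXECUTIONS field of the block
    thread_total = 16097397      # *total row of the same thread
    icount = instructions * executions
    print(instructions, size, icount, round(100.0 * icount / thread_total, 1))
```

For this block the listing has 14 instructions in 50 bytes, and 14 × 250000 executions gives the reported ICOUNT of 3500000, or about 21.7% of the thread total.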

Partial list of groups reported by the Mix tool

Group

Description

*sse-scalar

Scalar SSE instructions (the lowest element in xmm) e.g. ADDSD

*avx-scalar

Scalar AVX instructions e.g. VADDSD

*scalar-simd

Scalar instructions of any kind (sum of all SSE, AVX and AVX-512 scalar instructions).

*avx128

AVX instructions which are not scalar and not AVX-256.

*avx256

AVX instructions with vector length 32 bytes.

*avx512

AVX-512 instructions with vector length 64 bytes.

*isa-ext-<E>

Instruction with XED ISA extension E.

*isa-set-<S>

Instruction with XED ISA set S.

*category-<C>

Instruction with XED category C.

*elements_fp_single_N

Vector instruction (i.e. SSE, AVX or AVX-512) with a single precision operand and the operand length is N elements.

*elements_fp_double_N

Vector instruction with double precision operand and the operand length is N elements.

*elements_fp_single_N_masked

Same as *elements_fp_single_N, for masked vector instructions (AVX-512).

*elements_fp_double_N_masked

Same as *elements_fp_double_N, for masked vector instructions (AVX-512).

The Intel® SDE kit has a text file called idata.txt that lists all the instructions supported by the kit. Each row in the file represents a single XED IFORM, which is the equivalent of an instruction together with its operands. A single instruction might have multiple flavors (e.g. MOV has many flavors with GPRs, immediates, memory and more), and each flavor is a separate IFORM. This file is helpful in understanding which instructions belong to a specific XED extension, XED category or XED ISA set.

If you want to know how many AVX instructions were executed, the best way is to read the *isa-ext-AVX or the *isa-set-AVX row. The two counts should be the same.
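Reading a group row out of the dynamic stats section can be sketched as follows. The section tags and the two-column layout follow the sample output shown earlier, but every count in `DYNAMIC_SAMPLE` is made up for illustration, and the helper names `parse_dynamic_stats` and `avx_instruction_count` are hypothetical.

```python
# Layout follows the EMIT_DYNAMIC_STATS sample; all counts here are HYPOTHETICAL.
DYNAMIC_SAMPLE = """\
# EMIT_DYNAMIC_STATS FOR TID 0  OS-TID 12005 EMIT #1
#
# $dynamic-counts
#
# TID 0
#       opcode                 count
#
*isa-ext-AVX                      1200
*isa-set-AVX                      1200
*isa-ext-BASE                     8800
ADD                               8800
VADDPD                             700
VMOVUPD                            500
*total                           10000
# END_DYNAMIC_STATS
"""


def parse_dynamic_stats(text):
    """Split histogram rows into group rows (leading '*') and opcode rows."""
    groups, opcodes = {}, {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or line == "...":
            continue
        name, count = line.rsplit(None, 1)
        target = groups if name.startswith("*") else opcodes
        target[name] = int(count)
    return groups, opcodes


def avx_instruction_count(groups):
    """Return the (*isa-ext-AVX, *isa-set-AVX) counts; a missing row counts as 0."""
    return groups.get("*isa-ext-AVX", 0), groups.get("*isa-set-AVX", 0)


if __name__ == "__main__":
    groups, opcodes = parse_dynamic_stats(DYNAMIC_SAMPLE)
    print(avx_instruction_count(groups))
```

Keeping group rows and opcode rows in separate dictionaries avoids accidentally adding group totals to opcode counts, since one instruction can appear in several groups.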

Mix additional information

Intel® SDE provides additional analysis tools and adds their output to the mix output file.

Loop Analysis

Intel® SDE can use the dynamic control flow graph analysis tool to provide additional information about loops. To activate the loop analysis, use the -mix_loops knob. The loop information can be emitted per thread or globally. The loop output is placed in its own section, which is marked with tags and looks like this:

# ==============================================
# LOOPS_STATS_FROM_DCFG_FOR_TID 0
# ==============================================
LOOP: 0     NUM BLOCKS: 4   HEAD BLOCK_ID: 28   ENTRIES: 1   EXECUTIONS: 500
EDGE FROM BLOCK_ID: 30 TO BLOCK_ID: 29   EXECUTIONS: 249500
EDGE FROM BLOCK_ID: 28 TO BLOCK_ID: 29   EXECUTIONS: 500
EDGE FROM BLOCK_ID: 30 TO BLOCK_ID: 31   EXECUTIONS: 500
EDGE FROM BLOCK_ID: 31 TO BLOCK_ID: 28   EXECUTIONS: 499

BLOCK_ID: 28 PC: 0040127d   ICOUNT:    1000 EXECUTIONS:    500 #BYTES:  6  FN: main  IMG: loops
XDIS 000000000040127d: BASE 4533FF       xor r15d, r15d
XDIS 0000000000401280: BASE 4989DD       mov r13, rbx

BLOCK_ID: 29 PC: 00401283   ICOUNT:  250000 EXECUTIONS: 250000 #BYTES:  5  FN: main  IMG: loops
XDIS 0000000000401283: BASE E8A8FEFFFF   call 0x401130

BLOCK_ID: 30 PC: 00401288   ICOUNT: 3500000 EXECUTIONS: 250000 #BYTES: 50  FN: main  IMG: loops
XDIS 0000000000401288: BASE 89C7         mov edi, eax
XDIS 000000000040128a: BASE B8D34D6210   mov eax, 0x10624dd3
...
END_LOOP: 0
# ==============================================
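The LOOP and EDGE rows of this section describe a small control flow graph, which can be read into a structure and checked for consistency. The sketch below is a minimal example under stated assumptions: the row patterns are inferred from the sample above only, and the helper names `parse_loops` and `incoming_executions` are hypothetical.

```python
import re
from collections import defaultdict

LOOP_SAMPLE = """\
LOOP: 0     NUM BLOCKS: 4   HEAD BLOCK_ID: 28   ENTRIES: 1   EXECUTIONS: 500
EDGE FROM BLOCK_ID: 30 TO BLOCK_ID: 29   EXECUTIONS: 249500
EDGE FROM BLOCK_ID: 28 TO BLOCK_ID: 29   EXECUTIONS: 500
EDGE FROM BLOCK_ID: 30 TO BLOCK_ID: 31   EXECUTIONS: 500
EDGE FROM BLOCK_ID: 31 TO BLOCK_ID: 28   EXECUTIONS: 499
END_LOOP: 0
"""

LOOP_RE = re.compile(
    r"LOOP:\s*(\d+)\s+NUM BLOCKS:\s*(\d+)\s+HEAD BLOCK_ID:\s*(\d+)"
    r"\s+ENTRIES:\s*(\d+)\s+EXECUTIONS:\s*(\d+)"
)
EDGE_RE = re.compile(r"EDGE FROM BLOCK_ID:\s*(\d+)\s+TO BLOCK_ID:\s*(\d+)\s+EXECUTIONS:\s*(\d+)")


def parse_loops(text):
    """Return a list of loop dicts, each with its header fields and edge list."""
    loops, current = [], None
    for line in text.splitlines():
        if m := LOOP_RE.match(line):
            loop_id, blocks, head, entries, execs = map(int, m.groups())
            current = {"id": loop_id, "num_blocks": blocks, "head": head,
                       "entries": entries, "executions": execs, "edges": []}
            loops.append(current)
        elif current is not None and (m := EDGE_RE.match(line)):
            src, dst, execs = map(int, m.groups())
            current["edges"].append((src, dst, execs))
        elif line.startswith("END_LOOP"):
            current = None
    return loops


def incoming_executions(loop):
    """Sum the edge executions arriving at each block inside the loop."""
    totals = defaultdict(int)
    for _src, dst, execs in loop["edges"]:
        totals[dst] += execs
    return dict(totals)


if __name__ == "__main__":
    loop = parse_loops(LOOP_SAMPLE)[0]
    print(incoming_executions(loop))
```

In the sample, the edges into block 29 add up to 249500 + 500 = 250000, which equals the EXECUTIONS reported for that block, and the 499 back-edge executions into head block 28 plus the single loop entry give its 500 executions.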