Detecting AVX-SSE Transitions

Intel® AVX instructions use the YMM registers, while Intel® SSE instructions access the XMM registers, which are the low 128 bits of the YMM registers.

Because Intel® SSE instructions preserve the upper bits of the destination register, a transition between AVX instructions and SSE instructions incurs a large state transition penalty.

The best way to avoid this penalty is to write all the code in AVX, which eliminates the transitions. When calling an external library function that might use SSE, the VZEROUPPER instruction can be issued before the call to avoid the penalty.
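From C, one way to issue VZEROUPPER is the _mm256_zeroupper() intrinsic declared in immintrin.h. The sketch below is illustrative only: legacy_sum is a hypothetical stand-in for an external routine that may use SSE, scale_then_sum is an invented name, and the target("avx") attribute assumes GCC or Clang on an x86 machine with AVX support.

```c
#include <immintrin.h>  /* AVX intrinsics, including _mm256_zeroupper() */
#include <stddef.h>

/* Hypothetical stand-in for an external library routine that may have been
   compiled to legacy SSE instructions. */
static float legacy_sum(const float *v, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Scales eight floats with 256-bit AVX, then hands them to the legacy routine.
   target("avx") is a GCC/Clang attribute (an assumption of this sketch) that
   lets the function use AVX without building the whole file with -mavx. */
__attribute__((target("avx")))
float scale_then_sum(float *data, float factor)
{
    __m256 v = _mm256_loadu_ps(data);               /* unaligned 256-bit load */
    v = _mm256_mul_ps(v, _mm256_set1_ps(factor));   /* multiply all 8 lanes */
    _mm256_storeu_ps(data, v);                      /* write the result back */

    /* Zero the upper 128 bits of every YMM register before leaving AVX code. */
    _mm256_zeroupper();

    return legacy_sum(data, 8);
}
```

The call to _mm256_zeroupper() is placed after the last 256-bit operation and before the call into code that might use SSE, which is where the paragraph above suggests using VZEROUPPER.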

Detecting these transitions in a real-world application is not easy. The CPU has performance monitoring counters that reveal whether your application has such transitions, but they do not help locate where in the code the transitions occur.

Intel® SDE provides an analysis tool to detect these transitions. It tracks the entire execution flow and collects the locations of the transitions. At the end, it reports the basic block in which each transition happened, the dynamic count of transitions, and the dynamic count of instructions within the block.
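The bookkeeping behind such a checker can be pictured as a small state machine over the instruction stream. The sketch below is a simplified model written for illustration, not the algorithm Intel® SDE actually implements: the instruction classes, the three states, and all names (insn_kind, upper_state, count_transitions) are assumptions, and 128-bit VEX-encoded instructions are ignored.

```c
#include <stddef.h>

/* Coarse instruction classes; a real tool would decode each instruction. */
enum insn_kind { INSN_OTHER, INSN_SSE, INSN_AVX256, INSN_VZEROUPPER };

/* Assumed state of the upper YMM halves in this simplified model. */
enum upper_state { UPPER_CLEAN, UPPER_DIRTY, UPPER_SAVED };

struct transition_counts {
    unsigned long avx_to_sse;   /* SSE executed while upper halves were dirty */
    unsigned long sse_to_avx;   /* AVX executed while upper halves were saved */
};

/* Walks the stream once and counts the potentially costly transitions. */
struct transition_counts count_transitions(const enum insn_kind *stream, size_t n)
{
    struct transition_counts c = {0, 0};
    enum upper_state state = UPPER_CLEAN;

    for (size_t i = 0; i < n; i++) {
        switch (stream[i]) {
        case INSN_AVX256:
            if (state == UPPER_SAVED)
                c.sse_to_avx++;          /* saved upper halves are restored */
            state = UPPER_DIRTY;
            break;
        case INSN_SSE:
            if (state == UPPER_DIRTY) {
                c.avx_to_sse++;          /* upper halves have to be saved */
                state = UPPER_SAVED;
            }
            break;
        case INSN_VZEROUPPER:
            state = UPPER_CLEAN;         /* nothing left to preserve */
            break;
        case INSN_OTHER:
            break;                       /* no effect on the model */
        }
    }
    return c;
}
```

Under this model, a loop body of two 256-bit AVX instructions, one SSE instruction, and one more AVX instruction contributes one AVX-to-SSE and one SSE-to-AVX transition per iteration.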

The following C function with inline assembly has an AVX-SSE transition:

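void foo()
{
    int i;
    /* src and dst are assumed to be 32-byte-aligned 256-bit variables defined elsewhere */
    for (i = 0; i < 100; i++)
    {
        __asm {
            vmovaps ymm1, src       // AVX: load 256 bits from src
            vmovaps ymm2, dst       // AVX: load 256 bits from dst
            addps xmm2, xmm1        // SSE: legacy add, causes the AVX-to-SSE transition
            vmovaps dst, ymm2       // AVX: store result, causes the SSE-to-AVX transition
        }
    }
}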

Run Intel® SDE on the application that contains this code:

% sde -hsw -ast -- foo.exe

The output file has a header with an explanation, followed by the analysis results.

# ===================================================
# AVX/SSE transition checker
#
# 'Penalty in Block' provides the address (rIP) of the code basic block with
#       the penaties.
#
# 'Dynamic AVX to SSE Transition' counts the number of potentially
#       costly AVX-to-SSE sequences
#
# 'Dynamic SSE to AVX  Transition' counts the number of potentially
#       costly SSE-to-AVX sequences
#
# 'Static Icount' is the static number instructions in the block
#
# 'Executions' is the dynamic number of times the block was executed
#
# 'Dynamic Icount' is the product of the static icount and executions columns
#
# 'Previous Block' is an attempt to find the  previous control flow block
#
# 'State Change Block' is an attempt to find the block that put the
#       state machine in a state that conflicted with this block, causing a
#       transition in this block
#
# ===================================================
    Penalty    Dynamic      Dynamic                                            State
      in     AVX to SSE   SSE to AVX   Static             Dynamic   Previous   Change
     Block   Transition   Transition   Icount Executions   Icount     Block    Block
============ =========== ============ ======== ========== ======== ========= =========
    0x40115a     100          100        5        100       500        N/A      N/A
#Initial state from routine:  not-found @ 0
#Previous block in routine:   not-found @ 0
#Penalty detected in routine: foo @ 0x40115a

    0x40124e       1            0       13         12       156     0x40123e   0x40115a
#Initial state from routine:  foo @ 0x40115a
#Previous block in routine:   dump @ 0x40123e
#Penalty detected in routine: dump @ 0x40124e

# SUMMARY
# AVX_to_SSE_transition_instances:        101
# SSE_to_AVX_transition_instances:        100
# Dynamic_insts:                          187111
# AVX_to_SSE_instances/instruction:       0.0005
# SSE_to_AVX_instances/instruction:       0.0005
# AVX_to_SSE_instances/100instructions:   0.0540
# SSE_to_AVX_instances/100instructions:   0.0534
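The derived figures in the report can be checked by hand. The helpers below are a minimal sketch with hypothetical names (dynamic_icount, per_100_instructions); they only restate the arithmetic given in the report header and summary and are not part of Intel® SDE.

```c
/* 'Dynamic Icount' as defined in the report header:
   the product of the static icount and the number of executions. */
unsigned long dynamic_icount(unsigned long static_icount, unsigned long executions)
{
    return static_icount * executions;
}

/* Transition instances per 100 executed instructions, as in the summary. */
double per_100_instructions(unsigned long transitions, unsigned long dynamic_insts)
{
    return 100.0 * (double)transitions / (double)dynamic_insts;
}
```

With the numbers above, dynamic_icount(5, 100) gives 500 and dynamic_icount(13, 12) gives 156, matching the two table rows. per_100_instructions(101, 187111) is about 0.0540 and per_100_instructions(100, 187111) is about 0.0534, matching the last two summary lines.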

The knobs for this analysis tool:

-ast

Identify slow Intel AVX-to-SSE and SSE-to-Intel AVX transitions [default 0]

-ast_trace

Trace Intel AVX-to-SSE and SSE-to-Intel AVX transitions, may result in a large output file [default 0]

-oast

Specify Intel-AVX/SSE transition file name [default sde-avx-sse-transition-out.txt]