Detecting AVX-SSE Transitions
Intel® AVX instructions use the 256-bit YMM registers, while Intel® SSE instructions access the XMM registers, which are the low 128 bits of the YMM registers.
Because Intel® SSE instructions preserve the upper bits of the destination register, a transition between AVX instructions and SSE instructions can incur a large state-transition penalty.
The best way to avoid this penalty is to write all of the code in AVX, so that no transitions occur. When calling an external library function that might use SSE, the VZEROUPPER instruction can be used to avoid the penalty.
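As a minimal sketch of the VZEROUPPER approach, the C code below uses the _mm256_zeroupper() intrinsic from <immintrin.h>, which emits VZEROUPPER, before calling into code that may use SSE. It assumes a GCC- or Clang-compatible compiler on x86 (the target attribute is compiler-specific). The names legacy_sum4 and add8_then_call_legacy are hypothetical and used only for illustration.

```c
#include <immintrin.h>

/* Hypothetical stand-in for an external library routine that may have
 * been compiled with legacy SSE instructions. */
static float legacy_sum4(const float *v)
{
    return v[0] + v[1] + v[2] + v[3];
}

/* Adds eight floats with 256-bit AVX, then clears the upper halves of
 * the YMM registers before handing control to the legacy routine.
 * The target attribute (GCC/Clang assumption) enables AVX for this
 * function only. */
__attribute__((target("avx")))
float add8_then_call_legacy(const float *a, const float *b, float *out)
{
    __m256 va = _mm256_loadu_ps(a);               /* 256-bit AVX load  */
    __m256 vb = _mm256_loadu_ps(b);               /* 256-bit AVX load  */
    _mm256_storeu_ps(out, _mm256_add_ps(va, vb)); /* 256-bit AVX add   */

    _mm256_zeroupper();  /* VZEROUPPER: upper 128 bits are now clean */

    return legacy_sum4(out);  /* code that might use SSE */
}
```

The call to _mm256_zeroupper() is placed after the last 256-bit operation and before the call, so the callee starts with clean upper bits.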
Detecting these transitions in a real-world application is not easy. The CPU has performance monitoring counters that can show that your application has such transitions, but they do not help you find where in the code the transitions occur.
Intel® SDE provides an analysis tool to detect these transitions. It tracks the entire execution flow and collects the locations of the transitions. At the end, it reports the basic block in which each transition happened, the dynamic count of transitions, and the dynamic count of instructions within the block.
The following C function with inline assembly has an AVX-SSE transition:
void foo()
{
    int i;
    for (i=0; i<100; i++)
    {
        __asm {
            vmovaps ymm1, src   // avx
            vmovaps ymm2, dst   // avx
            addps xmm2, xmm1    // sse
            vmovaps dst, ymm2   // avx
        }
    }
}
Run Intel® SDE on the application with this code.
% sde -hsw -ast -- foo.exe
The output file contains a header with an explanation, followed by the analysis results.
# ===================================================
# AVX/SSE transition checker
#
# 'Penalty in Block' provides the address (rIP) of the code basic block with
# the penaties.
#
# 'Dynamic AVX to SSE Transition' counts the number of potentially
# costly AVX-to-SSE sequences
#
# 'Dynamic SSE to AVX Transition' counts the number of potentially
# costly SSE-to-AVX sequences
#
# 'Static Icount' is the static number instructions in the block
#
# 'Executions' is the dynamic number of times the block was executed
#
# 'Dynamic Icount' is the product of the static icount and executions columns
#
# 'Previous Block' is an attempt to find the previous control flow block
#
# 'State Change Block' is an attempt to find the block that put the
# state machine in a state that conflicted with this block, causing a
# transition in this block
#
# ===================================================
Penalty      Dynamic     Dynamic                                              State
in           AVX to SSE  SSE to AVX   Static              Dynamic  Previous  Change
Block        Transition  Transition   Icount   Executions Icount   Block     Block
============ =========== ============ ======== ========== ======== ========= =========
0x40115a     100         100          5        100        500      N/A       N/A
#Initial state from routine: not-found @ 0
#Previous block in routine: not-found @ 0
#Penalty detected in routine: foo @ 0x40115a
0x40124e     1           0            13       12         156      0x40123e  0x40115a
#Initial state from routine: foo @ 0x40115a
#Previous block in routine: dump @ 0x40123e
#Penalty detected in routine: dump @ 0x40124e
# SUMMARY
# AVX_to_SSE_transition_instances: 101
# SSE_to_AVX_transition_instances: 100
# Dynamic_insts: 187111
# AVX_to_SSE_instances/instruction: 0.0005
# SSE_to_AVX_instances/instruction: 0.0005
# AVX_to_SSE_instances/100instructions: 0.0540
# SSE_to_AVX_instances/100instructions: 0.0534
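The transition counts in this report can be reproduced with a simplified three-state model of the upper YMM bits (clean, dirty, saved). The C sketch below is an illustration under stated assumptions, not Intel SDE's actual implementation: it ignores 128-bit VEX-encoded instructions, and all type and function names (ast_model, ast_step, ast_run_loop) are hypothetical.

```c
#include <stddef.h>

/* Simplified instruction classes (assumption: 128-bit VEX forms omitted). */
typedef enum { INSN_SSE, INSN_AVX256, INSN_VZEROUPPER } insn_kind;

/* State of the upper 128 bits of the YMM registers. */
typedef enum { UPPER_CLEAN, UPPER_DIRTY, UPPER_SAVED } upper_state;

typedef struct {
    upper_state   state;
    unsigned long avx_to_sse;  /* potentially costly AVX-to-SSE sequences */
    unsigned long sse_to_avx;  /* potentially costly SSE-to-AVX sequences */
} ast_model;

void ast_init(ast_model *m)
{
    m->state = UPPER_CLEAN;
    m->avx_to_sse = 0;
    m->sse_to_avx = 0;
}

/* Advance the model by one instruction and count any transition. */
void ast_step(ast_model *m, insn_kind k)
{
    switch (k) {
    case INSN_AVX256:
        if (m->state == UPPER_SAVED)
            m->sse_to_avx++;          /* saved upper bits must be restored */
        m->state = UPPER_DIRTY;
        break;
    case INSN_SSE:
        if (m->state == UPPER_DIRTY) {
            m->avx_to_sse++;          /* dirty upper bits must be saved */
            m->state = UPPER_SAVED;
        }
        break;
    case INSN_VZEROUPPER:
        m->state = UPPER_CLEAN;       /* upper bits zeroed, nothing to save */
        break;
    }
}

/* Run a loop body of n instructions the given number of times. */
void ast_run_loop(ast_model *m, const insn_kind *body, size_t n,
                  unsigned iterations)
{
    unsigned it;
    size_t i;
    for (it = 0; it < iterations; it++)
        for (i = 0; i < n; i++)
            ast_step(m, body[i]);
}
```

Feeding the loop body of foo() (AVX, AVX, SSE, AVX) through this model for 100 iterations yields 100 transitions in each direction. One further SSE instruction afterwards, as in the dump block, brings the AVX-to-SSE total to 101, which matches the summary.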
The following knobs control this analysis tool:
- -ast
Identify slow Intel AVX-to-SSE and SSE-to-Intel AVX transitions [default 0]
- -ast_trace
Trace Intel AVX-to-SSE and SSE-to-Intel AVX transitions, may result in a large output file [default 0]
- -oast
Specify Intel-AVX/SSE transition file name [default sde-avx-sse-transition-out.txt]
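Because the report is a plain text file (sde-avx-sse-transition-out.txt by default), its summary can be checked from a script or test harness. The C sketch below extracts one integer value from the "# key: value" summary lines. It assumes the summary layout shown in the sample output above; the function name ast_summary_value is hypothetical, and the report text is passed in as a string so that reading the file is left to the caller.

```c
#include <stdio.h>
#include <string.h>

/* Search the report text for a line of the form "# <key>: <number>".
 * Returns 1 and stores the number in *out if found, otherwise 0.
 * Assumption: summary lines look like those in the sample output. */
int ast_summary_value(const char *report, const char *key, unsigned long *out)
{
    size_t klen = strlen(key);
    const char *line = report;

    while (line && *line) {
        const char *p = line;
        if (*p == '#') {
            p++;
            while (*p == ' ')
                p++;                       /* skip spaces after '#' */
            if (strncmp(p, key, klen) == 0 && p[klen] == ':')
                return sscanf(p + klen + 1, "%lu", out) == 1;
        }
        line = strchr(line, '\n');         /* move to the next line */
        if (line)
            line++;
    }
    return 0;
}
```

A caller could, for example, look up AVX_to_SSE_transition_instances and treat any non-zero value as a failure in an automated check.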