SSE

This is a quick reference for [[http://en.wikipedia.org/wiki/Streaming_SIMD_Extensions|Intel's Streaming SIMD Extensions]]. Feel free to make additions or corrections!

= Vector types =

The vector types here are named with the same convention as in [[http://docs.factorcode.org/content/article-math.vectors.simd.html|Factor's SIMD library]]. Each name is a C element type followed by the element count, so %float-4% is four 32-bit floats in one 128-bit register:

- char-16
- uchar-16
- short-8
- ushort-8
- int-4
- uint-4
- longlong-2
- ulonglong-2
- float-4
- double-2

= Instruction set =

The number next to each instruction is the SSE version that introduced it:

- 1: SSE
- 2: [[http://en.wikipedia.org/wiki/SSE2|SSE2]]
- 3: [[http://en.wikipedia.org/wiki/SSE3|SSE3]]
- 3.3: [[http://en.wikipedia.org/wiki/SSSE3|SSSE3]]
- 4.1: [[http://en.wikipedia.org/wiki/SSE4|SSE4.1]]
- 4.2: [[http://en.wikipedia.org/wiki/SSE4|SSE4.2]]

| |char-16|uchar-16|short-8|ushort-8|int-4|uint-4|longlong-2|ulonglong-2|float-4|double-2|
|move^\*^|MOVDQ[AU] 2|MOVDQ[AU] 2|MOVDQ[AU] 2|MOVDQ[AU] 2|MOVDQ[AU] 2|MOVDQ[AU] 2|MOVDQ[AU] 2|MOVDQ[AU] 2|MOV[AU]PS 1|MOV[AU]PD 2|
|add|PADDB 2|PADDB 2|PADDW 2|PADDW 2|PADDD 2|PADDD 2|PADDQ 2|PADDQ 2|ADDPS 1|ADDPD 2|
|subtract|PSUBB 2|PSUBB 2|PSUBW 2|PSUBW 2|PSUBD 2|PSUBD 2|PSUBQ 2|PSUBQ 2|SUBPS 1|SUBPD 2|
|saturated add|PADDSB 2|PADDUSB 2|PADDSW 2|PADDUSW 2|||||||
|saturated subtract|PSUBSB 2|PSUBUSB 2|PSUBSW 2|PSUBUSW 2|||||||
|add-subtract|||||||||ADDSUBPS 3|ADDSUBPD 3|
|horizontal add|||PHADDW 3.3|PHADDW 3.3|PHADDD 3.3|PHADDD 3.3|||HADDPS 3|HADDPD 3|
|multiply|||PMULLW 2|PMULLW 2|PMULLD 4.1|PMULLD 4.1|||MULPS 1|MULPD 2|
|divide|||||||||DIVPS 1|DIVPD 2|
|absolute value|PABSB 3.3||PABSW 3.3||PABSD 3.3||||||
|minimum|PMINSB 4.1|PMINUB 2|PMINSW 2|PMINUW 4.1|PMINSD 4.1|PMINUD 4.1|||MINPS 1|MINPD 2|
|maximum|PMAXSB 4.1|PMAXUB 2|PMAXSW 2|PMAXUW 4.1|PMAXSD 4.1|PMAXUD 4.1|||MAXPS 1|MAXPD 2|
|approx reciprocal|||||||||RCPPS 1||
|square root|||||||||SQRTPS 1|SQRTPD 2|
|comparison|PCMPxxB^†^ 2|PCMPxxB^†^ 2|PCMPxxW^†^ 2|PCMPxxW^†^ 2|PCMPxxD^†^ 2|PCMPxxD^†^ 2|PCMPxxQ^†^ 4.2|PCMPxxQ^†^ 4.2|CMPxxxPS^‡^ 1|CMPxxxPD^‡^ 2|
|bitwise and|PAND 2|PAND 2|PAND 2|PAND 2|PAND 2|PAND 2|PAND 2|PAND 2|ANDPS 1|ANDPD 2|
|bitwise and-not|PANDN 2|PANDN 2|PANDN 2|PANDN 2|PANDN 2|PANDN 2|PANDN 2|PANDN 2|ANDNPS 1|ANDNPD 2|
|bitwise or|POR 2|POR 2|POR 2|POR 2|POR 2|POR 2|POR 2|POR 2|ORPS 1|ORPD 2|
|bitwise xor|PXOR 2|PXOR 2|PXOR 2|PXOR 2|PXOR 2|PXOR 2|PXOR 2|PXOR 2|XORPS 1|XORPD 2|
|bitwise test|PTEST 4.1|PTEST 4.1|PTEST 4.1|PTEST 4.1|PTEST 4.1|PTEST 4.1|PTEST 4.1|PTEST 4.1|||
|load mask|PMOVMSKB 2|PMOVMSKB 2|PMOVMSKB 2|PMOVMSKB 2|PMOVMSKB 2|PMOVMSKB 2|PMOVMSKB 2|PMOVMSKB 2|MOVMSKPS 1|MOVMSKPD 2|
|shift left|||PSLLW 2|PSLLW 2|PSLLD 2|PSLLD 2|PSLLQ 2|PSLLQ 2|||
|shift right|||PSRAW 2|PSRLW 2|PSRAD 2|PSRLD 2||PSRLQ 2|||
|unpack low|PUNPCKLBW 2|PUNPCKLBW 2|PUNPCKLWD 2|PUNPCKLWD 2|PUNPCKLDQ 2|PUNPCKLDQ 2|PUNPCKLQDQ 2|PUNPCKLQDQ 2|UNPCKLPS 1|UNPCKLPD 2|
|unpack high|PUNPCKHBW 2|PUNPCKHBW 2|PUNPCKHWD 2|PUNPCKHWD 2|PUNPCKHDQ 2|PUNPCKHDQ 2|PUNPCKHQDQ 2|PUNPCKHQDQ 2|UNPCKHPS 1|UNPCKHPD 2|
|static shuffle^§^|||PSHUF[HL]W^‖^ 2|PSHUF[HL]W^‖^ 2|PSHUFD 2|PSHUFD 2|PSHUFD 2|PSHUFD 2|SHUFPS^¶^ 1|SHUFPD^¶^ 2|
|variable shuffle|PSHUFB 3.3|PSHUFB 3.3|PSHUFB 3.3|PSHUFB 3.3|PSHUFB 3.3|PSHUFB 3.3|PSHUFB 3.3|PSHUFB 3.3|||
|static blend|||PBLENDW 4.1|PBLENDW 4.1|PBLENDW 4.1|PBLENDW 4.1|PBLENDW 4.1|PBLENDW 4.1|BLENDPS 4.1|BLENDPD 4.1|
|variable blend^#^|PBLENDVB 4.1|PBLENDVB 4.1|PBLENDVB 4.1|PBLENDVB 4.1|PBLENDVB 4.1|PBLENDVB 4.1|PBLENDVB 4.1|PBLENDVB 4.1|BLENDVPS 4.1|BLENDVPD 4.1|

Notes:

- The SSE2 integer SIMD mnemonics are the same as the MMX mnemonics; however, using them with SSE XMM registers rather than MMX MM registers generates different instructions.
- There are many more instructions that do not fit in this grid, but these are the most important ones to know.
- \* Every move instruction has an aligned (A) and unaligned (U) form.
Aligned can be faster (notably on older processors), but will trap if your address is not a multiple of 16 bytes.
- † Equality (PCMPEQ\_) and signed greater-than (PCMPGT\_) operations are provided for integer vectors. For signed less-than, swap the operands. For signed less/greater-than-or-equal, perform the PCMPGT comparison in the opposite direction and invert the mask, either by using the "Bitwise NOT" idiom below or by reversing subsequent logic. For unsigned tests, bias both inputs by PXORing each component with 0x80, 0x8000, or 0x80000000 (for 8-, 16-, or 32-bit elements) and then use the signed comparison. The 64-bit forms differ in availability: PCMPEQQ is SSE4.1, while PCMPGTQ is SSE4.2.
- ‡ The following floating-point comparison operations are provided: EQ, LT, LE, UNORD, NEQ, NLT, NLE, and ORD. To get greater-than comparisons, swap the operands. LT, LE, NLT, and NLE are signaling comparisons: they raise the Invalid floating-point exception if either input is a NaN, including a quiet NaN.
- § Some shuffle patterns for some vector types can be achieved with specialized instructions that may have better performance or code size than the generalized shuffle instruction. See "Special shuffles" under each vector type below.
- ‖ 16-bit element shuffles only shuffle half of the register at a time.
- ¶ Floating-point shuffles select the high element(s) from the source register and the low element(s) from the destination. To shuffle a single vector, use the same register for source and destination.
- # Variable blends take the blend mask from XMM0 as an implicit operand.
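
The comparison tricks in note † can be sketched in C rather than assembly. This is a minimal illustration, assuming an x86 compiler that provides the SSE2 intrinsics header %<emmintrin.h>%; the helper names %cmpgt\_epu32\_sketch%, %cmpge\_epi32\_sketch%, and %mask4% are hypothetical and not part of any library.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: per-lane unsigned a > b for uint-4.
   XORing 0x80000000 into both inputs maps unsigned order onto signed
   order, so the signed PCMPGTD then yields the unsigned result. */
static __m128i cmpgt_epu32_sketch(__m128i a, __m128i b)
{
    const __m128i bias = _mm_set1_epi32(INT32_MIN);   /* 0x80000000 per lane */
    return _mm_cmpgt_epi32(_mm_xor_si128(a, bias),    /* PXOR, PXOR, PCMPGTD */
                           _mm_xor_si128(b, bias));
}

/* Hypothetical helper: per-lane signed a >= b for int-4, computed as
   NOT (b > a): PCMPGTD in the opposite direction, then invert the mask. */
static __m128i cmpge_epi32_sketch(__m128i a, __m128i b)
{
    __m128i lt = _mm_cmpgt_epi32(b, a);               /* b > a, i.e. a < b */
    return _mm_xor_si128(lt, _mm_cmpeq_epi8(lt, lt)); /* XOR with all-ones */
}

/* Hypothetical helper: pack the four lane masks into bits 0..3 (MOVMSKPS). */
static int mask4(__m128i m)
{
    return _mm_movemask_ps(_mm_castsi128_ps(m));
}
```

Bit %i% of the value returned by %mask4% is set when lane %i% of the comparison mask is true, which makes the result easy to branch on in the same way as the "Any", "None", and "All" tests.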
= Idioms =

___

== all integer types ==

=== Clear all bits ===

[{pxor xmm0, xmm0}]

=== Set all bits ===

[{pcmpeqb xmm0, xmm0}]

=== Bitwise NOT ===

[{pcmpeqb xmm0, xmm0
pxor xmm0, xmm1}]

=== Blend without SSE 4.1 ===

[{; mask is in xmm0 (destroyed)
; if-true is in xmm1 (destroyed)
; if-false is in xmm2
; blended result is in xmm0
pand xmm1, xmm0
pandn xmm0, xmm2
por xmm0, xmm1}]

=== "Any" test ===

Using PTEST: (requires SSE 4.1)

[{ptest xmm0, xmm0
jnz any}]

Using PMOVMSKB:

[{pmovmskb eax, xmm0
test eax, eax
jnz any}]

=== "None" test ===

Using PTEST: (requires SSE 4.1)

[{ptest xmm0, xmm0
jz none}]

Using PMOVMSKB:

[{pmovmskb eax, xmm0
test eax, eax
jz none}]

=== "All" test ===

Using PTEST: (requires SSE 4.1)

[{pcmpeqb xmm1, xmm1 ; xmm1 = all ones
ptest xmm0, xmm1
jc all}]

Using PMOVMSKB:

[{pmovmskb eax, xmm0
not ax ; all 16 mask bits set => ax becomes zero
test ax, ax
jz all}]

___

== int-4 ==

=== Select nth component ===

Directly to a GPR: (requires SSE 4.1)

[{pextrd eax, xmm0, n}]

To low element of an XMM register:

[{pshufd xmm0, xmm1, n}]

Use %movd eax, xmm0% to move the selected element to a GPR.

=== Gather four integers into a vector ===

Directly from GPRs: (requires SSE 4.1)

[{pinsrd xmm0, r8d, 0
pinsrd xmm0, r9d, 1
pinsrd xmm0, r10d, 2
pinsrd xmm0, r11d, 3}]

From low elements of XMM registers:

[{punpckldq xmm0, xmm1 ; xmm0 => 0 1 ? ?
punpckldq xmm2, xmm3 ; xmm2 => 2 3 ? ?
punpcklqdq xmm0, xmm2 ; xmm0 => 0 1 2 3}]

Use %movd xmm0, eax% to load the low element from a GPR.

=== Horizontal add without SSSE3 ===

[{pshufd xmm0, xmm1, 0xb1 ; 1 0 3 2
paddd xmm0, xmm1
pshufd xmm1, xmm0, 0x0a ; 2 2 0 0
paddd xmm0, xmm1}]

Replace %paddd% with any 32-bit integer operation to perform it horizontally.

=== Special shuffles ===

%pshufd% selects the entire destination register from the source, unlike %shufps% and %shufpd%, which select half from the destination and half from the source. Because of this, it doesn't need an initial %movdqa% to be useful in most cases.
Therefore, %pshufd% will be a size win over the instructions below (which do require a move to be nondestructive) unless you're destructively replacing your input.

|order|code|
|0 0 1 1|%punpckldq dst, dst%|
|2 2 3 3|%punpckhdq dst, dst%|
|0 1 0 1|%punpcklqdq dst, dst%|
|2 3 2 3|%punpckhqdq dst, dst%|

___

== float-4 ==

=== Clear all bits ===

[{xorps xmm0, xmm0}]

=== Set all bits ===

[{cmpnltps xmm0, xmm0}]

%cmpeqps% cannot be used because NaN != NaN. Note that %cmpnltps% will set off the Invalid floating-point exception if the register being filled contains NaNs. If you care about the floating-point environment state, clear the register first:

[{xorps xmm0, xmm0
cmpeqps xmm0, xmm0}]

=== Bitwise NOT ===

[{cmpnltps xmm0, xmm0
xorps xmm0, xmm1}]

=== Select nth component ===

Element 0 is already in the low position, so a plain scalar move suffices:

[{movss dst, src}]

Element 1: (requires SSE3)

[{movshdup dst, src}]

Element 2:

[{movhlps dst, src}]

Element 3:

[{movaps dst, src
shufps dst, dst, 0xff ; 3 3 3 3}]

=== Gather four floats into a vector ===

[{unpcklps xmm0, xmm1 ; xmm0 => 0 1 ? ?
unpcklps xmm2, xmm3 ; xmm2 => 2 3 ? ?
movlhps xmm0, xmm2 ; xmm0 => 0 1 2 3}]

=== Broadcast float into four components ===

[{movaps dst, src
shufps dst, dst, n}]

Where %n% selects the element:

|element|%n%|
|0|%0x00%|
|1|%0x55%|
|2|%0xaa%|
|3|%0xff%|

=== Absolute value ===

Using a vector constant %negative\_zeroes\_f = { -0.0f, -0.0f, -0.0f, -0.0f }% from memory:

[{movaps xmm0, negative_zeroes_f
andnps xmm0, xmm1}]

=== Negation ===

Using a vector constant %negative\_zeroes\_f = { -0.0f, -0.0f, -0.0f, -0.0f }% from memory:

[{movaps xmm0, negative_zeroes_f
xorps xmm0, xmm1}]

=== Horizontal add without SSE3 ===

[{movaps xmm1, xmm0
shufps xmm0, xmm1, 0xb1 ; 1 0 3 2
addps xmm0, xmm1
movaps xmm1, xmm0
shufps xmm0, xmm0, 0x0a ; 2 2 0 0
addps xmm0, xmm1}]

Replace %addps% with any packed single-precision operation (for example %minps%, %maxps%, or %mulps%) to perform it horizontally.
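
The float-4 horizontal add above can also be written with compiler intrinsics. The following is a sketch in C, assuming an x86 compiler that provides %<xmmintrin.h>%; the helper names %hadd\_ps\_sketch% and %hsum% are hypothetical. It uses the same two shuffle constants (%0xb1% and %0x0a%) as the assembly version.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: returns a vector whose four lanes all hold
   v[0] + v[1] + v[2] + v[3], using only SSE1 shuffles and adds. */
static __m128 hadd_ps_sketch(__m128 v)
{
    /* 0xb1 selects lanes 1 0 3 2: swap neighbours within each pair. */
    __m128 swapped = _mm_shuffle_ps(v, v, 0xb1);
    /* pairs = p p q q, where p = v0 + v1 and q = v2 + v3. */
    __m128 pairs = _mm_add_ps(v, swapped);
    /* 0x0a selects lanes 2 2 0 0: line up each pair sum with the other. */
    __m128 crossed = _mm_shuffle_ps(pairs, pairs, 0x0a);
    return _mm_add_ps(pairs, crossed);   /* p + q in every lane */
}

/* Hypothetical helper: read the total back as a scalar from lane 0. */
static float hsum(__m128 v)
{
    return _mm_cvtss_f32(hadd_ps_sketch(v));
}
```

Swapping %\_mm\_add\_ps% for %\_mm\_min\_ps% or %\_mm\_max\_ps% in both places gives a horizontal minimum or maximum with the same shuffles.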
=== Blend without SSE 4.1 ===

[{; mask is in xmm0 (destroyed)
; if-true is in xmm1 (destroyed)
; if-false is in xmm2
; blended result is in xmm0
andps xmm1, xmm0
andnps xmm0, xmm2
orps xmm0, xmm1}]

=== "Any" test ===

[{movmskps eax, xmm0
test eax, eax
jnz any}]

=== "None" test ===

[{movmskps eax, xmm0
test eax, eax
jz none}]

=== "All" test ===

[{movmskps eax, xmm0
not eax ; all four mask bits set => low 4 bits become zero
test eax, 0xf
jz all}]

=== Special shuffles ===

|order|code|
|0 0 2 2|%movsldup dst, src%|
|1 1 3 3|%movshdup dst, src%|
|0 1 0 1|%movlhps dst, dst%|
|2 3 2 3|%movhlps dst, dst%|
|0 0 1 1|%unpcklps dst, dst%|
|2 2 3 3|%unpckhps dst, dst%|

___

== double-2 ==

=== Clear all bits ===

[{xorpd xmm0, xmm0}]

=== Set all bits ===

[{cmpnltpd xmm0, xmm0}]

%cmpeqpd% cannot be used because NaN != NaN. Note that %cmpnltpd% will set off the Invalid floating-point exception if the register being filled contains NaNs. If you care about the floating-point environment state, clear the register first:

[{xorpd xmm0, xmm0
cmpeqpd xmm0, xmm0}]

=== Bitwise NOT ===

[{cmpnltpd xmm0, xmm0
xorpd xmm0, xmm1}]

=== Select nth component ===

Element 0 is already in the low position, so a plain scalar move suffices:

[{movsd xmm0, xmm1}]

Element 1:

[{movapd xmm0, xmm1
unpckhpd xmm0, xmm0}]

=== Gather two doubles into a vector ===

[{unpcklpd xmm0, xmm1}]

=== Broadcast double into two components ===

Element 0: (requires SSE3)

[{movddup xmm0, xmm1}]

Element 1:

[{movapd xmm0, xmm1
unpckhpd xmm0, xmm0}]

=== Absolute value ===

Using a vector constant %negative\_zeroes\_d = { -0.0, -0.0 }% from memory:

[{movapd xmm0, negative_zeroes_d
andnpd xmm0, xmm1}]

=== Negation ===

Using a vector constant %negative\_zeroes\_d = { -0.0, -0.0 }% from memory:

[{movapd xmm0, negative_zeroes_d
xorpd xmm0, xmm1}]

=== Horizontal add without SSE3 ===

[{movapd xmm1, xmm0
unpckhpd xmm1, xmm1
addsd xmm0, xmm1}]

Replace %addsd% with any scalar double-precision operation to perform it horizontally.
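
The double-2 sign-mask idioms (absolute value and negation) and the horizontal add can be sketched with intrinsics as well. This is an illustration in C, assuming an x86 compiler that provides the SSE2 header %<emmintrin.h>%; the helper names are hypothetical, and the sign mask is built with %\_mm\_set1\_pd(-0.0)% instead of being loaded from a named memory constant.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Absolute value: ANDNPD with the sign mask clears each lane's sign bit. */
static __m128d abs_pd_sketch(__m128d v)
{
    return _mm_andnot_pd(_mm_set1_pd(-0.0), v);   /* (~mask) & v */
}

/* Negation: XORPD with the sign mask flips each lane's sign bit. */
static __m128d neg_pd_sketch(__m128d v)
{
    return _mm_xor_pd(_mm_set1_pd(-0.0), v);
}

/* Horizontal add: duplicate the high lane, then do a scalar add in lane 0. */
static double hadd_pd_sketch(__m128d v)
{
    __m128d high = _mm_unpackhi_pd(v, v);         /* UNPCKHPD: lanes 1 1 */
    return _mm_cvtsd_f64(_mm_add_sd(v, high));    /* ADDSD, then read lane 0 */
}
```

Because -0.0 has only the sign bit set, the same constant serves both idioms: AND-NOT removes the sign, XOR toggles it.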
=== Blend without SSE 4.1 ===

[{; mask is in xmm0 (destroyed)
; if-true is in xmm1 (destroyed)
; if-false is in xmm2
; blended result is in xmm0
andpd xmm1, xmm0
andnpd xmm0, xmm2
orpd xmm0, xmm1}]

=== "Any" test ===

[{movmskpd eax, xmm0
test eax, eax
jnz any}]

=== "None" test ===

[{movmskpd eax, xmm0
test eax, eax
jz none}]

=== "All" test ===

[{movmskpd eax, xmm0
not eax ; both mask bits set => low 2 bits become zero
test eax, 0x3
jz all}]

=== Special shuffles ===

|order|code|
|0 0|%unpcklpd dst, dst% *or* %movddup dst, src%|
|1 1|%unpckhpd dst, dst%|

= References =

For full details, consult Intel's or AMD's instruction set reference documentation.
