This is a quick reference for Intel's Streaming SIMD Extensions. Feel free to make additions or corrections!
Vector types
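The vector types here are named with the same convention as in Factor's SIMD library: the element type followed by the number of elements that fit in a 128-bit register. For example, int-4 is four signed 32-bit integers and uchar-16 is sixteen unsigned 8-bit integers: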
- char-16
- uchar-16
- short-8
- ushort-8
- int-4
- uint-4
- longlong-2
- ulonglong-2
- float-4
- double-2
Instruction set
The number next to each instruction is the SSE version that introduced it (3.3 denotes SSSE3):
| | char-16 | uchar-16 | short-8 | ushort-8 | int-4 | uint-4 | longlong-2 | ulonglong-2 | float-4 | double-2 |
| move* | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOV[AU]PS 1 | MOV[AU]PD 2 |
| add | PADDB 2 | PADDB 2 | PADDW 2 | PADDW 2 | PADDD 2 | PADDD 2 | PADDQ 2 | PADDQ 2 | ADDPS 1 | ADDPD 2 |
| subtract | PSUBB 2 | PSUBB 2 | PSUBW 2 | PSUBW 2 | PSUBD 2 | PSUBD 2 | PSUBQ 2 | PSUBQ 2 | SUBPS 1 | SUBPD 2 |
| saturated add | PADDSB 2 | PADDUSB 2 | PADDSW 2 | PADDUSW 2 | | | | | | |
| saturated subtract | PSUBSB 2 | PSUBUSB 2 | PSUBSW 2 | PSUBUSW 2 | | | | | | |
| add-subtract | | | | | | | | | ADDSUBPS 3 | ADDSUBPD 3 |
| horizontal add | | | PHADDW 3.3 | PHADDW 3.3 | PHADDD 3.3 | PHADDD 3.3 | | | HADDPS 3 | HADDPD 3 |
| multiply | | | PMULLW 2 | PMULLW 2 | PMULLD 4.1 | PMULLD 4.1 | | | MULPS 1 | MULPD 2 |
| divide | | | | | | | | | DIVPS 1 | DIVPD 2 |
| absolute value | PABSB 3.3 | | PABSW 3.3 | | PABSD 3.3 | | | | | |
| minimum | PMINSB 4.1 | PMINUB 2 | PMINSW 2 | PMINUW 4.1 | PMINSD 4.1 | PMINUD 4.1 | | | MINPS 1 | MINPD 2 |
| maximum | PMAXSB 4.1 | PMAXUB 2 | PMAXSW 2 | PMAXUW 4.1 | PMAXSD 4.1 | PMAXUD 4.1 | | | MAXPS 1 | MAXPD 2 |
| approx reciprocal | | | | | | | | | RCPPS 1 | |
| square root | | | | | | | | | SQRTPS 1 | SQRTPD 2 |
| comparison | PCMPxxB† 2 | PCMPxxB† 2 | PCMPxxW† 2 | PCMPxxW† 2 | PCMPxxD† 2 | PCMPxxD† 2 | PCMPxxQ† 4.2 | PCMPxxQ† 4.2 | CMPxxxPS‡ 1 | CMPxxxPD‡ 2 |
| bitwise and | PAND 2 | PAND 2 | PAND 2 | PAND 2 | PAND 2 | PAND 2 | PAND 2 | PAND 2 | ANDPS 1 | ANDPD 2 |
| bitwise and-not | PANDN 2 | PANDN 2 | PANDN 2 | PANDN 2 | PANDN 2 | PANDN 2 | PANDN 2 | PANDN 2 | ANDNPS 1 | ANDNPD 2 |
| bitwise or | POR 2 | POR 2 | POR 2 | POR 2 | POR 2 | POR 2 | POR 2 | POR 2 | ORPS 1 | ORPD 2 |
| bitwise xor | PXOR 2 | PXOR 2 | PXOR 2 | PXOR 2 | PXOR 2 | PXOR 2 | PXOR 2 | PXOR 2 | XORPS 1 | XORPD 2 |
| bitwise test | PTEST 4.1 | PTEST 4.1 | PTEST 4.1 | PTEST 4.1 | PTEST 4.1 | PTEST 4.1 | PTEST 4.1 | PTEST 4.1 | | |
| load mask | PMOVMSKB 2 | PMOVMSKB 2 | PMOVMSKB 2 | PMOVMSKB 2 | PMOVMSKB 2 | PMOVMSKB 2 | PMOVMSKB 2 | PMOVMSKB 2 | MOVMSKPS 1 | MOVMSKPD 2 |
| shift left | | | PSLLW 2 | PSLLW 2 | PSLLD 2 | PSLLD 2 | PSLLQ 2 | PSLLQ 2 | | |
| shift right | | | PSRAW 2 | PSRLW 2 | PSRAD 2 | PSRLD 2 | | PSRLQ 2 | | |
| unpack low | PUNPCKLBW 2 | PUNPCKLBW 2 | PUNPCKLWD 2 | PUNPCKLWD 2 | PUNPCKLDQ 2 | PUNPCKLDQ 2 | PUNPCKLQDQ 2 | PUNPCKLQDQ 2 | UNPCKLPS 1 | UNPCKLPD 2 |
| unpack high | PUNPCKHBW 2 | PUNPCKHBW 2 | PUNPCKHWD 2 | PUNPCKHWD 2 | PUNPCKHDQ 2 | PUNPCKHDQ 2 | PUNPCKHQDQ 2 | PUNPCKHQDQ 2 | UNPCKHPS 1 | UNPCKHPD 2 |
| static shuffle§ | | | PSHUF[HL]W‖ 2 | PSHUF[HL]W‖ 2 | PSHUFD 2 | PSHUFD 2 | PSHUFD 2 | PSHUFD 2 | SHUFPS¶ 1 | SHUFPD¶ 2 |
| variable shuffle | PSHUFB 3.3 | PSHUFB 3.3 | PSHUFB 3.3 | PSHUFB 3.3 | PSHUFB 3.3 | PSHUFB 3.3 | PSHUFB 3.3 | PSHUFB 3.3 | | |
| static blend | | | PBLENDW 4.1 | PBLENDW 4.1 | PBLENDW 4.1 | PBLENDW 4.1 | PBLENDW 4.1 | PBLENDW 4.1 | BLENDPS 4.1 | BLENDPD 4.1 |
| variable blend# | PBLENDVB 4.1 | PBLENDVB 4.1 | PBLENDVB 4.1 | PBLENDVB 4.1 | PBLENDVB 4.1 | PBLENDVB 4.1 | PBLENDVB 4.1 | PBLENDVB 4.1 | BLENDVPS 4.1 | BLENDVPD 4.1 |
Notes:
- The SSE2 integer SIMD mnemonics are the same as the MMX mnemonics; however, using them with SSE XMM registers rather than MMX MM registers generates different instructions.
- There are many more instructions that do not fit in this grid, but these are the most important ones to know.
- * Every move instruction has an aligned (A) and unaligned (U) form. The aligned form is faster on older processors, but will trap if the address is not a multiple of 16 bytes.
- † Equality (PCMPEQ_) and signed greater-than (PCMPGT_) operations are provided for integer vectors. Of the 64-bit forms, PCMPEQQ is available from SSE 4.1, while PCMPGTQ requires SSE 4.2. For signed less-than, swap the operands. For signed less-than-or-equal or greater-than-or-equal, perform the PCMPGT comparison in the opposite direction and invert the mask, either by using the "Bitwise NOT" idiom below or by reversing subsequent logic. For unsigned tests, bias both inputs by PXORing 0x80, 0x8000, or 0x80000000 into each component before the signed comparison.
- ‡ The following floating-point comparison operations are provided: EQ, LT, LE, UNORD, NEQ, NLT, NLE, and ORD. To get greater-than comparisons, swap the operands. LT, LE, NLT, and NLE are signaling comparisons and will raise the Invalid floating-point exception if either input is a NaN, including a quiet NaN; the other four raise it only for signaling NaNs.
- § Some shuffle patterns for some vector types can be achieved with specialized instructions that may have better performance or code size than the generalized shuffle instruction. See "Special shuffles" under each vector type below.
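- ‖ 16-bit element shuffles only shuffle half of the register at a time: PSHUFLW permutes the low four words and PSHUFHW the high four, and each copies the other half through unchanged.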
- ¶ Floating-point shuffles select the high element(s) from the source register and the low element(s) from the destination. To shuffle a single vector, use the same register for source and destination.
- # Variable blends take the blend mask from XMM0 as an implicit operand.
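The † note's unsigned-comparison trick can be sketched in C with SSE2 intrinsics. This is only an illustration: the helper names cmpgt_epu32 and cmple_epu32 are made up here, and the intrinsics are assumed to compile to the PXOR and PCMPGTD instructions named in the note.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Unsigned a > b on four 32-bit lanes: flip the sign bit of both
   inputs (PXOR with 0x80000000), then use the signed PCMPGTD. */
static __m128i cmpgt_epu32(__m128i a, __m128i b)
{
    const __m128i bias = _mm_set1_epi32(INT32_MIN);  /* 0x80000000 per lane */
    return _mm_cmpgt_epi32(_mm_xor_si128(a, bias), _mm_xor_si128(b, bias));
}

/* Unsigned a <= b: the same comparison with the mask inverted
   (XOR against all ones, i.e. the "Bitwise NOT" idiom). */
static __m128i cmple_epu32(__m128i a, __m128i b)
{
    return _mm_xor_si128(cmpgt_epu32(a, b), _mm_set1_epi32(-1));
}
```

With a lane holding 0xFFFFFFFF compared against 1, the signed PCMPGTD alone would report "not greater"; after the bias both helpers treat it as the largest unsigned value.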
Idioms
all integer types
Clear all bits
pxor xmm0, xmm0
Set all bits
pcmpeqb xmm0, xmm0
Bitwise NOT
pcmpeqb xmm0, xmm0
pxor xmm0, xmm1
Blend without SSE 4.1
; mask is in xmm0 (destroyed)
; if-true is in xmm1 (destroyed)
; if-false is in xmm2
; blended result is in xmm0
pand xmm1, xmm0
pandn xmm0, xmm2
por xmm0, xmm1
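The same three-instruction blend, written as a C sketch with SSE2 intrinsics. The function name blend_si128 is hypothetical, and unlike the assembly above, no input is destroyed because the compiler picks the registers.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Per-bit select: where a mask bit is 1 take if_true, otherwise if_false. */
static __m128i blend_si128(__m128i mask, __m128i if_true, __m128i if_false)
{
    __m128i t = _mm_and_si128(if_true, mask);      /* pand:  if_true & mask      */
    __m128i f = _mm_andnot_si128(mask, if_false);  /* pandn: (~mask) & if_false  */
    return _mm_or_si128(t, f);                     /* por                        */
}
```

Feeding it a comparison mask, for example blend_si128(_mm_cmpgt_epi32(a, b), a, b), yields a per-lane signed maximum without needing SSE 4.1.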
"Any" test
Using PTEST: (requires SSE 4.1)
ptest xmm0, xmm0
jnz any
Using PMOVMSKB:
pmovmskb eax, xmm0
test eax, eax
jnz any
"None" test
Using PTEST: (requires SSE 4.1)
ptest xmm0, xmm0
jz none
Using PMOVMSKB:
pmovmskb eax, xmm0
test eax, eax
jz none
"All" test
Using PTEST: (requires SSE 4.1)
pcmpeqb xmm1, xmm1
ptest xmm0, xmm1
jc all
Using PMOVMSKB:
pmovmskb eax, xmm0
not ax
test ax, ax
jz all
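A C sketch of the three PMOVMSKB-based tests, using SSE2 intrinsics. The function names are hypothetical. It assumes the input is a comparison-style mask, since PMOVMSKB only looks at the top bit of each byte.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* PMOVMSKB packs the top bit of each of the 16 bytes into bits 0..15. */
static int any_set(__m128i mask)  { return _mm_movemask_epi8(mask) != 0; }
static int none_set(__m128i mask) { return _mm_movemask_epi8(mask) == 0; }
static int all_set(__m128i mask)  { return _mm_movemask_epi8(mask) == 0xFFFF; }
```

Comparing against 0xFFFF is equivalent to the not/test pair in the assembly version: both succeed only when all sixteen mask bits are set.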
int-4
Select nth component
Directly to a GPR: (requires SSE 4.1)
pextrd eax, xmm0, n
To low element of an XMM register:
pshufd xmm0, xmm1, n
Use movd eax, xmm0 to move the selected element to a GPR.
Gather four integers into a vector
Directly from GPRs: (requires SSE 4.1)
pinsrd xmm0, r8d, 0
pinsrd xmm0, r9d, 1
pinsrd xmm0, r10d, 2
pinsrd xmm0, r11d, 3
From low elements of XMM registers:
punpckldq xmm0, xmm1 ; xmm0 => 0 1 ? ?
punpckldq xmm2, xmm3 ; xmm2 => 2 3 ? ?
punpcklqdq xmm0, xmm2 ; xmm0 => 0 1 2 3
Use movd xmm0, eax to load the low element from a GPR.
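The unpack-based gather can be sketched in C with SSE2 intrinsics as follows. The function name gather4_epi32 is made up for illustration, and each scalar is first moved into the low element of its own register with the movd equivalent.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

static __m128i gather4_epi32(int a, int b, int c, int d)
{
    __m128i x0 = _mm_cvtsi32_si128(a);        /* movd: a 0 0 0 */
    __m128i x1 = _mm_cvtsi32_si128(b);
    __m128i x2 = _mm_cvtsi32_si128(c);
    __m128i x3 = _mm_cvtsi32_si128(d);
    __m128i lo = _mm_unpacklo_epi32(x0, x1);  /* punpckldq:  a b 0 0 */
    __m128i hi = _mm_unpacklo_epi32(x2, x3);  /* punpckldq:  c d 0 0 */
    return _mm_unpacklo_epi64(lo, hi);        /* punpcklqdq: a b c d */
}
```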
Horizontal add without SSSE3
pshufd xmm0, xmm1, 0xb1 ; 1 0 3 2
paddd xmm0, xmm1
pshufd xmm1, xmm0, 0x0a ; 2 2 0 0
paddd xmm0, xmm1
Replace paddd with another element-wise binary operation on 32-bit integers, such as pand or por, to perform that reduction horizontally.
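As a C sketch with SSE2 intrinsics, the shuffle-and-add reduction looks like this. The name hsum_epi32 is hypothetical; the immediates are the same 0xb1 and 0x0a used above.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Sum of the four 32-bit lanes using only SSE2. */
static int hsum_epi32(__m128i v)
{
    __m128i t = _mm_shuffle_epi32(v, 0xb1);  /* pshufd: 1 0 3 2 */
    t = _mm_add_epi32(t, v);                 /* s01 s01 s23 s23 */
    __m128i u = _mm_shuffle_epi32(t, 0x0a);  /* pshufd: 2 2 0 0 */
    t = _mm_add_epi32(t, u);                 /* total in every lane */
    return _mm_cvtsi128_si32(t);             /* movd: read the low lane */
}
```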
Special shuffles
pshufd writes every element of the destination from the source, unlike shufps and shufpd, which take half of their result from the destination and half from the source. Because of this, it needs no initial movdqa to be nondestructive. The special shuffles below operate in place and do require a move to be nondestructive, so pshufd will be a size win over them unless you're destructively replacing your input.
| order | code |
| 0 0 1 1 | punpckldq dst, dst |
| 2 2 3 3 | punpckhdq dst, dst |
| 0 1 0 1 | punpcklqdq dst, dst |
| 2 3 2 3 | punpckhqdq dst, dst |
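One way to convince yourself of the table is to compare each unpack against the general shuffle. The sketch below does that in C with SSE2 intrinsics. The pshufd immediates (0x50, 0xfa, 0x44, 0xee) are not from the table; they are worked out here from the element orders, and the function names are hypothetical.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Nonzero if all 128 bits of a and b are equal. */
static int same_bits(__m128i a, __m128i b)
{
    return _mm_movemask_epi8(_mm_cmpeq_epi8(a, b)) == 0xFFFF;
}

/* Checks each special shuffle against pshufd with the matching order. */
static int special_shuffles_match(__m128i v)
{
    return same_bits(_mm_unpacklo_epi32(v, v), _mm_shuffle_epi32(v, 0x50))   /* 0 0 1 1 */
        && same_bits(_mm_unpackhi_epi32(v, v), _mm_shuffle_epi32(v, 0xfa))   /* 2 2 3 3 */
        && same_bits(_mm_unpacklo_epi64(v, v), _mm_shuffle_epi32(v, 0x44))   /* 0 1 0 1 */
        && same_bits(_mm_unpackhi_epi64(v, v), _mm_shuffle_epi32(v, 0xee));  /* 2 3 2 3 */
}
```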
float-4
Clear all bits
xorps xmm0, xmm0
Set all bits
cmpnltps xmm0, xmm0
cmpeqps cannot be used on its own because NaN != NaN: any element that happens to hold a NaN would compare false. cmpnltps is true for every input, but it will set off the Invalid floating-point exception if the register being filled contains NaNs. If you care about the floating-point environment state, clear the register first:
xorps xmm0, xmm0
cmpeqps xmm0, xmm0
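Both variants can be sketched in C with SSE intrinsics. The function names are hypothetical, is_all_ones is only a checking helper, and the compiler is assumed to emit the comparison instructions named in the comments.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in the SSE ones) */

/* cmpnltps x, x: "not less than" is true in every lane, NaN or not,
   but may raise Invalid when x holds a NaN. */
static __m128 all_ones_ps(__m128 x)
{
    return _mm_cmpnlt_ps(x, x);
}

/* xorps + cmpeqps: 0.0 == 0.0 is true in every lane and no NaN is ever
   compared, so the floating-point flags are left alone. */
static __m128 all_ones_quiet_ps(void)
{
    __m128 zero = _mm_setzero_ps();
    return _mm_cmpeq_ps(zero, zero);
}

/* Checking helper: nonzero if all 128 bits of v are set. */
static int is_all_ones(__m128 v)
{
    __m128i bits = _mm_castps_si128(v);
    return _mm_movemask_epi8(_mm_cmpeq_epi8(bits, _mm_set1_epi32(-1))) == 0xFFFF;
}
```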
Bitwise NOT
cmpnltps xmm0, xmm0
xorps xmm0, xmm1
Select nth component
Element 0 is a no-op:
movss dst, src
Element 1: (requires SSE3)
movshdup dst, src
Element 2:
movhlps dst, src
Element 3:
movaps dst, src
shufps dst, dst, 0xff ; 3 3 3 3
Gather four floats into a vector
unpcklps xmm0, xmm1 ; xmm0 => 0 1 ? ?
unpcklps xmm2, xmm3 ; xmm2 => 2 3 ? ?
movlhps xmm0, xmm2 ; xmm0 => 0 1 2 3
Broadcast float into four components
movaps dst, src
shufps dst, dst, n
Where n selects the element:
| element | n |
| 0 | 0x00 |
| 1 | 0x55 |
| 2 | 0xaa |
| 3 | 0xff |
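The constants in the table follow a pattern: every 2-bit field of the shufps immediate holds the same element index, so n is the element number times 0x55. A small C sketch with SSE intrinsics, where the macro name BROADCAST_PS is made up for illustration:

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Broadcast one element of v to all four lanes.
   `element` must be a compile-time constant in 0..3, because the
   shufps selector is an immediate operand. */
#define BROADCAST_PS(v, element) _mm_shuffle_ps((v), (v), (element) * 0x55)
```

For example, BROADCAST_PS(v, 2) expands to a shuffle with immediate 0xaa.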
Absolute value
Using a vector constant negative_zeroes_f = { -0.0f, -0.0f, -0.0f, -0.0f } from memory:
movaps xmm0, negative_zeroes_f
andnps xmm0, xmm1
Negation
Using a vector constant negative_zeroes_f = { -0.0f, -0.0f, -0.0f, -0.0f } from memory:
movaps xmm0, negative_zeroes_f
xorps xmm0, xmm1
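Both sign-bit idioms work because -0.0f has only the sign bit set. A C sketch with SSE intrinsics, where the function names are hypothetical and the constant is built with _mm_set1_ps rather than loaded from a named memory location:

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Absolute value: clear the sign bit of every lane (andnps). */
static __m128 abs_ps(__m128 v)
{
    const __m128 sign = _mm_set1_ps(-0.0f);  /* 0x80000000 in each lane */
    return _mm_andnot_ps(sign, v);           /* (~sign) & v */
}

/* Negation: flip the sign bit of every lane (xorps). */
static __m128 neg_ps(__m128 v)
{
    const __m128 sign = _mm_set1_ps(-0.0f);
    return _mm_xor_ps(sign, v);
}
```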
Horizontal add without SSE3
movaps xmm1, xmm0
shufps xmm0, xmm1, 0xb1 ; 1 0 3 2
addps xmm0, xmm1
movaps xmm1, xmm0
shufps xmm0, xmm0, 0x0a ; 2 2 0 0
addps xmm0, xmm1
Replace addps with another element-wise binary vector instruction, such as mulps, minps, or maxps, to perform that reduction horizontally.
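A C sketch of the same reduction with SSE intrinsics. The name hsum_ps is hypothetical; the shuffle immediates are the ones used above.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Sum of the four float lanes using only SSE. */
static float hsum_ps(__m128 v)
{
    __m128 t = _mm_shuffle_ps(v, v, 0xb1);  /* shufps: 1 0 3 2 */
    t = _mm_add_ps(t, v);                   /* s01 s01 s23 s23 */
    __m128 u = _mm_shuffle_ps(t, t, 0x0a);  /* shufps: 2 2 0 0 */
    t = _mm_add_ps(t, u);                   /* total in every lane */
    return _mm_cvtss_f32(t);                /* read the low lane */
}
```

Note that the lanes are added pairwise, (x0 + x1) + (x2 + x3), so the rounding can differ from a left-to-right scalar sum.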
Blend without SSE 4.1
; mask is in xmm0 (destroyed)
; if-true is in xmm1 (destroyed)
; if-false is in xmm2
; blended result is in xmm0
andps xmm1, xmm0
andnps xmm0, xmm2
orps xmm0, xmm1
"Any" test
movmskps eax, xmm0
test eax, eax
jnz any
"None" test
movmskps eax, xmm0
test eax, eax
jz none
"All" test
movmskps eax, xmm0
not eax
test eax, 0xf
jz all
Special shuffles
| order | code |
| 0 0 2 2 | movsldup dst, src (SSE3) |
| 1 1 3 3 | movshdup dst, src (SSE3) |
| 0 1 0 1 | movlhps dst, dst |
| 2 3 2 3 | movhlps dst, dst |
| 0 0 1 1 | unpcklps dst, dst |
| 2 2 3 3 | unpckhps dst, dst |
double-2
Clear all bits
xorpd xmm0, xmm0
Set all bits
cmpnltpd xmm0, xmm0
cmpeqpd cannot be used on its own because NaN != NaN: any element that happens to hold a NaN would compare false. cmpnltpd is true for every input, but it will set off the Invalid floating-point exception if the register being filled contains NaNs. If you care about the floating-point environment state, clear the register first:
xorpd xmm0, xmm0
cmpeqpd xmm0, xmm0
Bitwise NOT
cmpnltpd xmm0, xmm0
xorpd xmm0, xmm1
Select nth component
Element 0 is a no-op:
movsd xmm0, xmm1
Element 1:
movapd xmm0, xmm1
unpckhpd xmm0, xmm0
Gather two doubles into a vector
unpcklpd xmm0, xmm1
Broadcast double into two components
Element 0: (requires SSE3)
movddup xmm0, xmm1
Element 1:
movapd xmm0, xmm1
unpckhpd xmm0, xmm0
Absolute value
Using a vector constant negative_zeroes_d = { -0.0, -0.0 } from memory:
movapd xmm0, negative_zeroes_d
andnpd xmm0, xmm1
Negation
Using a vector constant negative_zeroes_d = { -0.0, -0.0 } from memory:
movapd xmm0, negative_zeroes_d
xorpd xmm0, xmm1
Horizontal add without SSE3
movapd xmm1, xmm0
unpckhpd xmm1, xmm1
addsd xmm0, xmm1
Replace addsd with another scalar double-precision instruction, such as mulsd, minsd, or maxsd, to perform that reduction horizontally.
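In C with SSE2 intrinsics, the two-element reduction might be sketched like this. The name hsum_pd is hypothetical.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Sum of the two double lanes using only SSE2. */
static double hsum_pd(__m128d v)
{
    __m128d hi = _mm_unpackhi_pd(v, v);       /* unpckhpd: 1 1 */
    return _mm_cvtsd_f64(_mm_add_sd(v, hi));  /* addsd, then read the low lane */
}
```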
Blend without SSE 4.1
; mask is in xmm0 (destroyed)
; if-true is in xmm1 (destroyed)
; if-false is in xmm2
; blended result is in xmm0
andpd xmm1, xmm0
andnpd xmm0, xmm2
orpd xmm0, xmm1
"Any" test
movmskpd eax, xmm0
test eax, eax
jnz any
"None" test
movmskpd eax, xmm0
test eax, eax
jz none
"All" test
movmskpd eax, xmm0
not eax
test eax, 0x3
jz all
Special shuffles
| order | code |
| 0 0 | unpcklpd dst, dst or movddup dst, src (SSE3) |
| 1 1 | unpckhpd dst, dst |
References
For full details, consult Intel's or AMD's instruction set reference documentation.