This is a quick reference for Intel's Streaming SIMD Extensions. Feel free to make additions or corrections!
Vector types
The vector types here are named with the same convention as in Factor's SIMD library: the element type followed by the number of elements that fit in a 128-bit XMM register.
- char-16
- uchar-16
- short-8
- ushort-8
- int-4
- uint-4
- longlong-2
- ulonglong-2
- float-4
- double-2
Instruction set
The number next to each instruction is the SSE version that provides it (3.3 denotes SSSE3):
| char-16 | uchar-16 | short-8 | ushort-8 | int-4 | uint-4 | longlong-2 | ulonglong-2 | float-4 | double-2 |
move^{*} | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOV[AU]PS 1 | MOV[AU]PD 2 |
add | PADDB 2 | PADDB 2 | PADDW 2 | PADDW 2 | PADDD 2 | PADDD 2 | PADDQ 2 | PADDQ 2 | ADDPS 1 | ADDPD 2 |
subtract | PSUBB 2 | PSUBB 2 | PSUBW 2 | PSUBW 2 | PSUBD 2 | PSUBD 2 | PSUBQ 2 | PSUBQ 2 | SUBPS 1 | SUBPD 2 |
saturated add | PADDSB 2 | PADDUSB 2 | PADDSW 2 | PADDUSW 2 | | | | | | |
saturated subtract | PSUBSB 2 | PSUBUSB 2 | PSUBSW 2 | PSUBUSW 2 | | | | | | |
add-subtract | | | | | | | | | ADDSUBPS 3 | ADDSUBPD 3 |
horizontal add | | | PHADDW 3.3 | PHADDW 3.3 | PHADDD 3.3 | PHADDD 3.3 | | | HADDPS 3 | HADDPD 3 |
multiply | | | PMULLW 2 | PMULLW 2 | PMULLD 4.1 | PMULLD 4.1 | | | MULPS 1 | MULPD 2 |
divide | | | | | | | | | DIVPS 1 | DIVPD 2 |
absolute value | PABSB 3.3 | | PABSW 3.3 | | PABSD 3.3 | | | | | |
minimum | PMINSB 4.1 | PMINUB 2 | PMINSW 2 | PMINUW 4.1 | PMINSD 4.1 | PMINUD 4.1 | | | MINPS 1 | MINPD 2 |
maximum | PMAXSB 4.1 | PMAXUB 2 | PMAXSW 2 | PMAXUW 4.1 | PMAXSD 4.1 | PMAXUD 4.1 | | | MAXPS 1 | MAXPD 2 |
approx reciprocal | | | | | | | | | RCPPS 1 | |
square root | | | | | | | | | SQRTPS 1 | SQRTPD 2 |
comparison | PCMPxxB^{†} 2 | PCMPxxB^{†} 2 | PCMPxxW^{†} 2 | PCMPxxW^{†} 2 | PCMPxxD^{†} 2 | PCMPxxD^{†} 2 | | | CMPxxxPS^{‡} 1 | CMPxxxPD^{‡} 2 |
bitwise and | PAND 2 | PAND 2 | PAND 2 | PAND 2 | PAND 2 | PAND 2 | PAND 2 | PAND 2 | ANDPS 1 | ANDPD 2 |
bitwise and-not | PANDN 2 | PANDN 2 | PANDN 2 | PANDN 2 | PANDN 2 | PANDN 2 | PANDN 2 | PANDN 2 | ANDNPS 1 | ANDNPD 2 |
bitwise or | POR 2 | POR 2 | POR 2 | POR 2 | POR 2 | POR 2 | POR 2 | POR 2 | ORPS 1 | ORPD 2 |
bitwise xor | PXOR 2 | PXOR 2 | PXOR 2 | PXOR 2 | PXOR 2 | PXOR 2 | PXOR 2 | PXOR 2 | XORPS 1 | XORPD 2 |
load mask | PMOVMSKB 2 | PMOVMSKB 2 | PMOVMSKB 2 | PMOVMSKB 2 | PMOVMSKB 2 | PMOVMSKB 2 | PMOVMSKB 2 | PMOVMSKB 2 | MOVMSKPS 1 | MOVMSKPD 2 |
shift left | | | PSLLW 2 | PSLLW 2 | PSLLD 2 | PSLLD 2 | PSLLQ 2 | PSLLQ 2 | | |
shift right | | | PSRAW 2 | PSRLW 2 | PSRAD 2 | PSRLD 2 | | PSRLQ 2 | | |
unpack low | PUNPCKLBW 2 | PUNPCKLBW 2 | PUNPCKLWD 2 | PUNPCKLWD 2 | PUNPCKLDQ 2 | PUNPCKLDQ 2 | PUNPCKLQDQ 2 | PUNPCKLQDQ 2 | UNPCKLPS 1 | UNPCKLPD 2 |
unpack high | PUNPCKHBW 2 | PUNPCKHBW 2 | PUNPCKHWD 2 | PUNPCKHWD 2 | PUNPCKHDQ 2 | PUNPCKHDQ 2 | PUNPCKHQDQ 2 | PUNPCKHQDQ 2 | UNPCKHPS 1 | UNPCKHPD 2 |
static shuffle^{§} | | | PSHUF[HL]W^{‖} 2 | PSHUF[HL]W^{‖} 2 | PSHUFD 2 | PSHUFD 2 | PSHUFD 2 | PSHUFD 2 | SHUFPS^{¶} 1 | SHUFPD^{¶} 2 |
variable shuffle | PSHUFB 3.3 | PSHUFB 3.3 | PSHUFB 3.3 | PSHUFB 3.3 | PSHUFB 3.3 | PSHUFB 3.3 | PSHUFB 3.3 | PSHUFB 3.3 | | |
static blend | | | PBLENDW 4.1 | PBLENDW 4.1 | PBLENDW 4.1 | PBLENDW 4.1 | PBLENDW 4.1 | PBLENDW 4.1 | BLENDPS 4.1 | BLENDPD 4.1 |
variable blend^{#} | PBLENDVB 4.1 | PBLENDVB 4.1 | PBLENDVB 4.1 | PBLENDVB 4.1 | PBLENDVB 4.1 | PBLENDVB 4.1 | PBLENDVB 4.1 | PBLENDVB 4.1 | BLENDVPS 4.1 | BLENDVPD 4.1 |
Notes:
- The SSE2 integer SIMD mnemonics are the same as the MMX mnemonics; however, using them with SSE XMM registers rather than MMX MM registers generates different instructions.
- There are many more instructions that do not fit in this grid, but these are the most important ones to know.
- * Every move instruction has an aligned (A) and an unaligned (U) form. The aligned form can be faster, particularly on older processors, but it faults if the address is not a multiple of 16 bytes.
- † Equality (PCMPEQ_) and signed greater-than (PCMPGT_) operations are provided for integer vectors. For signed less-than, swap the operands. For signed less-than-or-equal or greater-than-or-equal, perform both the PCMPEQ and PCMPGT comparisons and POR the results together. For unsigned tests, first bias both inputs by PXORing 0x80, 0x8000, or 0x80000000 into each component.
- ‡ The following floating-point comparison operations are provided: EQ, LT, LE, UNORD, NEQ, NLT, NLE, and ORD. To get greater-than comparisons, swap the operands. LT, LE, NLT, and NLE are signaling comparisons and will raise the Invalid floating-point exception if either input is a NaN, including a quiet NaN.
- § Some shuffle patterns for some vector types can be achieved with specialized instructions that may have better performance or code size than the generalized shuffle instruction. See "Special shuffles" under each vector type below.
- ‖ 16-bit element shuffles only shuffle half of the register at a time.
- ¶ Floating-point shuffles select the high element(s) from the source register and the low element(s) from the destination. To shuffle a single vector, use the same register for source and destination.
- # Variable blends take the blend mask from XMM0 as an implicit operand.
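The integer comparison note (†) can be sketched with compiler intrinsics that map onto the same instructions. This is a minimal illustration for char-16/uchar-16 only, assuming an x86 compiler that provides `<emmintrin.h>`; the helper names `cmpgt_epu8` and `cmpge_epi8` are hypothetical and not part of any library.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: per-byte unsigned a > b, built from the signed PCMPGTB. */
static __m128i cmpgt_epu8(__m128i a, __m128i b)
{
    /* PXOR with 0x80 flips the top bit, mapping unsigned order onto signed order. */
    const __m128i bias = _mm_set1_epi8((char)0x80);
    return _mm_cmpgt_epi8(_mm_xor_si128(a, bias), _mm_xor_si128(b, bias));
}

/* Hypothetical helper: per-byte signed a >= b, as (a == b) | (a > b). */
static __m128i cmpge_epi8(__m128i a, __m128i b)
{
    return _mm_or_si128(_mm_cmpeq_epi8(a, b),   /* PCMPEQB */
                        _mm_cmpgt_epi8(a, b));  /* PCMPGTB, then POR */
}
```

Each result byte is all ones where the test holds and zero elsewhere, so the result can be fed straight into PMOVMSKB or used as a blend mask.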
Idioms
all integer types
Blend without SSE 4.1
; mask is in xmm0 (destroyed)
; if-true is in xmm1 (destroyed)
; if-false is in xmm2
; blended result is in xmm0
pand xmm1, xmm0
pandn xmm0, xmm2
por xmm0, xmm1
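The same three-instruction blend can be written with intrinsics, which leaves register allocation to the compiler. This is a sketch assuming `<emmintrin.h>` is available; `blend_si128` is a hypothetical helper name.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: bitwise select, (mask & if_true) | (~mask & if_false). */
static __m128i blend_si128(__m128i mask, __m128i if_true, __m128i if_false)
{
    __m128i t = _mm_and_si128(if_true, mask);      /* PAND */
    __m128i f = _mm_andnot_si128(mask, if_false);  /* PANDN: (~mask) & if_false */
    return _mm_or_si128(t, f);                     /* POR */
}
```

The mask is expected to be all ones or all zeros in each element, as produced by the PCMPxx comparisons.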
int-4
Select nth component
Directly to a GPR: (requires SSE 4.1)
pextrd eax, xmm0, n
To low element of an XMM register:
pshufd xmm0, xmm1, n
Use movd eax, xmm0 to move the selected element to a GPR.
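As an illustration of the SSE2 route, the sketch below extracts element 2 with the intrinsics for PSHUFD and MOVD. It assumes `<emmintrin.h>`; `select2_epi32` is a hypothetical name, and the element index is fixed because the shuffle immediate must be a compile-time constant.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: return element 2 of an int-4 vector. */
static int select2_epi32(__m128i v)
{
    /* PSHUFD with n = 2: element 2 moves to the low position. */
    __m128i low = _mm_shuffle_epi32(v, 2);
    /* MOVD: copy the low element to a general-purpose register. */
    return _mm_cvtsi128_si32(low);
}
```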
Gather four integers into a vector
Directly from GPRs: (requires SSE 4.1)
pinsrd xmm0, r8d, 0
pinsrd xmm0, r9d, 1
pinsrd xmm0, r10d, 2
pinsrd xmm0, r11d, 3
From low elements of XMM registers:
punpckldq xmm0, xmm1 ; xmm0 => 0 1 ? ?
punpckldq xmm2, xmm3 ; xmm2 => 2 3 ? ?
punpcklqdq xmm0, xmm2 ; xmm0 => 0 1 2 3
Use movd xmm0, eax to load the low element from a GPR.
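The SSE2 gather sequence can be sketched with intrinsics as follows. It assumes `<emmintrin.h>`; `gather4_epi32` is a hypothetical helper name, and in practice `_mm_set_epi32` lets the compiler pick a sequence for you.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: build the vector (a, b, c, d), element 0 first. */
static __m128i gather4_epi32(int a, int b, int c, int d)
{
    /* MOVD: each scalar lands in the low element, upper elements are zero. */
    __m128i x0 = _mm_cvtsi32_si128(a);
    __m128i x1 = _mm_cvtsi32_si128(b);
    __m128i x2 = _mm_cvtsi32_si128(c);
    __m128i x3 = _mm_cvtsi32_si128(d);
    __m128i lo = _mm_unpacklo_epi32(x0, x1);  /* PUNPCKLDQ: a b 0 0 */
    __m128i hi = _mm_unpacklo_epi32(x2, x3);  /* PUNPCKLDQ: c d 0 0 */
    return _mm_unpacklo_epi64(lo, hi);        /* PUNPCKLQDQ: a b c d */
}
```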
Special shuffles
pshufd selects the entire destination register from the source, unlike shufps and shufpd, which select half from the destination and half from the source, so in most cases it doesn't need an initial movdqa to be useful. pshufd will therefore still be a size win over the instructions in the table below (which do require a move to be nondestructive) unless you're destructively replacing your input.
order | code |
0 0 1 1 | punpckldq dst, dst |
2 2 3 3 | punpckhdq dst, dst |
0 1 0 1 | punpcklqdq dst, dst |
2 3 2 3 | punpckhqdq dst, dst |
float-4
Select nth component
Element 0 is already in the low position, so no shuffle is needed; to copy it to another register:
movss dst, src
Element 1 (requires SSE3):
movshdup dst, src
Element 2:
movhlps dst, src
Element 3:
movaps dst, src
shufps dst, dst, 0xff ; 3 3 3 3
Gather four floats into a vector
unpcklps xmm0, xmm1 ; xmm0 => 0 1 ? ?
unpcklps xmm2, xmm3 ; xmm2 => 2 3 ? ?
movlhps xmm0, xmm2 ; xmm0 => 0 1 2 3
Broadcast float into four components
movaps dst, src
shufps dst, dst, n
Where n selects the element:
element | n |
0 | 0x00 |
1 | 0x55 |
2 | 0xaa |
3 | 0xff |
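The immediates in the table can be checked with the SHUFPS intrinsic. This sketch assumes `<xmmintrin.h>`; the helper names are hypothetical, and one function per element is used because the shuffle immediate must be a compile-time constant.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helpers: broadcast one element of v to all four positions. */
static __m128 broadcast1_ps(__m128 v)
{
    return _mm_shuffle_ps(v, v, 0x55);  /* SHUFPS, every 2-bit field selects element 1 */
}

static __m128 broadcast3_ps(__m128 v)
{
    return _mm_shuffle_ps(v, v, 0xff);  /* SHUFPS, every 2-bit field selects element 3 */
}
```

Each immediate is the 2-bit element index repeated four times, which is where 0x00, 0x55, 0xaa, and 0xff come from.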
Absolute value
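No code is given under this heading. One common approach, offered here as an assumption rather than as the original entry, is to clear the sign bit of each element with ANDPS and a 0x7fffffff mask. The sketch assumes `<emmintrin.h>`; `abs_ps` is a hypothetical helper name.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (includes the SSE float operations) */

/* Hypothetical helper: absolute value of each float by clearing its sign bit. */
static __m128 abs_ps(__m128 v)
{
    /* Every bit set except the sign bit, in each 32-bit element. */
    const __m128 mask = _mm_castsi128_ps(_mm_set1_epi32(0x7fffffff));
    return _mm_and_ps(v, mask);  /* ANDPS */
}
```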
Horizontal add with SSE2
movaps xmm1, xmm0
shufps xmm0, xmm1, 0xb1 ; 1 0 3 2
addps xmm0, xmm1
movaps xmm1, xmm0
shufps xmm0, xmm0, 0x0a ; 2 2 0 0
addps xmm0, xmm1
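The same shuffle-and-add pattern can be sketched with intrinsics, which makes the intermediate values easier to follow. It assumes `<xmmintrin.h>`; `hadd_all_ps` is a hypothetical helper name.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: sum of all four elements, replicated in every element. */
static __m128 hadd_all_ps(__m128 v)
{
    __m128 swapped = _mm_shuffle_ps(v, v, 0xb1);          /* 1 0 3 2 */
    __m128 pairs   = _mm_add_ps(v, swapped);              /* a+b a+b c+d c+d */
    __m128 crossed = _mm_shuffle_ps(pairs, pairs, 0x0a);  /* 2 2 0 0 */
    return _mm_add_ps(pairs, crossed);                    /* a+b+c+d in all four */
}
```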
Blend without SSE 4.1
; mask is in xmm0 (destroyed)
; if-true is in xmm1 (destroyed)
; if-false is in xmm2
; blended result is in xmm0
andps xmm1, xmm0
andnps xmm0, xmm2
orps xmm0, xmm1
Special shuffles
order | code |
0 0 2 2 | movsldup dst, src |
1 1 3 3 | movshdup dst, src |
0 1 0 1 | movlhps dst, dst |
2 3 2 3 | movhlps dst, dst |
0 0 1 1 | unpcklps dst, dst |
2 2 3 3 | unpckhps dst, dst |
double-2
Select nth component
Gather two doubles into a vector
unpcklpd xmm0, xmm1
Broadcast double into two components
Element 0 (movddup requires SSE3):
movddup xmm0, xmm1
Element 1:
movapd xmm0, xmm1
unpckhpd xmm0, xmm0
Absolute value
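No code is given under this heading. As an assumption rather than the original entry, one common approach clears the sign bit with ANDNPD, using -0.0 (which has only the sign bit set) as the mask. The sketch assumes `<emmintrin.h>`; `abs_pd` is a hypothetical helper name.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: absolute value of each double by clearing its sign bit. */
static __m128d abs_pd(__m128d v)
{
    const __m128d sign = _mm_set1_pd(-0.0);  /* only the sign bit is set */
    return _mm_andnot_pd(sign, v);           /* ANDNPD: (~sign) & v */
}
```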
Horizontal add with SSE2
movapd xmm1, xmm0
unpckhpd xmm1, xmm1 ; xmm1 => 1 1
addsd xmm0, xmm1 ; sum is in the low element of xmm0
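An intrinsics sketch of the same sequence, returning the sum as a scalar. It assumes `<emmintrin.h>`; `hadd_pd` is a hypothetical helper name.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: sum of the two elements of a double-2 vector. */
static double hadd_pd(__m128d v)
{
    __m128d hi  = _mm_unpackhi_pd(v, v);  /* UNPCKHPD: element 1 in both positions */
    __m128d sum = _mm_add_sd(v, hi);      /* ADDSD: only the low element is added */
    return _mm_cvtsd_f64(sum);            /* read the low element */
}
```

Only the low element of the result holds the sum; the high element still contains the original element 1.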
Blend without SSE 4.1
; mask is in xmm0 (destroyed)
; if-true is in xmm1 (destroyed)
; if-false is in xmm2
; blended result is in xmm0
andpd xmm1, xmm0
andnpd xmm0, xmm2
orpd xmm0, xmm1
Special shuffles
order | code |
0 0 | unpcklpd dst, dst or movddup dst, src |
1 1 | unpckhpd dst, dst |
References
For full details, consult Intel's or AMD's instruction set reference documentation.