SSE

This is a quick reference for Intel's Streaming SIMD Extensions. Feel free to make additions or corrections!

Vector types

The vector types here are named with the same convention as in Factor's SIMD library: the element type, followed by the number of elements that fit in a 128-bit XMM register.

  • char-16
  • uchar-16
  • short-8
  • ushort-8
  • int-4
  • uint-4
  • longlong-2
  • ulonglong-2
  • float-4
  • double-2

Instruction set

The number next to each instruction is the SSE version that introduced it; 3.3 denotes SSSE3:

operation           char-16        uchar-16       short-8        ushort-8       int-4          uint-4         longlong-2     ulonglong-2    float-4        double-2
move*               MOVDQ[AU] 2    MOVDQ[AU] 2    MOVDQ[AU] 2    MOVDQ[AU] 2    MOVDQ[AU] 2    MOVDQ[AU] 2    MOVDQ[AU] 2    MOVDQ[AU] 2    MOV[AU]PS 1    MOV[AU]PD 2
add                 PADDB 2        PADDB 2        PADDW 2        PADDW 2        PADDD 2        PADDD 2        PADDQ 2        PADDQ 2        ADDPS 1        ADDPD 2
subtract            PSUBB 2        PSUBB 2        PSUBW 2        PSUBW 2        PSUBD 2        PSUBD 2        PSUBQ 2        PSUBQ 2        SUBPS 1        SUBPD 2
saturated add       PADDSB 2       PADDUSB 2      PADDSW 2       PADDUSW 2      -              -              -              -              -              -
saturated subtract  PSUBSB 2       PSUBUSB 2      PSUBSW 2       PSUBUSW 2      -              -              -              -              -              -
add-subtract        -              -              -              -              -              -              -              -              ADDSUBPS 3     ADDSUBPD 3
horizontal add      -              -              PHADDW 3.3     PHADDW 3.3     PHADDD 3.3     PHADDD 3.3     -              -              HADDPS 3       HADDPD 3
multiply            -              -              PMULLW 2       PMULLW 2       PMULLD 4.1     PMULLD 4.1     -              -              MULPS 1        MULPD 2
divide              -              -              -              -              -              -              -              -              DIVPS 1        DIVPD 2
absolute value      PABSB 3.3      -              PABSW 3.3      -              PABSD 3.3      -              -              -              -              -
minimum             PMINSB 4.1     PMINUB 2       PMINSW 2       PMINUW 4.1     PMINSD 4.1     PMINUD 4.1     -              -              MINPS 1        MINPD 2
maximum             PMAXSB 4.1     PMAXUB 2       PMAXSW 2       PMAXUW 4.1     PMAXSD 4.1     PMAXUD 4.1     -              -              MAXPS 1        MAXPD 2
approx reciprocal   -              -              -              -              -              -              -              -              RCPPS 1        -
square root         -              -              -              -              -              -              -              -              SQRTPS 1       SQRTPD 2
comparison          PCMPxxB† 2     PCMPxxB† 2     PCMPxxW† 2     PCMPxxW† 2     PCMPxxD† 2     PCMPxxD† 2     PCMPxxQ† 4.2   PCMPxxQ† 4.2   CMPxxxPS‡ 1    CMPxxxPD‡ 2
bitwise and         PAND 2         PAND 2         PAND 2         PAND 2         PAND 2         PAND 2         PAND 2         PAND 2         ANDPS 1        ANDPD 2
bitwise and-not     PANDN 2        PANDN 2        PANDN 2        PANDN 2        PANDN 2        PANDN 2        PANDN 2        PANDN 2        ANDNPS 1       ANDNPD 2
bitwise or          POR 2          POR 2          POR 2          POR 2          POR 2          POR 2          POR 2          POR 2          ORPS 1         ORPD 2
bitwise xor         PXOR 2         PXOR 2         PXOR 2         PXOR 2         PXOR 2         PXOR 2         PXOR 2         PXOR 2         XORPS 1        XORPD 2
bitwise test        PTEST 4.1      PTEST 4.1      PTEST 4.1      PTEST 4.1      PTEST 4.1      PTEST 4.1      PTEST 4.1      PTEST 4.1      -              -
load mask           PMOVMSKB 2     PMOVMSKB 2     PMOVMSKB 2     PMOVMSKB 2     PMOVMSKB 2     PMOVMSKB 2     PMOVMSKB 2     PMOVMSKB 2     MOVMSKPS 1     MOVMSKPD 2
shift left          -              -              PSLLW 2        PSLLW 2        PSLLD 2        PSLLD 2        PSLLQ 2        PSLLQ 2        -              -
shift right         -              -              PSRAW 2        PSRLW 2        PSRAD 2        PSRLD 2        -              PSRLQ 2        -              -
unpack low          PUNPCKLBW 2    PUNPCKLBW 2    PUNPCKLWD 2    PUNPCKLWD 2    PUNPCKLDQ 2    PUNPCKLDQ 2    PUNPCKLQDQ 2   PUNPCKLQDQ 2   UNPCKLPS 1     UNPCKLPD 2
unpack high         PUNPCKHBW 2    PUNPCKHBW 2    PUNPCKHWD 2    PUNPCKHWD 2    PUNPCKHDQ 2    PUNPCKHDQ 2    PUNPCKHQDQ 2   PUNPCKHQDQ 2   UNPCKHPS 1     UNPCKHPD 2
static shuffle§     -              -              PSHUF[HL]W‖ 2  PSHUF[HL]W‖ 2  PSHUFD 2       PSHUFD 2       PSHUFD 2       PSHUFD 2       SHUFPS¶ 1      SHUFPD¶ 2
variable shuffle    PSHUFB 3.3     PSHUFB 3.3     PSHUFB 3.3     PSHUFB 3.3     PSHUFB 3.3     PSHUFB 3.3     PSHUFB 3.3     PSHUFB 3.3     -              -
static blend        -              -              PBLENDW 4.1    PBLENDW 4.1    PBLENDW 4.1    PBLENDW 4.1    PBLENDW 4.1    PBLENDW 4.1    BLENDPS 4.1    BLENDPD 4.1
variable blend#     PBLENDVB 4.1   PBLENDVB 4.1   PBLENDVB 4.1   PBLENDVB 4.1   PBLENDVB 4.1   PBLENDVB 4.1   PBLENDVB 4.1   PBLENDVB 4.1   BLENDVPS 4.1   BLENDVPD 4.1

Notes:

  • The SSE2 integer SIMD mnemonics are the same as the MMX mnemonics; however, using them with SSE XMM registers rather than MMX MM registers generates different instructions.
  • There are many more instructions that do not fit in this grid, but these are the most important ones to know.
  • * Every move instruction has an aligned (A) and unaligned (U) form. Aligned is faster, but will trap if your address is not a multiple of 16 bytes.
  • † Equality (PCMPEQ_) and signed greater-than (PCMPGT_) operations are provided for integer vectors. For signed less-than, swap the operands. For signed less-than-or-equal or greater-than-or-equal, perform the PCMPGT comparison in the opposite direction and invert the mask, either by using the "Bitwise NOT" idiom below or by reversing subsequent logic. For unsigned tests, first bias both inputs by PXORing 0x80, 0x8000, or 0x80000000 into each component, then use the signed comparison. Of the 64-bit forms, PCMPEQQ needs only SSE 4.1; PCMPGTQ needs SSE 4.2.
  • ‡ The following floating-point comparison operations are provided: EQ, LT, LE, UNORD, NEQ, NLT, NLE, and ORD. To get greater-than comparisons, swap the operands. LT, LE, NLT, and NLE are signaling comparisons and will raise the Invalid floating-point exception if either input is a NaN, quiet or signaling; the others raise it only for signaling NaNs.
  • § Some shuffle patterns for some vector types can be achieved with specialized instructions that may have better performance or code size than the generalized shuffle instruction. See "Special shuffles" under each vector type below.
  • ‖ 16-bit element shuffles only shuffle half of the register at a time.
  • ¶ Floating-point shuffles select the high element(s) from the source register and the low element(s) from the destination. To shuffle a single vector, use the same register for source and destination.
  • # Variable blends take the blend mask from XMM0 as an implicit operand.
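
The unsigned-comparison trick from note † can be sketched in C with SSE2 compiler intrinsics. This is only an illustration: the helper name cmpgt_epu32_sse2 is made up for this page, and the same pattern applies to 8-bit and 16-bit elements with the 0x80 and 0x8000 biases.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <limits.h>     /* INT_MIN, whose bit pattern is 0x80000000 */

/* Hypothetical helper: per-element unsigned a > b for uint-4 vectors.
   XORing both inputs with 0x80000000 maps unsigned order onto signed
   order, so the signed PCMPGTD then gives the unsigned answer. */
static __m128i cmpgt_epu32_sse2(__m128i a, __m128i b)
{
    const __m128i bias = _mm_set1_epi32(INT_MIN);
    return _mm_cmpgt_epi32(_mm_xor_si128(a, bias),   /* pxor + pxor */
                           _mm_xor_si128(b, bias));  /* pcmpgtd     */
}
```

Each element of the result is all ones where a > b as unsigned integers and zero elsewhere, like any other comparison mask.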

Idioms


all integer types

Clear all bits

pxor xmm0, xmm0

Set all bits

pcmpeqb xmm0, xmm0

Bitwise NOT

pcmpeqb xmm0, xmm0 ; xmm0 = all ones
pxor xmm0, xmm1    ; xmm0 = NOT xmm1

Blend without SSE 4.1

; mask is in xmm0 (destroyed)
; if-true is in xmm1 (destroyed)
; if-false is in xmm2
; blended result is in xmm0
pand   xmm1, xmm0
pandn  xmm0, xmm2
por    xmm0, xmm1
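
The same three-instruction blend, sketched in C with SSE2 intrinsics. The function name blend_sse2 is hypothetical; the intrinsics map one-to-one onto pand, pandn, and por.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: per-bit select, mask ? if_true : if_false.
   The mask is normally a comparison result (all ones or all zeros
   per element). */
static __m128i blend_sse2(__m128i mask, __m128i if_true, __m128i if_false)
{
    __m128i t = _mm_and_si128(if_true, mask);      /* pand:  if_true & mask      */
    __m128i f = _mm_andnot_si128(mask, if_false);  /* pandn: (~mask) & if_false  */
    return _mm_or_si128(t, f);                     /* por                        */
}
```

Unlike the assembly version, the compiler picks the registers, so no input is visibly destroyed.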

"Any" test

Using PTEST: (requires SSE 4.1)

ptest xmm0, xmm0
jnz any

Using PMOVMSKB:

pmovmskb eax, xmm0
test eax, eax
jnz any

"None" test

Using PTEST: (requires SSE 4.1)

ptest xmm0, xmm0
jz none

Using PMOVMSKB:

pmovmskb eax, xmm0
test eax, eax
jz none

"All" test

Using PTEST: (requires SSE 4.1)

pcmpeqb xmm1, xmm1
ptest xmm0, xmm1
jc all

Using PMOVMSKB:

pmovmskb eax, xmm0
not ax
test ax, ax
jz all
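
The PMOVMSKB forms of the three tests can be sketched in C with SSE2 intrinsics. The helper names are made up for illustration, and the sketch assumes the input is a comparison mask, since PMOVMSKB only collects the top bit of each byte.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helpers. Each assumes every byte of `mask` is 0x00 or
   0xff, as produced by the PCMPxx instructions. _mm_movemask_epi8 is
   PMOVMSKB: it packs the 16 byte sign bits into the low 16 bits. */
static int mask_any(__m128i mask)  { return _mm_movemask_epi8(mask) != 0; }
static int mask_none(__m128i mask) { return _mm_movemask_epi8(mask) == 0; }
static int mask_all(__m128i mask)  { return _mm_movemask_epi8(mask) == 0xffff; }
```

Comparing against 0xffff plays the same role as the not/test pair in the assembly "all" test.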

int-4

Select nth component

Directly to a GPR: (requires SSE 4.1)

pextrd eax, xmm0, n

To low element of an XMM register:

pshufd xmm0, xmm1, n

Use movd eax, xmm0 to move the selected element to a GPR.

Gather four integers into a vector

Directly from GPRs: (requires SSE 4.1)

pinsrd xmm0, r8d, 0
pinsrd xmm0, r9d, 1
pinsrd xmm0, r10d, 2
pinsrd xmm0, r11d, 3

From low elements of XMM registers:

punpckldq xmm0, xmm1  ; xmm0 => 0 1 ? ?
punpckldq xmm2, xmm3  ; xmm2 => 2 3 ? ?
punpcklqdq xmm0, xmm2 ; xmm0 => 0 1 2 3

Use movd xmm0, eax to load the low element from a GPR.
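
As a rough C counterpart of the SSE2 gather above, using intrinsics. The function name gather4_epi32 is hypothetical; a compiler given _mm_setr_epi32 may well emit a similar sequence on its own.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: build the int-4 vector { a, b, c, d } from four
   scalars using only movd and unpack instructions. */
static __m128i gather4_epi32(int a, int b, int c, int d)
{
    __m128i x0 = _mm_cvtsi32_si128(a);        /* movd: a 0 0 0 */
    __m128i x1 = _mm_cvtsi32_si128(b);
    __m128i x2 = _mm_cvtsi32_si128(c);
    __m128i x3 = _mm_cvtsi32_si128(d);
    __m128i lo = _mm_unpacklo_epi32(x0, x1);  /* punpckldq:  a b 0 0 */
    __m128i hi = _mm_unpacklo_epi32(x2, x3);  /* punpckldq:  c d 0 0 */
    return _mm_unpacklo_epi64(lo, hi);        /* punpcklqdq: a b c d */
}
```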

Horizontal add without SSSE3

; input is in xmm1 (destroyed)
; every element of xmm0 receives the sum
pshufd xmm0, xmm1, 0xb1 ; 1 0 3 2
paddd xmm0, xmm1        ; 0+1 0+1 2+3 2+3
pshufd xmm1, xmm0, 0x0a ; 2 2 0 0
paddd xmm0, xmm1

Replace paddd with any associative, commutative 32-bit integer operation (for example pand, or pmaxsd with SSE 4.1) to perform that operation horizontally.
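
A sketch of the same shuffle-and-add reduction in C with SSE2 intrinsics. The function name hadd_all_epi32 is invented for this example.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: returns a vector whose four elements all hold
   v[0] + v[1] + v[2] + v[3], using two pshufd/paddd pairs. */
static __m128i hadd_all_epi32(__m128i v)
{
    /* swap neighbours (1 0 3 2) and add: 0+1 0+1 2+3 2+3 */
    __m128i s = _mm_add_epi32(_mm_shuffle_epi32(v, 0xb1), v);
    /* bring the other pair's sum alongside (2 2 0 0) and add again */
    return _mm_add_epi32(s, _mm_shuffle_epi32(s, 0x0a));
}
```

Use _mm_cvtsi128_si32 on the result (a movd) if only the scalar sum is needed.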

Special shuffles

pshufd reads its entire result from the source register, unlike shufps and shufpd, which take half from the destination and half from the source, so it needs no initial movdqa to be nondestructive. The instructions below overwrite their input and would need that extra move to preserve it, so they are a size win over pshufd only when you are destructively replacing your input.

order    code
0 0 1 1  punpckldq dst, dst
2 2 3 3  punpckhdq dst, dst
0 1 0 1  punpcklqdq dst, dst
2 3 2 3  punpckhqdq dst, dst

float-4

Clear all bits

xorps xmm0, xmm0

Set all bits

cmpnltps xmm0, xmm0

cmpeqps cannot be used because NaN != NaN. Note that cmpnltps will set off the Invalid floating-point exception if the register being filled contains NaNs. If you care about the floating-point environment state, clear the register first:

xorps xmm0, xmm0
cmpeqps xmm0, xmm0

Bitwise NOT

cmpnltps xmm0, xmm0
xorps xmm0, xmm1

Select nth component

Element 0 is already in the low position, so a plain copy suffices:

movss dst, src

Element 1: (requires SSE3)

movshdup dst, src

Element 2:

movhlps dst, src

Element 3:

movaps dst, src
shufps dst, dst, 0xff ; 3 3 3 3

Gather four floats into a vector

unpcklps xmm0, xmm1 ; xmm0 => 0 1 ? ?
unpcklps xmm2, xmm3 ; xmm2 => 2 3 ? ?
movlhps  xmm0, xmm2 ; xmm0 => 0 1 2 3

Broadcast float into four components

movaps dst, src
shufps dst, dst, n

Where n selects the element:

element  n
0        0x00
1        0x55
2        0xaa
3        0xff

Absolute value

Using a vector constant negative_zeroes_f = { -0.0f, -0.0f, -0.0f, -0.0f } from memory:

movaps xmm0, negative_zeroes_f
andnps xmm0, xmm1

Negation

Using a vector constant negative_zeroes_f = { -0.0f, -0.0f, -0.0f, -0.0f } from memory:

movaps xmm0, negative_zeroes_f
xorps xmm0, xmm1
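
The absolute value and negation idioms above both rely on -0.0f having only the sign bit set. A minimal C sketch with SSE intrinsics follows; the names abs_ps_mask and neg_ps_mask are hypothetical, and it assumes the compiler keeps the sign of the -0.0f literal, which may not hold under aggressive fast-math options.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: clear the sign bit of each element (andnps). */
static __m128 abs_ps_mask(__m128 v)
{
    const __m128 sign = _mm_set1_ps(-0.0f);  /* 0x80000000 in every element */
    return _mm_andnot_ps(sign, v);           /* (~sign) & v */
}

/* Hypothetical helper: flip the sign bit of each element (xorps). */
static __m128 neg_ps_mask(__m128 v)
{
    const __m128 sign = _mm_set1_ps(-0.0f);
    return _mm_xor_ps(sign, v);
}
```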

Horizontal add without SSE3

; input is in xmm0; xmm1 is scratch
; every element of xmm0 receives the sum
movaps xmm1, xmm0
shufps xmm0, xmm1, 0xb1 ; 1 0 3 2
addps xmm0, xmm1        ; 0+1 0+1 2+3 2+3
movaps xmm1, xmm0
shufps xmm0, xmm0, 0x0a ; 2 2 0 0
addps xmm0, xmm1

Replace addps with any associative, commutative float-4 instruction (for example mulps, minps, or maxps) to perform that operation horizontally.
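
The reduction pattern, and the idea of swapping in another operation, can be sketched in C with SSE intrinsics. Both function names are invented for illustration.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: every element of the result is the sum of the
   four elements of v. */
static __m128 hadd_all_ps(__m128 v)
{
    __m128 s = _mm_add_ps(_mm_shuffle_ps(v, v, 0xb1), v);  /* 0+1 0+1 2+3 2+3 */
    return _mm_add_ps(_mm_shuffle_ps(s, s, 0x0a), s);      /* add 2 2 0 0     */
}

/* Same shape with maxps in place of addps: a horizontal maximum. */
static __m128 hmax_all_ps(__m128 v)
{
    __m128 s = _mm_max_ps(_mm_shuffle_ps(v, v, 0xb1), v);
    return _mm_max_ps(_mm_shuffle_ps(s, s, 0x0a), s);
}
```

_mm_cvtss_f32 extracts the low element when a scalar result is wanted.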

Blend without SSE 4.1

; mask is in xmm0 (destroyed)
; if-true is in xmm1 (destroyed)
; if-false is in xmm2
; blended result is in xmm0
andps  xmm1, xmm0
andnps xmm0, xmm2
orps   xmm0, xmm1

"Any" test

movmskps eax, xmm0
test eax, eax
jnz any

"None" test

movmskps eax, xmm0
test eax, eax
jz none

"All" test

movmskps eax, xmm0
not eax
test eax, 0xf
jz all

Special shuffles

order    code
0 0 2 2  movsldup dst, src
1 1 3 3  movshdup dst, src
0 1 0 1  movlhps dst, dst
2 3 2 3  movhlps dst, dst
0 0 1 1  unpcklps dst, dst
2 2 3 3  unpckhps dst, dst

double-2

Clear all bits

xorpd xmm0, xmm0

Set all bits

cmpnltpd xmm0, xmm0

cmpeqpd cannot be used because NaN != NaN. Note that cmpnltpd will set off the Invalid floating-point exception if the register being filled contains NaNs. If you care about the floating-point environment state, clear the register first:

xorpd xmm0, xmm0
cmpeqpd xmm0, xmm0

Bitwise NOT

cmpnltpd xmm0, xmm0
xorpd xmm0, xmm1

Select nth component

Element 0 is already in the low position, so a plain copy suffices:

movsd xmm0, xmm1

Element 1:

movapd xmm0, xmm1
unpckhpd xmm0, xmm0

Gather two doubles into a vector

unpcklpd xmm0, xmm1

Broadcast double into two components

Element 0: (requires SSE3)

movddup xmm0, xmm1

Element 1:

movapd xmm0, xmm1
unpckhpd xmm0, xmm0

Absolute value

Using a vector constant negative_zeroes_d = { -0.0, -0.0 } from memory:

movapd xmm0, negative_zeroes_d
andnpd xmm0, xmm1

Negation

Using a vector constant negative_zeroes_d = { -0.0, -0.0 } from memory:

movapd xmm0, negative_zeroes_d
xorpd xmm0, xmm1

Horizontal add without SSE3

movapd xmm1, xmm0
unpckhpd xmm1, xmm1
addsd xmm0, xmm1

Replace addsd with any scalar double-precision instruction (for example mulsd, minsd, or maxsd) to perform that operation horizontally. The result is in the low element of xmm0.
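
For comparison, a C sketch of the same two-element reduction with SSE2 intrinsics. The function name hadd_pd_sse2 is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: returns v[0] + v[1] as a scalar double. */
static double hadd_pd_sse2(__m128d v)
{
    __m128d hi = _mm_unpackhi_pd(v, v);        /* unpckhpd: v[1] v[1]     */
    return _mm_cvtsd_f64(_mm_add_sd(v, hi));   /* addsd, then read low    */
}
```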

Blend without SSE 4.1

; mask is in xmm0 (destroyed)
; if-true is in xmm1 (destroyed)
; if-false is in xmm2
; blended result is in xmm0
andpd  xmm1, xmm0
andnpd xmm0, xmm2
orpd   xmm0, xmm1

"Any" test

movmskpd eax, xmm0
test eax, eax
jnz any

"None" test

movmskpd eax, xmm0
test eax, eax
jz none

"All" test

movmskpd eax, xmm0
not eax
test eax, 0x3
jz all

Special shuffles

order  code
0 0    unpcklpd dst, dst or movddup dst, src
1 1    unpckhpd dst, dst

References

For full details, consult Intel's or AMD's instruction set reference documentation.

This revision created on Tue, 29 Sep 2009 00:29:54 by jckarter