SSE

This is a quick reference for Intel's Streaming SIMD Extensions. Feel free to make additions or corrections!

Vector types

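The vector types here are named with the same convention as in Factor's SIMD library: an element type followed by the element count, so char-16 is sixteen signed 8-bit integers and float-4 is four single-precision floats. Every type fills one 128-bit XMM register: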

  • char-16
  • uchar-16
  • short-8
  • ushort-8
  • int-4
  • uint-4
  • longlong-2
  • ulonglong-2
  • float-4
  • double-2

Instruction set

The number next to each instruction is the minimum SSE version it requires; 3.3 denotes SSSE3:

operation           char-16       uchar-16      short-8       ushort-8      int-4         uint-4        longlong-2    ulonglong-2   float-4       double-2
move*               MOVDQ[AU] 2   MOVDQ[AU] 2   MOVDQ[AU] 2   MOVDQ[AU] 2   MOVDQ[AU] 2   MOVDQ[AU] 2   MOVDQ[AU] 2   MOVDQ[AU] 2   MOV[AU]PS 1   MOV[AU]PD 2
add                 PADDB 2       PADDB 2       PADDW 2       PADDW 2       PADDD 2       PADDD 2       PADDQ 2       PADDQ 2       ADDPS 1       ADDPD 2
subtract            PSUBB 2       PSUBB 2       PSUBW 2       PSUBW 2       PSUBD 2       PSUBD 2       PSUBQ 2       PSUBQ 2       SUBPS 1       SUBPD 2
saturated add       PADDSB 2      PADDUSB 2     PADDSW 2      PADDUSW 2     -             -             -             -             -             -
saturated subtract  PSUBSB 2      PSUBUSB 2     PSUBSW 2      PSUBUSW 2     -             -             -             -             -             -
add-subtract        -             -             -             -             -             -             -             -             ADDSUBPS 3    ADDSUBPD 3
horizontal add      -             -             PHADDW 3.3    PHADDW 3.3    PHADDD 3.3    PHADDD 3.3    -             -             HADDPS 3      HADDPD 3
multiply            -             -             PMULLW 2      PMULLW 2      PMULLD 4.1    PMULLD 4.1    -             -             MULPS 1       MULPD 2
divide              -             -             -             -             -             -             -             -             DIVPS 1       DIVPD 2
absolute value      PABSB 3.3     -             PABSW 3.3     -             PABSD 3.3     -             -             -             -             -
minimum             PMINSB 4.1    PMINUB 2      PMINSW 2      PMINUW 4.1    PMINSD 4.1    PMINUD 4.1    -             -             MINPS 1       MINPD 2
maximum             PMAXSB 4.1    PMAXUB 2      PMAXSW 2      PMAXUW 4.1    PMAXSD 4.1    PMAXUD 4.1    -             -             MAXPS 1       MAXPD 2
approx reciprocal   -             -             -             -             -             -             -             -             RCPPS 1       -
square root         -             -             -             -             -             -             -             -             SQRTPS 1      SQRTPD 2
comparison          PCMPxxB† 2    PCMPxxB† 2    PCMPxxW† 2    PCMPxxW† 2    PCMPxxD† 2    PCMPxxD† 2    -             -             CMPxxxPS‡ 1   CMPxxxPD‡ 2
bitwise and         PAND 2        PAND 2        PAND 2        PAND 2        PAND 2        PAND 2        PAND 2        PAND 2        ANDPS 1       ANDPD 2
bitwise and-not     PANDN 2       PANDN 2       PANDN 2       PANDN 2       PANDN 2       PANDN 2       PANDN 2       PANDN 2       ANDNPS 1      ANDNPD 2
bitwise or          POR 2         POR 2         POR 2         POR 2         POR 2         POR 2         POR 2         POR 2         ORPS 1        ORPD 2
bitwise xor         PXOR 2        PXOR 2        PXOR 2        PXOR 2        PXOR 2        PXOR 2        PXOR 2        PXOR 2        XORPS 1       XORPD 2
load mask           PMOVMSKB 2    PMOVMSKB 2    PMOVMSKB 2    PMOVMSKB 2    PMOVMSKB 2    PMOVMSKB 2    PMOVMSKB 2    PMOVMSKB 2    MOVMSKPS 1    MOVMSKPD 2
shift left          -             -             PSLLW 2       PSLLW 2       PSLLD 2       PSLLD 2       PSLLQ 2       PSLLQ 2       -             -
shift right         -             -             PSRAW 2       PSRLW 2       PSRAD 2       PSRLD 2       -             PSRLQ 2       -             -
unpack low          PUNPCKLBW 2   PUNPCKLBW 2   PUNPCKLWD 2   PUNPCKLWD 2   PUNPCKLDQ 2   PUNPCKLDQ 2   PUNPCKLQDQ 2  PUNPCKLQDQ 2  UNPCKLPS 1    UNPCKLPD 2
unpack high         PUNPCKHBW 2   PUNPCKHBW 2   PUNPCKHWD 2   PUNPCKHWD 2   PUNPCKHDQ 2   PUNPCKHDQ 2   PUNPCKHQDQ 2  PUNPCKHQDQ 2  UNPCKHPS 1    UNPCKHPD 2
static shuffle§     -             -             PSHUF[HL]W‖ 2 PSHUF[HL]W‖ 2 PSHUFD 2      PSHUFD 2      PSHUFD 2      PSHUFD 2      SHUFPS¶ 1     SHUFPD¶ 2
variable shuffle    PSHUFB 3.3    PSHUFB 3.3    PSHUFB 3.3    PSHUFB 3.3    PSHUFB 3.3    PSHUFB 3.3    PSHUFB 3.3    PSHUFB 3.3    -             -
static blend        -             -             PBLENDW 4.1   PBLENDW 4.1   PBLENDW 4.1   PBLENDW 4.1   PBLENDW 4.1   PBLENDW 4.1   BLENDPS 4.1   BLENDPD 4.1
variable blend#     PBLENDVB 4.1  PBLENDVB 4.1  PBLENDVB 4.1  PBLENDVB 4.1  PBLENDVB 4.1  PBLENDVB 4.1  PBLENDVB 4.1  PBLENDVB 4.1  BLENDVPS 4.1  BLENDVPD 4.1

Notes:

  • The SSE2 integer SIMD mnemonics are the same as the MMX mnemonics; however, using them with SSE XMM registers rather than MMX MM registers generates different instructions.
  • There are many more instructions that do not fit in this grid, but these are the most important ones to know.
  • * Every move instruction has an aligned (A) and unaligned (U) form. Aligned is faster, but will trap if your address is not a multiple of 16 bytes.
  • † Equality (PCMPEQ_) and signed greater-than (PCMPGT_) operations are provided for integer vectors. For signed less-than, swap the operands. For signed less/greater-than-or-equal, perform the PCMPEQ and PCMPGT comparisons and POR the results together. For unsigned tests, first bias both inputs by PXORing every component with 0x80, 0x8000, or 0x80000000 (for 8-, 16-, or 32-bit components respectively), then use the signed comparison.
  • ‡ The following floating-point comparison predicates are provided: EQ, LT, LE, UNORD, NEQ, NLT, NLE, and ORD. To get greater-than comparisons, swap the operands. LT, LE, NLT, and NLE are signaling comparisons and will raise the Invalid floating-point exception if either input is a NaN, quiet or signaling; the other predicates raise it only for signaling NaNs.
  • § Some shuffle patterns for some vector types can be achieved with specialized instructions that may have better performance or code size than the generalized shuffle instruction. See "Special shuffles" under each vector type below.
  • ‖ 16-bit element shuffles only shuffle half of the register at a time.
  • ¶ Floating-point shuffles select the high element(s) from the source register and the low element(s) from the destination. To shuffle a single vector, use the same register for source and destination.
  • # Variable blends take the blend mask from XMM0 as an implicit operand.
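
The unsigned-comparison trick from the † note can be sketched in C with SSE2 intrinsics. This is only an illustration: the helper name cmpgt_epu32 is made up here and is not an Intel intrinsic.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Unsigned 32-bit greater-than: each lane becomes 0xFFFFFFFF where a > b,
   0 otherwise. cmpgt_epu32 is a hypothetical name chosen for this sketch. */
static __m128i cmpgt_epu32(__m128i a, __m128i b)
{
    /* Flipping the sign bit maps unsigned order onto signed order. */
    const __m128i bias = _mm_set1_epi32(INT32_MIN);  /* 0x80000000 per lane */
    __m128i sa = _mm_xor_si128(a, bias);             /* PXOR */
    __m128i sb = _mm_xor_si128(b, bias);             /* PXOR */
    return _mm_cmpgt_epi32(sa, sb);                  /* PCMPGTD */
}
```

The 8-bit and 16-bit cases follow the same shape with _mm_set1_epi8 / _mm_cmpgt_epi8 and _mm_set1_epi16 / _mm_cmpgt_epi16.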

Idioms

all integer types

Blend without SSE 4.1

; mask is in xmm0 (destroyed)
; if-true is in xmm1 (destroyed)
; if-false is in xmm2
; blended result is in xmm0
pand   xmm1, xmm0
pandn  xmm0, xmm2
por    xmm0, xmm1
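
For readers working in C, the same three-instruction blend can be written with SSE2 intrinsics. A minimal sketch; blend_si128 is a name invented for this example, and the mask is assumed to be all-ones or all-zeros per element, as a PCMPxx comparison would produce.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Select if_true where mask bits are set, if_false elsewhere.
   blend_si128 is a hypothetical helper name. */
static __m128i blend_si128(__m128i mask, __m128i if_true, __m128i if_false)
{
    __m128i t = _mm_and_si128(if_true, mask);      /* PAND:  if_true & mask   */
    __m128i f = _mm_andnot_si128(mask, if_false);  /* PANDN: ~mask & if_false */
    return _mm_or_si128(t, f);                     /* POR */
}
```

Note that _mm_andnot_si128 complements its first argument, matching PANDN, which complements its destination operand.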

int-4

Select nth component

Directly to a GPR: (requires SSE 4.1)

pextrd eax, xmm0, n

To low element of an XMM register:

pshufd xmm0, xmm1, n

Use movd eax, xmm0 to move the selected element to a GPR.
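
The SSE2 route (PSHUFD followed by MOVD) looks roughly like this with C intrinsics. The macro name EXTRACT_EPI32 is an assumption made for this sketch; a macro is used because the shuffle selector must be a compile-time constant.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Move element n (a constant 0..3) to the low lane with PSHUFD,
   then copy the low lane to an int with MOVD.
   EXTRACT_EPI32 is a hypothetical name, not part of any library. */
#define EXTRACT_EPI32(v, n) \
    _mm_cvtsi128_si32(_mm_shuffle_epi32((v), (n)))
```

For example, EXTRACT_EPI32(v, 2) reads the third 32-bit element of v.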

Gather four integers into a vector

Directly from GPRs: (requires SSE 4.1)

pinsrd xmm0, r8d, 0
pinsrd xmm0, r9d, 1
pinsrd xmm0, r10d, 2
pinsrd xmm0, r11d, 3

From low elements of XMM registers:

punpckldq xmm0, xmm1  ; xmm0 => ? ? 1 0
punpckldq xmm2, xmm3  ; xmm2 => ? ? 3 2
punpcklqdq xmm0, xmm2 ; xmm0 => 3 2 1 0

Use movd xmm0, eax to load the low element from a GPR.
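
As a sketch in C, the unpack-based gather could be written as follows with SSE2 intrinsics. The function name gather4_epi32 is invented for illustration; a compiler may well pick a different instruction sequence.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Build the vector (a, b, c, d), lowest element first, from four scalars.
   gather4_epi32 is a hypothetical helper name. */
static __m128i gather4_epi32(int a, int b, int c, int d)
{
    __m128i x0 = _mm_cvtsi32_si128(a);        /* MOVD: a in the low element */
    __m128i x1 = _mm_cvtsi32_si128(b);
    __m128i x2 = _mm_cvtsi32_si128(c);
    __m128i x3 = _mm_cvtsi32_si128(d);
    __m128i lo = _mm_unpacklo_epi32(x0, x1);  /* PUNPCKLDQ:  ? ? 1 0 */
    __m128i hi = _mm_unpacklo_epi32(x2, x3);  /* PUNPCKLDQ:  ? ? 3 2 */
    return _mm_unpacklo_epi64(lo, hi);        /* PUNPCKLQDQ: 3 2 1 0 */
}
```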

float-4

Select nth component

Element 0 is already in the low position, so no shuffle is needed. To copy it into the low element of another register:

movss dst, src

Element 1: (requires SSE3)

movshdup dst, src

Element 2:

movhlps dst, src

Element 3:

movaps dst, src
shufps dst, dst, 0xff ; 3 3 3 3

Gather four floats into a vector

unpcklps xmm0, xmm1 ; xmm0 => ? ? 1 0
unpcklps xmm2, xmm3 ; xmm2 => ? ? 3 2
movlhps  xmm0, xmm2 ; xmm0 => 3 2 1 0

Broadcast float into four components

movaps dst, src
shufps dst, dst, n

Where n selects the element:

element  n
0        0x00
1        0x55
2        0xaa
3        0xff

Absolute value

Horizontal add with SSE2

; input is in xmm0; xmm1 is scratch
movaps xmm1, xmm0
shufps xmm0, xmm1, 0xb1 ; 1 0 3 2
addps xmm0, xmm1        ; 0+1 0+1 2+3 2+3
movaps xmm1, xmm0
shufps xmm0, xmm0, 0x0a ; 2 2 0 0
addps xmm0, xmm1        ; every element now holds 0+1+2+3
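
The same shuffle-and-add sequence can be sketched with C intrinsics, using the same selector constants. The name hsum_ps is made up for this example.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Sum the four floats of v; the total ends up in every element.
   hsum_ps is a hypothetical helper name. */
static __m128 hsum_ps(__m128 v)
{
    __m128 swapped = _mm_shuffle_ps(v, v, 0xb1);          /* SHUFPS: 1 0 3 2 */
    __m128 pairs   = _mm_add_ps(v, swapped);              /* 0+1 0+1 2+3 2+3 */
    __m128 crossed = _mm_shuffle_ps(pairs, pairs, 0x0a);  /* SHUFPS: 2 2 0 0 */
    return _mm_add_ps(pairs, crossed);                    /* total in all lanes */
}
```

_mm_cvtss_f32 can then pull the total out of the low element as a scalar float.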

Blend without SSE 4.1

; mask is in xmm0 (destroyed)
; if-true is in xmm1 (destroyed)
; if-false is in xmm2
; blended result is in xmm0
andps  xmm1, xmm0
andnps xmm0, xmm2
orps   xmm0, xmm1

Special shuffles

order    code
0 0 2 2  movsldup dst, src
1 1 3 3  movshdup dst, src
0 1 0 1  movlhps dst, dst
2 3 2 3  movhlps dst, dst
0 0 1 1  unpcklps dst, dst
2 2 3 3  unpckhps dst, dst

double-2

Select nth component

Gather two doubles into a vector

unpcklpd xmm0, xmm1

Broadcast double into two components

Element 0:

movddup xmm0, xmm1

Element 1:

movapd xmm0, xmm1
unpckhpd xmm0, xmm0

Absolute value

Horizontal add with SSE2

; input is in xmm0; xmm1 is scratch
movapd xmm1, xmm0
unpckhpd xmm1, xmm1 ; 1 1
addsd xmm0, xmm1    ; low element of xmm0 now holds 0+1
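
In C, the same idea might look like the sketch below, using SSE2 intrinsics. The name hsum_pd is an assumption for this example.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Return the sum of the two doubles in v.
   hsum_pd is a hypothetical helper name. */
static double hsum_pd(__m128d v)
{
    __m128d hi  = _mm_unpackhi_pd(v, v);  /* UNPCKHPD: 1 1 */
    __m128d sum = _mm_add_sd(v, hi);      /* ADDSD: low element = 0+1 */
    return _mm_cvtsd_f64(sum);            /* read the low element */
}
```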

Blend without SSE 4.1

; mask is in xmm0 (destroyed)
; if-true is in xmm1 (destroyed)
; if-false is in xmm2
; blended result is in xmm0
andpd  xmm1, xmm0
andnpd xmm0, xmm2
orpd   xmm0, xmm1

Special shuffles

order  code
0 0    unpcklpd dst, dst or movddup dst, src
1 1    unpckhpd dst, dst

References

For full details, consult Intel's or AMD's instruction set reference documentation.

This revision created on Mon, 28 Sep 2009 19:34:30 by jckarter