SSE

This is a quick reference for Intel's Streaming SIMD Extensions. Feel free to make additions or corrections!

Vector types

The vector types here are named with the same convention as in Factor's SIMD library. Each name gives the element type followed by the number of elements that fit in a 128-bit vector:

  • char-16
  • uchar-16
  • short-8
  • ushort-8
  • int-4
  • uint-4
  • longlong-2
  • ulonglong-2
  • float-4
  • double-2

Instruction set

The number next to each instruction is the SSE version that introduced it (3.3 denotes SSSE3):

| operation | char-16 | uchar-16 | short-8 | ushort-8 | int-4 | uint-4 | longlong-2 | ulonglong-2 | float-4 | double-2 |
|---|---|---|---|---|---|---|---|---|---|---|
| move | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOV[AU]PS 1 | MOV[AU]PD 2 |
| add | PADDB 2 | PADDB 2 | PADDW 2 | PADDW 2 | PADDD 2 | PADDD 2 | PADDQ 2 | PADDQ 2 | ADDPS 1 | ADDPD 2 |
| subtract | PSUBB 2 | PSUBB 2 | PSUBW 2 | PSUBW 2 | PSUBD 2 | PSUBD 2 | PSUBQ 2 | PSUBQ 2 | SUBPS 1 | SUBPD 2 |
| saturated add | PADDSB 2 | PADDUSB 2 | PADDSW 2 | PADDUSW 2 | | | | | | |
| saturated subtract | PSUBSB 2 | PSUBUSB 2 | PSUBSW 2 | PSUBUSW 2 | | | | | | |
| add-subtract | | | | | | | | | ADDSUBPS 3 | ADDSUBPD 3 |
| horizontal add | | | PHADDW 3.3 | PHADDW 3.3 | PHADDD 3.3 | PHADDD 3.3 | | | HADDPS 3 | HADDPD 3 |
| multiply | | | PMULLW 2 | PMULLW 2 | PMULLD 4.1 | PMULLD 4.1 | | | MULPS 1 | MULPD 2 |
| divide | | | | | | | | | DIVPS 1 | DIVPD 2 |
| absolute value | PABSB 3.3 | | PABSW 3.3 | | PABSD 3.3 | | | | | |
| minimum | PMINSB 4.1 | PMINUB 2 | PMINSW 2 | PMINUW 4.1 | PMINSD 4.1 | PMINUD 4.1 | | | MINPS 1 | MINPD 2 |
| maximum | PMAXSB 4.1 | PMAXUB 2 | PMAXSW 2 | PMAXUW 4.1 | PMAXSD 4.1 | PMAXUD 4.1 | | | MAXPS 1 | MAXPD 2 |
| approx reciprocal | | | | | | | | | RCPPS 1 | |
| square root | | | | | | | | | SQRTPS 1 | SQRTPD 2 |
| bitwise and | PAND 2 | PAND 2 | PAND 2 | PAND 2 | PAND 2 | PAND 2 | PAND 2 | PAND 2 | ANDPS 1 | ANDPD 2 |
| bitwise or | POR 2 | POR 2 | POR 2 | POR 2 | POR 2 | POR 2 | POR 2 | POR 2 | ORPS 1 | ORPD 2 |
| bitwise xor | PXOR 2 | PXOR 2 | PXOR 2 | PXOR 2 | PXOR 2 | PXOR 2 | PXOR 2 | PXOR 2 | XORPS 1 | XORPD 2 |
| shift left | | | PSLLW 2 | PSLLW 2 | PSLLD 2 | PSLLD 2 | PSLLQ 2 | PSLLQ 2 | | |
| shift right | | | PSRAW 2 | PSRLW 2 | PSRAD 2 | PSRLD 2 | | PSRLQ 2 | | |
| unpack low | PUNPCKLBW 2 | PUNPCKLBW 2 | PUNPCKLWD 2 | PUNPCKLWD 2 | PUNPCKLDQ 2 | PUNPCKLDQ 2 | PUNPCKLQDQ 2 | PUNPCKLQDQ 2 | UNPCKLPS 1 | UNPCKLPD 2 |
| unpack high | PUNPCKHBW 2 | PUNPCKHBW 2 | PUNPCKHWD 2 | PUNPCKHWD 2 | PUNPCKHDQ 2 | PUNPCKHDQ 2 | PUNPCKHQDQ 2 | PUNPCKHQDQ 2 | UNPCKHPS 1 | UNPCKHPD 2 |
| static shuffle | | | PSHUFHW/PSHUFLW 2 | PSHUFHW/PSHUFLW 2 | PSHUFD 2 | PSHUFD 2 | PSHUFD 2 | PSHUFD 2 | SHUFPS 1 | SHUFPD 2 |
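
To show how entries in the grid differ in practice, here is a minimal C sketch contrasting the plain add (PADDB) with the unsigned saturated add (PADDUSB) on uchar-16. It assumes a compiler that provides the SSE2 intrinsics in emmintrin.h; the function names add_wrap_u8 and add_sat_u8 are made up for this example.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Plain add of 16 unsigned bytes (PADDB): results wrap modulo 256. */
void add_wrap_u8(unsigned char out[16], const unsigned char a[16],
                 const unsigned char b[16]) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi8(va, vb));
}

/* Saturated add of 16 unsigned bytes (PADDUSB): results clamp at 255. */
void add_sat_u8(unsigned char out[16], const unsigned char a[16],
                const unsigned char b[16]) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_adds_epu8(va, vb));
}
```

With every byte of a set to 250 and every byte of b set to 10, the first function yields 4 in each lane and the second yields 255.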

Notes:

  • The SSE2 integer SIMD mnemonics are the same as the MMX mnemonics; however, using them with SSE XMM registers rather than MMX MM registers generates different instructions.
  • There are many more instructions that do not fit in this grid, but these are the most important ones to know.
  • Every move instruction has an aligned and an unaligned form. The aligned form is faster, but it faults if the address is not a multiple of 16 bytes.
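
The aligned and unaligned move forms map onto separate C intrinsics. The sketch below is only an illustration, assuming a compiler with xmmintrin.h; copy4_aligned and copy4_unaligned are hypothetical helper names.

```c
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Copies four floats with the aligned form (MOVAPS).
   Both pointers must be multiples of 16 bytes. */
void copy4_aligned(float *dst, const float *src) {
    _mm_store_ps(dst, _mm_load_ps(src));
}

/* Copies four floats with the unaligned form (MOVUPS).
   Any float address is acceptable. */
void copy4_unaligned(float *dst, const float *src) {
    _mm_storeu_ps(dst, _mm_loadu_ps(src));
}
```

Passing an address such as src + 1 is only safe with the unaligned variant.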

Idioms

int-4

Select nth component

Gather four integers into a vector

punpckldq xmm0, xmm1  ; xmm0 => ? ? 1 0
punpckldq xmm2, xmm3  ; xmm2 => ? ? 3 2
punpcklqdq xmm0, xmm2 ; xmm0 => 3 2 1 0
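
The same gather can be sketched with C intrinsics, assuming SSE2 support via emmintrin.h. The function name gather4_i32 is invented here, and the MOVD loads stand in for however the four integers reach the low lanes of xmm0 through xmm3 in the assembly above.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Builds the vector (a, b, c, d), with a in the lowest lane. */
__m128i gather4_i32(int a, int b, int c, int d) {
    __m128i x0 = _mm_cvtsi32_si128(a);   /* MOVD: a in lane 0, others zero */
    __m128i x1 = _mm_cvtsi32_si128(b);
    __m128i x2 = _mm_cvtsi32_si128(c);
    __m128i x3 = _mm_cvtsi32_si128(d);
    x0 = _mm_unpacklo_epi32(x0, x1);     /* PUNPCKLDQ: low lanes hold a, b */
    x2 = _mm_unpacklo_epi32(x2, x3);     /* PUNPCKLDQ: low lanes hold c, d */
    return _mm_unpacklo_epi64(x0, x2);   /* PUNPCKLQDQ: a, b, c, d */
}
```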

float-4

Select nth component

Gather four floats into a vector

movss dst, src1      ; dst  => ? ? ? 1
unpcklps dst, src2   ; dst  => ? ? 2 1
unpcklps src3, src4  ; src3 => ? ? 4 3 (src3 is overwritten)
movlhps dst, src3    ; dst  => 4 3 2 1
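
A rough C equivalent, assuming the SSE1 intrinsics from xmmintrin.h; gather4_f32 is a name chosen only for this sketch.

```c
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Builds the vector (a, b, c, d), with a in the lowest lane. */
__m128 gather4_f32(float a, float b, float c, float d) {
    /* MOVSS + UNPCKLPS: low two lanes hold a, b */
    __m128 lo = _mm_unpacklo_ps(_mm_set_ss(a), _mm_set_ss(b));
    /* MOVSS + UNPCKLPS: low two lanes hold c, d */
    __m128 hi = _mm_unpacklo_ps(_mm_set_ss(c), _mm_set_ss(d));
    /* MOVLHPS: low half of hi becomes the high half of the result */
    return _mm_movelh_ps(lo, hi);
}
```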

Broadcast float into four components

movss dst, src        ; dst => ? ? ? s
shufps dst, dst, 0x0  ; dst => s s s s
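
In C the broadcast might look like the following sketch, assuming xmmintrin.h; broadcast_f32 is a hypothetical name. (The intrinsic _mm_set1_ps gives the same result and lets the compiler pick the instructions.)

```c
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Copies x into all four lanes using the MOVSS + SHUFPS idiom. */
__m128 broadcast_f32(float x) {
    __m128 v = _mm_set_ss(x);           /* MOVSS: x in lane 0 */
    return _mm_shuffle_ps(v, v, 0x00);  /* SHUFPS: select lane 0 four times */
}
```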

Absolute value

Horizontal add with SSE1

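movaps xmm1, xmm0        ; xmm1 => d c b a
shufps xmm0, xmm1, 0xb1  ; xmm0 => c d a b
addps xmm0, xmm1         ; xmm0 => c+d c+d a+b a+b
movaps xmm1, xmm0
shufps xmm0, xmm0, 0x0a  ; xmm0 => a+b a+b c+d c+d
addps xmm0, xmm1         ; xmm0 => a+b+c+d in every lane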
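
The sequence above leaves the sum of all four components in every lane. A C sketch of the same steps, reusing the shuffle constants 0xb1 and 0x0a from the assembly and assuming xmmintrin.h (hsum_all_f32 is an illustrative name):

```c
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Returns a vector whose four lanes all hold v0 + v1 + v2 + v3. */
__m128 hsum_all_f32(__m128 v) {
    /* Swap neighbours within each pair: (v1, v0, v3, v2). */
    __m128 swapped = _mm_shuffle_ps(v, v, 0xb1);
    /* Pairwise sums: (v0+v1, v0+v1, v2+v3, v2+v3). */
    __m128 pairs = _mm_add_ps(v, swapped);
    /* Exchange the two halves: (v2+v3, v2+v3, v0+v1, v0+v1). */
    __m128 crossed = _mm_shuffle_ps(pairs, pairs, 0x0a);
    /* Every lane now holds the full sum. */
    return _mm_add_ps(pairs, crossed);
}
```

If only a scalar is needed, _mm_cvtss_f32 extracts lane 0 of the result.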

double-2

Select nth component

Gather two doubles into a vector

movsd dst, src1     ; dst => ? 1
unpcklpd dst, src2  ; dst => 2 1

Broadcast double into two components

movsd dst, src     ; dst => ? s
unpcklpd dst, dst  ; dst => s s
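
Both double-2 idioms above (gather and broadcast) rest on UNPCKLPD. A small C sketch, assuming SSE2 via emmintrin.h, with the hypothetical names gather2_f64 and broadcast_f64:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Builds the vector (a, b), with a in the low lane. */
__m128d gather2_f64(double a, double b) {
    /* MOVSD for each scalar, then UNPCKLPD to interleave the low lanes. */
    return _mm_unpacklo_pd(_mm_set_sd(a), _mm_set_sd(b));
}

/* Copies x into both lanes. */
__m128d broadcast_f64(double x) {
    __m128d v = _mm_set_sd(x);      /* MOVSD: x in the low lane */
    return _mm_unpacklo_pd(v, v);   /* UNPCKLPD with itself: (x, x) */
}
```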

Absolute value

Horizontal add with SSE2

movapd xmm1, xmm0    ; xmm1 => b a
unpckhpd xmm1, xmm1  ; xmm1 => b b
addsd xmm0, xmm1     ; xmm0 => b a+b (sum in the low lane)
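
The same three steps as a C sketch, assuming emmintrin.h; hsum_f64 is a name picked for illustration, and it returns the low lane as a scalar for convenience.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Returns the sum of the two lanes of v. */
double hsum_f64(__m128d v) {
    __m128d hi = _mm_unpackhi_pd(v, v);  /* UNPCKHPD: high lane in both lanes */
    __m128d sum = _mm_add_sd(v, hi);     /* ADDSD: low lane = low + high */
    return _mm_cvtsd_f64(sum);           /* extract the low lane */
}
```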

References

For full details, consult Intel's or AMD's instruction set reference documentation.

This revision created on Mon, 28 Sep 2009 17:10:11 by jckarter