
SSE
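This is a quick reference for Intel's Streaming SIMD Extensions. Feel free to make additions or corrections!

Vector types

The vector types here are named with the same convention as in Factor's SIMD library: an element type followed by the number of elements in a 128-bit register. For example, char16 is sixteen signed 8-bit integers, uint4 is four unsigned 32-bit integers, and float4 is four single-precision floats.
 char16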
 uchar16
 short8
 ushort8
 int4
 uint4
 longlong2
 ulonglong2
 float4
 double2
Instruction set

The number next to each instruction is the SSE version that introduced it: 1 is SSE, 2 is SSE2, 3 is SSE3, 3.3 is SSSE3, and 4.1 is SSE4.1. A dash marks a combination for which the grid lists no instruction.

operation           char16         uchar16        short8         ushort8        int4           uint4          longlong2      ulonglong2     float4         double2
move*               MOVDQ[AU] 2    MOVDQ[AU] 2    MOVDQ[AU] 2    MOVDQ[AU] 2    MOVDQ[AU] 2    MOVDQ[AU] 2    MOVDQ[AU] 2    MOVDQ[AU] 2    MOV[AU]PS 1    MOV[AU]PD 2
add                 PADDB 2        PADDB 2        PADDW 2        PADDW 2        PADDD 2        PADDD 2        PADDQ 2        PADDQ 2        ADDPS 1        ADDPD 2
subtract            PSUBB 2        PSUBB 2        PSUBW 2        PSUBW 2        PSUBD 2        PSUBD 2        PSUBQ 2        PSUBQ 2        SUBPS 1        SUBPD 2
saturated add       PADDSB 2       PADDUSB 2      PADDSW 2       PADDUSW 2      -              -              -              -              -              -
saturated subtract  PSUBSB 2       PSUBUSB 2      PSUBSW 2       PSUBUSW 2      -              -              -              -              -              -
addsubtract         -              -              -              -              -              -              -              -              ADDSUBPS 3     ADDSUBPD 3
horizontal add      -              -              PHADDW 3.3     PHADDW 3.3     PHADDD 3.3     PHADDD 3.3     -              -              HADDPS 3       HADDPD 3
multiply            -              -              PMULLW 2       PMULLW 2       PMULLD 4.1     PMULLD 4.1     -              -              MULPS 1        MULPD 2
divide              -              -              -              -              -              -              -              -              DIVPS 1        DIVPD 2
absolute value      PABSB 3.3      -              PABSW 3.3      -              PABSD 3.3      -              -              -              -              -
minimum             PMINSB 4.1     PMINUB 2       PMINSW 2       PMINUW 4.1     PMINSD 4.1     PMINUD 4.1     -              -              MINPS 1        MINPD 2
maximum             PMAXSB 4.1     PMAXUB 2       PMAXSW 2       PMAXUW 4.1     PMAXSD 4.1     PMAXUD 4.1     -              -              MAXPS 1        MAXPD 2
approx reciprocal   -              -              -              -              -              -              -              -              RCPPS 1        -
square root         -              -              -              -              -              -              -              -              SQRTPS 1       SQRTPD 2
comparison          PCMPxxB† 2     PCMPxxB† 2     PCMPxxW† 2     PCMPxxW† 2     PCMPxxD† 2     PCMPxxD† 2     -              -              CMPxxxPS‡ 1    CMPxxxPD‡ 2
bitwise and         PAND 2         PAND 2         PAND 2         PAND 2         PAND 2         PAND 2         PAND 2         PAND 2         ANDPS 1        ANDPD 2
bitwise or          POR 2          POR 2          POR 2          POR 2          POR 2          POR 2          POR 2          POR 2          ORPS 1         ORPD 2
bitwise xor         PXOR 2         PXOR 2         PXOR 2         PXOR 2         PXOR 2         PXOR 2         PXOR 2         PXOR 2         XORPS 1        XORPD 2
load mask           PMOVMSKB 2     PMOVMSKB 2     PMOVMSKB 2     PMOVMSKB 2     PMOVMSKB 2     PMOVMSKB 2     PMOVMSKB 2     PMOVMSKB 2     MOVMSKPS 1     MOVMSKPD 2
shift left          -              -              PSLLW 2        PSLLW 2        PSLLD 2        PSLLD 2        PSLLQ 2        PSLLQ 2        -              -
shift right         -              -              PSRAW 2        PSRLW 2        PSRAD 2        PSRLD 2        -              PSRLQ 2        -              -
unpack low          PUNPCKLBW 2    PUNPCKLBW 2    PUNPCKLWD 2    PUNPCKLWD 2    PUNPCKLDQ 2    PUNPCKLDQ 2    PUNPCKLQDQ 2   PUNPCKLQDQ 2   UNPCKLPS 1     UNPCKLPD 2
unpack high         PUNPCKHBW 2    PUNPCKHBW 2    PUNPCKHWD 2    PUNPCKHWD 2    PUNPCKHDQ 2    PUNPCKHDQ 2    PUNPCKHQDQ 2   PUNPCKHQDQ 2   UNPCKHPS 1     UNPCKHPD 2
static shuffle§     -              -              PSHUF[HL]W‖ 2  PSHUF[HL]W‖ 2  PSHUFD 2       PSHUFD 2       PSHUFD 2       PSHUFD 2       SHUFPS¶ 1      SHUFPD¶ 2
dynamic shuffle     PSHUFB 3.3     PSHUFB 3.3     PSHUFB 3.3     PSHUFB 3.3     PSHUFB 3.3     PSHUFB 3.3     PSHUFB 3.3     PSHUFB 3.3     -              -

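To illustrate how two rows of the grid differ in practice, here is a minimal sketch in C using the compiler intrinsics from <emmintrin.h> that map to PADDB and PADDUSB. The function name add_uchar16 and its saturate flag are illustrative assumptions, not part of any library.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Adds two uchar16 vectors stored in memory, lane by lane.
   saturate != 0 uses PADDUSB (each sum clamps at 255);
   saturate == 0 uses PADDB (each sum wraps modulo 256). */
void add_uchar16(const uint8_t a[16], const uint8_t b[16],
                 uint8_t out[16], int saturate)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);  /* MOVDQU load */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i sum = saturate ? _mm_adds_epu8(va, vb)     /* PADDUSB */
                           : _mm_add_epi8(va, vb);     /* PADDB */
    _mm_storeu_si128((__m128i *)out, sum);             /* MOVDQU store */
}
```

With every lane of a set to 200 and every lane of b set to 100, the wrapping form yields 44 in each lane and the saturating form yields 255.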
Notes:
 The SSE2 integer SIMD mnemonics are the same as the MMX mnemonics; however, using them with SSE XMM registers rather than MMX MM registers generates different instructions.
 There are many more instructions that do not fit in this grid, but these are the most important ones to know.
 * Every move instruction has an aligned (A) and unaligned (U) form. Aligned is faster, but will trap if your address is not a multiple of 16 bytes.
 † Equality (PCMPEQ_) and signed greater-than (PCMPGT_) operations are provided for integer vectors. For signed less-than, swap the operands. For signed less-than-or-equal or greater-than-or-equal, perform the PCMPEQ and PCMPGT comparisons and POR the results together. For unsigned tests, first bias both inputs by PXORing each component with 0x80, 0x8000, or 0x80000000 (which flips its sign bit), then use the signed comparison.
 ‡ The following floating-point comparison operations are provided: EQ, LT, LE, UNORD, NEQ, NLT, NLE, and ORD. To get greater-than comparisons, swap the operands. LT, LE, NLT, and NLE are signaling comparisons and will raise the Invalid floating-point exception if either input is a NaN, quiet or signaling; the other four raise it only for signaling NaNs.
 § Some shuffle patterns for some vector types can be achieved with specialized instructions that may have better performance or code size than the generalized shuffle instruction. See "Special shuffles" under each vector type below.
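 ‖ 16-bit element shuffles only shuffle half of the register at a time: PSHUFLW permutes the low four elements and PSHUFHW the high four, each copying the other half unchanged.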
 ¶ Floating-point shuffles select the low elements of the result from the destination register and the high elements from the source. To shuffle a single vector, use the same register for source and destination.
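The aligned and unaligned move forms from note * can be sketched with intrinsics as follows. This is only an illustration: is_aligned16 and load_int4 are hypothetical helper names, and a compiler is free to pick either instruction form when it can prove alignment itself.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Returns nonzero when p is a multiple of 16 bytes. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Loads four 32-bit integers, using the aligned form only when it is safe.
   Calling _mm_load_si128 on a misaligned address would fault. */
__m128i load_int4(const int32_t *p)
{
    return is_aligned16(p)
        ? _mm_load_si128((const __m128i *)p)    /* MOVDQA */
        : _mm_loadu_si128((const __m128i *)p);  /* MOVDQU */
}
```

In real code the branch is usually avoided by guaranteeing alignment at allocation time, or by always using the unaligned form.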
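The integer comparison recipes in note † translate to intrinsics roughly as below. The function names are made up for this sketch; each returns a mask with all bits set in the lanes where the test holds.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Unsigned a > b on uint4: flip the sign bit of every lane with PXOR,
   then use the signed comparison PCMPGTD. */
__m128i cmpgt_uint4(__m128i a, __m128i b)
{
    const __m128i bias = _mm_set1_epi32(INT32_MIN);  /* 0x80000000 per lane */
    return _mm_cmpgt_epi32(_mm_xor_si128(a, bias), _mm_xor_si128(b, bias));
}

/* Signed a < b on int4: PCMPGTD with the operands swapped. */
__m128i cmplt_int4(__m128i a, __m128i b)
{
    return _mm_cmpgt_epi32(b, a);
}

/* Signed a >= b on int4: PCMPGTD and PCMPEQD results combined with POR. */
__m128i cmpge_int4(__m128i a, __m128i b)
{
    return _mm_or_si128(_mm_cmpgt_epi32(a, b), _mm_cmpeq_epi32(a, b));
}
```

The same pattern applies to the 8-bit and 16-bit element types with the B and W variants and the 0x80 and 0x8000 biases.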
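For the floating-point comparisons in note ‡, a small sketch with the SSE intrinsics shows the swapped-operand greater-than and how NaN lanes behave. The function names are illustrative assumptions; lane_mask simply wraps MOVMSKPS to turn a comparison result into four bits.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* a > b per lane, written as b < a: CMPLTPS with the operands swapped.
   A NaN in either input makes the lane false. */
__m128 cmpgt_float4(__m128 a, __m128 b)
{
    return _mm_cmplt_ps(b, a);
}

/* not (a <= b) per lane: CMPNLEPS. Unlike greater-than, a NaN in either
   input makes the lane true. */
__m128 cmpnle_float4(__m128 a, __m128 b)
{
    return _mm_cmpnle_ps(a, b);
}

/* Lanes where either input is a NaN: CMPUNORDPS. */
__m128 unordered_float4(__m128 a, __m128 b)
{
    return _mm_cmpunord_ps(a, b);
}

/* Collects the sign bit of each lane into bits 0..3: MOVMSKPS. */
int lane_mask(__m128 m)
{
    return _mm_movemask_ps(m);
}
```

The difference between the first two functions is the reason NLE cannot be used as a drop-in greater-than when NaNs may be present.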
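The static shuffle behaviour described in notes ‖ and ¶ can be checked with a short sketch. In _mm_shuffle_ps(a, b, imm), the first argument plays the role of the destination register of SHUFPS and the second the source. All three function names are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _MM_SHUFFLE */

/* SHUFPS with two different inputs: result lanes 0 and 1 are picked
   from a, result lanes 2 and 3 are picked from b. */
__m128 low_a_high_b(__m128 a, __m128 b)
{
    return _mm_shuffle_ps(a, b, _MM_SHUFFLE(3, 2, 1, 0));
}

/* SHUFPS on a single vector: pass it as both operands. */
__m128 reverse_float4(__m128 v)
{
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3));
}

/* Reversing eight 16-bit elements takes three steps, because PSHUFLW
   and PSHUFHW each permute only one half of the register. */
__m128i reverse_short8(__m128i v)
{
    v = _mm_shufflelo_epi16(v, _MM_SHUFFLE(0, 1, 2, 3));   /* PSHUFLW: reverse elements 0..3 */
    v = _mm_shufflehi_epi16(v, _MM_SHUFFLE(0, 1, 2, 3));   /* PSHUFHW: reverse elements 4..7 */
    return _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2));  /* PSHUFD: swap the two halves */
}
```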
Idioms

int4

Select nth component

Gather four integers into a vector

    punpckldq xmm0, xmm1    ; xmm0 => ? ? 1 0
    punpckldq xmm2, xmm3    ; xmm2 => ? ? 3 2
    punpcklqdq xmm0, xmm2   ; xmm0 => 3 2 1 0

float4

Select nth component

Gather four floats into a vector

    movss dst, src1         ; dst  => ? ? ? 1
    unpcklps dst, src2      ; dst  => ? ? 2 1
    unpcklps src3, src4     ; src3 => ? ? 4 3 (overwrites src3)
    movlhps dst, src3       ; dst  => 4 3 2 1

Broadcast float into four components

    movss dst, src
    shufps dst, dst, 0x0    ; copy element 0 into all four elements

Absolute value

Horizontal add with SSE2

    movaps xmm1, xmm0       ; xmm0 = xmm1 = d c b a (high to low)
    shufps xmm0, xmm1, 0xb1 ; xmm0 = c d a b (swap within each pair)
    addps xmm0, xmm1        ; xmm0 = c+d c+d a+b a+b
    movaps xmm1, xmm0
    shufps xmm0, xmm0, 0x0a ; xmm0 = a+b a+b c+d c+d
    addps xmm0, xmm1        ; every element = a+b+c+d

Special shuffles

    order (low to high)   code
    0 0 2 2               movsldup dst, src
    1 1 3 3               movshdup dst, src
    0 1 0 1               movlhps dst, dst
    2 3 2 3               movhlps dst, dst
    0 0 1 1               unpcklps dst, dst
    2 2 3 3               unpckhps dst, dst

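For readers working from C rather than assembly, the int4 and float4 idioms above might be written with intrinsics roughly as follows. This is a sketch under the assumption that the compiler keeps close to the listed instructions; the function names are invented for illustration.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Gathers the low 32-bit elements of four vectors into one int4:
   PUNPCKLDQ twice, then PUNPCKLQDQ. */
__m128i gather_int4(__m128i x0, __m128i x1, __m128i x2, __m128i x3)
{
    __m128i lo = _mm_unpacklo_epi32(x0, x1);  /* ? ? 1 0 (high to low) */
    __m128i hi = _mm_unpacklo_epi32(x2, x3);  /* ? ? 3 2 */
    return _mm_unpacklo_epi64(lo, hi);        /* 3 2 1 0 */
}

/* Broadcasts one float to all four elements: MOVSS, then SHUFPS 0x0. */
__m128 broadcast_float4(float x)
{
    __m128 v = _mm_set_ss(x);
    return _mm_shuffle_ps(v, v, 0x00);
}

/* Horizontal add without HADDPS: two shuffle-and-add rounds.
   Every element of the result holds the sum of the four inputs. */
__m128 hadd_float4(__m128 v)
{
    __m128 t = _mm_shuffle_ps(v, v, 0xb1);    /* swap within each pair */
    v = _mm_add_ps(v, t);                     /* pair sums */
    t = _mm_shuffle_ps(v, v, 0x0a);           /* bring the other pair's sum across */
    return _mm_add_ps(v, t);                  /* total in every element */
}
```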
double2

Select nth component

Gather two doubles into a vector

    movsd dst, src1         ; dst => ? 1
    unpcklpd dst, src2      ; dst => 2 1

Broadcast double into two components

    movddup dst, src

Absolute value

Horizontal add with SSE2

    movapd xmm1, xmm0       ; xmm0 = xmm1 = b a (high to low)
    unpckhpd xmm1, xmm1     ; xmm1 = b b
    addsd xmm0, xmm1        ; low element of xmm0 = a+b

Special shuffles

    order (low to high)   code
    0 0                   unpcklpd dst, dst or movddup dst, src
    1 1                   unpckhpd dst, dst

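The double2 idioms have equally short intrinsic counterparts. The sketch below sticks to SSE2, so the broadcast uses the UNPCKLPD form from the special shuffles table rather than MOVDDUP; the function names are illustrative only.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Gathers two doubles into one vector: MOVSD, then UNPCKLPD. */
__m128d gather_double2(double lo, double hi)
{
    return _mm_unpacklo_pd(_mm_set_sd(lo), _mm_set_sd(hi));
}

/* Broadcasts one double to both elements: UNPCKLPD dst, dst. */
__m128d broadcast_double2(double x)
{
    __m128d v = _mm_set_sd(x);
    return _mm_unpacklo_pd(v, v);
}

/* Horizontal add: copy the high element down with UNPCKHPD, then ADDSD.
   Only the low element of the sum is meaningful, so it is returned as a scalar. */
double hadd_double2(__m128d v)
{
    __m128d hi = _mm_unpackhi_pd(v, v);
    return _mm_cvtsd_f64(_mm_add_sd(v, hi));
}
```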
References

For full details, consult Intel's or AMD's instruction set reference documentation.
This revision created on Mon, 28 Sep 2009 18:24:21 by jckarter
