This is a quick reference for Intel's Streaming SIMD Extensions. Feel free to make additions or corrections!
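The vector types here are named with the same convention as in Factor's SIMD library: the element type followed by the element count, so char-16 is a vector of sixteen signed 8-bit integers and float-4 is a vector of four single-precision floats.
The number next to each instruction is the SSE version it requires: 1 for SSE, 2 for SSE2, 3 for SSE3, 3.3 for SSSE3, and 4.1 for SSE4.1.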
operation | char-16 | uchar-16 | short-8 | ushort-8 | int-4 | uint-4 | longlong-2 | ulonglong-2 | float-4 | double-2 |
move* | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOV[AU]PS 1 | MOV[AU]PD 2 |
add | PADDB 2 | PADDB 2 | PADDW 2 | PADDW 2 | PADDD 2 | PADDD 2 | PADDQ 2 | PADDQ 2 | ADDPS 1 | ADDPD 2 |
subtract | PSUBB 2 | PSUBB 2 | PSUBW 2 | PSUBW 2 | PSUBD 2 | PSUBD 2 | PSUBQ 2 | PSUBQ 2 | SUBPS 1 | SUBPD 2 |
saturated add | PADDSB 2 | PADDUSB 2 | PADDSW 2 | PADDUSW 2 | | | | | | |
saturated subtract | PSUBSB 2 | PSUBUSB 2 | PSUBSW 2 | PSUBUSW 2 | | | | | | |
add-subtract | | | | | | | | | ADDSUBPS 3 | ADDSUBPD 3 |
horizontal add | | | PHADDW 3.3 | PHADDW 3.3 | PHADDD 3.3 | PHADDD 3.3 | | | HADDPS 3 | HADDPD 3 |
multiply | | | PMULLW 2 | PMULLW 2 | PMULLD 4.1 | PMULLD 4.1 | | | MULPS 1 | MULPD 2 |
divide | | | | | | | | | DIVPS 1 | DIVPD 2 |
absolute value | PABSB 3.3 | | PABSW 3.3 | | PABSD 3.3 | | | | | |
minimum | PMINSB 4.1 | PMINUB 2 | PMINSW 2 | PMINUW 4.1 | PMINSD 4.1 | PMINUD 4.1 | | | MINPS 1 | MINPD 2 |
maximum | PMAXSB 4.1 | PMAXUB 2 | PMAXSW 2 | PMAXUW 4.1 | PMAXSD 4.1 | PMAXUD 4.1 | | | MAXPS 1 | MAXPD 2 |
approx reciprocal | | | | | | | | | RCPPS 1 | |
square root | | | | | | | | | SQRTPS 1 | SQRTPD 2 |
comparison | PCMPxxB† 2 | PCMPxxB† 2 | PCMPxxW† 2 | PCMPxxW† 2 | PCMPxxD† 2 | PCMPxxD† 2 | | | CMPxxxPS‡ 1 | CMPxxxPD‡ 2 |
bitwise and | PAND 2 | PAND 2 | PAND 2 | PAND 2 | PAND 2 | PAND 2 | PAND 2 | PAND 2 | ANDPS 1 | ANDPD 2 |
bitwise and-not | PANDN 2 | PANDN 2 | PANDN 2 | PANDN 2 | PANDN 2 | PANDN 2 | PANDN 2 | PANDN 2 | ANDNPS 1 | ANDNPD 2 |
bitwise or | POR 2 | POR 2 | POR 2 | POR 2 | POR 2 | POR 2 | POR 2 | POR 2 | ORPS 1 | ORPD 2 |
bitwise xor | PXOR 2 | PXOR 2 | PXOR 2 | PXOR 2 | PXOR 2 | PXOR 2 | PXOR 2 | PXOR 2 | XORPS 1 | XORPD 2 |
load mask | PMOVMSKB 2 | PMOVMSKB 2 | PMOVMSKB 2 | PMOVMSKB 2 | PMOVMSKB 2 | PMOVMSKB 2 | PMOVMSKB 2 | PMOVMSKB 2 | MOVMSKPS 1 | MOVMSKPD 2 |
shift left | | | PSLLW 2 | PSLLW 2 | PSLLD 2 | PSLLD 2 | PSLLQ 2 | PSLLQ 2 | | |
shift right | | | PSRAW 2 | PSRLW 2 | PSRAD 2 | PSRLD 2 | | PSRLQ 2 | | |
unpack low | PUNPCKLBW 2 | PUNPCKLBW 2 | PUNPCKLWD 2 | PUNPCKLWD 2 | PUNPCKLDQ 2 | PUNPCKLDQ 2 | PUNPCKLQDQ 2 | PUNPCKLQDQ 2 | UNPCKLPS 1 | UNPCKLPD 2 |
unpack high | PUNPCKHBW 2 | PUNPCKHBW 2 | PUNPCKHWD 2 | PUNPCKHWD 2 | PUNPCKHDQ 2 | PUNPCKHDQ 2 | PUNPCKHQDQ 2 | PUNPCKHQDQ 2 | UNPCKHPS 1 | UNPCKHPD 2 |
static shuffle§ | | | PSHUF[HL]W‖ 2 | PSHUF[HL]W‖ 2 | PSHUFD 2 | PSHUFD 2 | PSHUFD 2 | PSHUFD 2 | SHUFPS¶ 1 | SHUFPD¶ 2 |
variable shuffle | PSHUFB 3.3 | PSHUFB 3.3 | PSHUFB 3.3 | PSHUFB 3.3 | PSHUFB 3.3 | PSHUFB 3.3 | PSHUFB 3.3 | PSHUFB 3.3 | | |
static blend | | | PBLENDW 4.1 | PBLENDW 4.1 | PBLENDW 4.1 | PBLENDW 4.1 | PBLENDW 4.1 | PBLENDW 4.1 | BLENDPS 4.1 | BLENDPD 4.1 |
variable blend# | PBLENDVB 4.1 | PBLENDVB 4.1 | PBLENDVB 4.1 | PBLENDVB 4.1 | PBLENDVB 4.1 | PBLENDVB 4.1 | PBLENDVB 4.1 | PBLENDVB 4.1 | BLENDVPS 4.1 | BLENDVPD 4.1 |
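To make the difference between the add and saturated add rows concrete, here is a minimal sketch for uchar-16 written with the SSE2 compiler intrinsics from emmintrin.h instead of raw assembly. It assumes an x86-64 compiler where SSE2 is enabled by default; the helper and demo names are invented for this example and do not come from Factor or any library.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: wrapping add of sixteen unsigned bytes (PADDB). */
static inline void add_wrap_u8(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a); /* MOVDQU load */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi8(va, vb));
}

/* Hypothetical helper: saturated add of sixteen unsigned bytes (PADDUSB). */
static inline void add_sat_u8(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_adds_epu8(va, vb));
}

/* Usage: 200 + 100 wraps around to 44, but saturates at 255. */
void demo_saturated_add(void)
{
    uint8_t a[16], b[16], wrapped[16], clamped[16];
    for (int i = 0; i < 16; i++) {
        a[i] = 200;
        b[i] = 100;
    }
    add_wrap_u8(a, b, wrapped);
    add_sat_u8(a, b, clamped);
    for (int i = 0; i < 16; i++) {
        assert(wrapped[i] == 44);
        assert(clamped[i] == 255);
    }
}
```

The unaligned load and store correspond to the U variant in the move row; aligned data could use _mm_load_si128 and _mm_store_si128 (MOVDQA) instead.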
Notes:
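Variable blend (#) of integer vectors without SSE 4.1, emulated with bitwise operations (this overwrites xmm1):
; mask is in xmm0
; if-true is in xmm1
; if-false is in xmm2
; blended result is in xmm3
pand xmm1, xmm0
movdqa xmm3, xmm0
pandn xmm3, xmm2
por xmm3, xmm1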
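The same and / and-not / or pattern can be sketched in C with SSE2 intrinsics. As an illustration, the blend is used here to emulate a signed int-4 maximum (PMAXSD in the table needs SSE 4.1) from a PCMPGTD mask. The function names are hypothetical, and the sketch assumes every mask element is either all ones or all zeros, which is what the comparison instructions produce.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: per-bit select, (if_true & mask) | (if_false & ~mask). */
static inline __m128i blend_si128(__m128i mask, __m128i if_true, __m128i if_false)
{
    __m128i t = _mm_and_si128(if_true, mask);     /* PAND */
    __m128i f = _mm_andnot_si128(mask, if_false); /* PANDN: ~mask & if_false */
    return _mm_or_si128(t, f);                    /* POR */
}

/* Hypothetical helper: signed 32-bit maximum using only SSE2. */
static inline __m128i max_epi32_sse2(__m128i a, __m128i b)
{
    __m128i a_gt_b = _mm_cmpgt_epi32(a, b); /* PCMPGTD: all ones where a > b */
    return blend_si128(a_gt_b, a, b);
}

/* Usage: element-wise maximum of two int-4 vectors. */
void demo_blend(void)
{
    int32_t out[4];
    __m128i a = _mm_setr_epi32(1, -5, 7, 0);  /* elements 0..3 */
    __m128i b = _mm_setr_epi32(2, -6, 7, -1);
    _mm_storeu_si128((__m128i *)out, max_epi32_sse2(a, b));
    assert(out[0] == 2 && out[1] == -5 && out[2] == 7 && out[3] == 0);
}
```

Unlike the assembly version, the intrinsics leave register allocation to the compiler, so no input is visibly overwritten.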
Extracting element n of an int-4 directly to a GPR (requires SSE 4.1):
pextrd eax, xmm0, n
To low element of an XMM register:
Element 0: no-op (it is already the low element)
Element 1:
pshufd xmm0, xmm1, 0x55 ; 1 1 1 1
Element 2:
pshufd xmm0, xmm1, 0xaa ; 2 2 2 2
Element 3:
pshufd xmm0, xmm1, 0xff ; 3 3 3 3
Use movd eax, xmm0 to move the selected element to a GPR.
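A hedged C sketch of the SSE2-only route: PSHUFD with an immediate of n * 0x55 (0x00, 0x55, 0xaa or 0xff) replicates element n into every position, and MOVD then reads the low element. EXTRACT_EPI32 is a made-up macro name; it is a macro rather than a function because the shuffle immediate has to be a compile-time constant.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics */

/* Hypothetical macro: n must be a literal 0..3.
   PSHUFD moves element n to the low position, MOVD copies it to a GPR. */
#define EXTRACT_EPI32(v, n) \
    _mm_cvtsi128_si32(_mm_shuffle_epi32((v), (n) * 0x55))

/* Usage: read each element of an int-4. */
void demo_extract(void)
{
    __m128i v = _mm_setr_epi32(10, 20, 30, 40); /* elements 0..3 */
    assert(EXTRACT_EPI32(v, 0) == 10);
    assert(EXTRACT_EPI32(v, 1) == 20);
    assert(EXTRACT_EPI32(v, 2) == 30);
    assert(EXTRACT_EPI32(v, 3) == 40);
}
```

With SSE 4.1 available, _mm_extract_epi32 from smmintrin.h maps to the single PEXTRD instruction instead.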
Building an int-4 directly from GPRs (requires SSE 4.1):
pinsrd xmm0, r8d, 0
pinsrd xmm0, r9d, 1
pinsrd xmm0, r10d, 2
pinsrd xmm0, r11d, 3
From low elements of XMM registers:
punpckldq xmm0, xmm1 ; xmm0 => ? ? 1 0
punpckldq xmm2, xmm3 ; xmm2 => ? ? 3 2
punpcklqdq xmm0, xmm2 ; xmm0 => 3 2 1 0
Use movd xmm0, eax to load the low element from a GPR.
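The unpack-based gather can be sketched with SSE2 intrinsics as follows. gather_epi32 is a hypothetical name; in real code _mm_setr_epi32 does the same job and lets the compiler choose the instructions, so this is only meant to show how the three unpack steps combine.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: build an int-4 from four scalars using only SSE2. */
static inline __m128i gather_epi32(int32_t e0, int32_t e1, int32_t e2, int32_t e3)
{
    __m128i x0 = _mm_cvtsi32_si128(e0); /* MOVD xmm, r32: scalar in the low element */
    __m128i x1 = _mm_cvtsi32_si128(e1);
    __m128i x2 = _mm_cvtsi32_si128(e2);
    __m128i x3 = _mm_cvtsi32_si128(e3);
    __m128i lo = _mm_unpacklo_epi32(x0, x1); /* PUNPCKLDQ: ? ? 1 0 */
    __m128i hi = _mm_unpacklo_epi32(x2, x3); /* PUNPCKLDQ: ? ? 3 2 */
    return _mm_unpacklo_epi64(lo, hi);       /* PUNPCKLQDQ: 3 2 1 0 */
}

/* Usage: the scalars land in elements 0..3 in argument order. */
void demo_gather(void)
{
    int32_t out[4];
    _mm_storeu_si128((__m128i *)out, gather_epi32(11, 22, 33, 44));
    assert(out[0] == 11 && out[1] == 22 && out[2] == 33 && out[3] == 44);
}
```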
Extracting element n of a float-4 into the low element of dst. Element 0 is a no-op:
movss dst, src
Element 1:
movshdup dst, src
Element 2:
movhlps dst, src
Element 3:
movaps dst, src
shufps dst, dst, 0xff ; 3 3 3 3
Building a float-4 from the low elements of four XMM registers:
unpcklps xmm0, xmm1 ; xmm0 => ? ? 1 0
unpcklps xmm2, xmm3 ; xmm2 => ? ? 3 2
movlhps xmm0, xmm2 ; xmm0 => 3 2 1 0
Broadcasting one element of a float-4 to all four positions:
movaps dst, src
shufps dst, dst, n
where n selects the element:
element | n |
0 | 0x00 |
1 | 0x55 |
2 | 0xaa |
3 | 0xff |
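The four immediates in this table are n * 0x55, which repeats the 2-bit index n in each of the four selector fields. A small C sketch using the SSE intrinsics from xmmintrin.h (BROADCAST_PS is a made-up macro name, and n must be a literal because SHUFPS takes an immediate):

```c
#include <assert.h>
#include <xmmintrin.h> /* SSE intrinsics */

/* Hypothetical macro: copy element n (a literal 0..3) of v into all four
   positions. Note that v is evaluated twice. */
#define BROADCAST_PS(v, n) _mm_shuffle_ps((v), (v), (n) * 0x55)

/* Usage: broadcast element 2 of a float-4. */
void demo_broadcast(void)
{
    float out[4];
    __m128 v = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f); /* elements 0..3 */
    _mm_storeu_ps(out, BROADCAST_PS(v, 2));          /* SHUFPS with 0xaa */
    for (int i = 0; i < 4; i++)
        assert(out[i] == 3.0f);
}
```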
Horizontal sum of a float-4 (every element of xmm0 ends up holding the total):
movaps xmm1, xmm0
shufps xmm0, xmm1, 0xb1 ; 1 0 3 2
addps xmm0, xmm1
movaps xmm1, xmm0
shufps xmm0, xmm0, 0x0a ; 2 2 0 0
addps xmm0, xmm1
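The shuffle-and-add sequence translates almost line for line into SSE intrinsics. This is a sketch with a hypothetical function name, not a tuned implementation; the comments list source elements from low to high, as the assembly comments do.

```c
#include <assert.h>
#include <xmmintrin.h> /* SSE intrinsics */

/* Hypothetical helper: sum the four elements of a float-4. */
static inline float hsum_ps(__m128 v)
{
    __m128 swapped = _mm_shuffle_ps(v, v, 0xb1);         /* 1 0 3 2 */
    __m128 pairs   = _mm_add_ps(v, swapped);             /* 0+1, 0+1, 2+3, 2+3 */
    __m128 crossed = _mm_shuffle_ps(pairs, pairs, 0x0a); /* 2 2 0 0 */
    __m128 total   = _mm_add_ps(pairs, crossed);         /* total in every element */
    return _mm_cvtss_f32(total);                         /* read the low element */
}

/* Usage: 1 + 2 + 3 + 4. */
void demo_hsum_ps(void)
{
    assert(hsum_ps(_mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f)) == 10.0f);
}
```

The additions are grouped as (0+1) + (2+3), so with values that are not exactly representable the result may differ in the last bits from a left-to-right scalar sum.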
Variable blend of float-4 vectors without SSE 4.1 (this overwrites xmm1):
; mask is in xmm0
; if-true is in xmm1
; if-false is in xmm2
; blended result is in xmm3
andps xmm1, xmm0
movaps xmm3, xmm0
andnps xmm3, xmm2
orps xmm3, xmm1
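float-4 shuffles that have dedicated instructions (order lists the source elements from low to high):
order | code |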
0 0 2 2 | movsldup dst, src |
1 1 3 3 | movshdup dst, src |
0 1 0 1 | movlhps dst, dst |
2 3 2 3 | movhlps dst, dst |
0 0 1 1 | unpcklps dst, dst |
2 2 3 3 | unpckhps dst, dst |
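Four of these orders can be checked with plain SSE intrinsics, as in the sketch below. The wrapper names are invented for illustration. MOVSLDUP and MOVSHDUP are left out because they need SSE3 and a matching compiler flag.

```c
#include <assert.h>
#include <xmmintrin.h> /* SSE intrinsics */

/* Hypothetical wrappers, one per table row that needs only SSE. */
static inline __m128 order_0101(__m128 v) { return _mm_movelh_ps(v, v); }   /* MOVLHPS */
static inline __m128 order_2323(__m128 v) { return _mm_movehl_ps(v, v); }   /* MOVHLPS */
static inline __m128 order_0011(__m128 v) { return _mm_unpacklo_ps(v, v); } /* UNPCKLPS */
static inline __m128 order_2233(__m128 v) { return _mm_unpackhi_ps(v, v); } /* UNPCKHPS */

/* Usage: element i of v holds the value i, so each result spells out its order. */
void demo_special_shuffles(void)
{
    float out[4];
    __m128 v = _mm_setr_ps(0.0f, 1.0f, 2.0f, 3.0f);

    _mm_storeu_ps(out, order_0101(v));
    assert(out[0] == 0.0f && out[1] == 1.0f && out[2] == 0.0f && out[3] == 1.0f);

    _mm_storeu_ps(out, order_2323(v));
    assert(out[0] == 2.0f && out[1] == 3.0f && out[2] == 2.0f && out[3] == 3.0f);

    _mm_storeu_ps(out, order_0011(v));
    assert(out[0] == 0.0f && out[1] == 0.0f && out[2] == 1.0f && out[3] == 1.0f);

    _mm_storeu_ps(out, order_2233(v));
    assert(out[0] == 2.0f && out[1] == 2.0f && out[2] == 3.0f && out[3] == 3.0f);
}
```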
Building a double-2 from the low elements of two XMM registers:
unpcklpd xmm0, xmm1
Broadcasting one element of a double-2 to both positions. Element 0:
movddup xmm0, xmm1
Element 1:
movapd xmm0, xmm1
unpckhpd xmm0, xmm0
Horizontal sum of a double-2 (the total ends up in the low element of xmm0):
movapd xmm1, xmm0
unpckhpd xmm1, xmm1
addsd xmm0, xmm1
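For comparison, a minimal C sketch of the same double-2 sum with SSE2 intrinsics (hsum_pd is a hypothetical name):

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics */

/* Hypothetical helper: add the two elements of a double-2. */
static inline double hsum_pd(__m128d v)
{
    __m128d high = _mm_unpackhi_pd(v, v);      /* UNPCKHPD: element 1 in both positions */
    return _mm_cvtsd_f64(_mm_add_sd(v, high)); /* ADDSD, then read the low element */
}

/* Usage: 1.5 + 2.25. */
void demo_hsum_pd(void)
{
    assert(hsum_pd(_mm_setr_pd(1.5, 2.25)) == 3.75);
}
```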
Variable blend of double-2 vectors without SSE 4.1 (this overwrites xmm1):
; mask is in xmm0
; if-true is in xmm1
; if-false is in xmm2
; blended result is in xmm3
andpd xmm1, xmm0
movapd xmm3, xmm0
andnpd xmm3, xmm2
orpd xmm3, xmm1
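double-2 shuffles that have dedicated instructions (order lists the source elements from low to high):
order | code |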
0 0 | unpcklpd dst, dst or movddup dst, src |
1 1 | unpckhpd dst, dst |
For full details, consult Intel's or AMD's instruction set reference documentation.
This revision created on Mon, 28 Sep 2009 19:27:12 by jckarter