This is a quick reference for Intel's Streaming SIMD Extensions. Feel free to make additions or corrections!
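The vector types here are named with the same convention as in Factor's SIMD library: the element type followed by the element count. For example, float-4 is a vector of four single-precision floats, and uchar-16 is a vector of sixteen unsigned bytes.
The number next to each instruction is the SSE version it requires (3.3 stands for SSSE3):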
operation | char-16 | uchar-16 | short-8 | ushort-8 | int-4 | uint-4 | longlong-2 | ulonglong-2 | float-4 | double-2 |
move* | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOV[AU]PS 1 | MOV[AU]PD 2 |
add | PADDB 2 | PADDB 2 | PADDW 2 | PADDW 2 | PADDD 2 | PADDD 2 | PADDQ 2 | PADDQ 2 | ADDPS 1 | ADDPD 2 |
subtract | PSUBB 2 | PSUBB 2 | PSUBW 2 | PSUBW 2 | PSUBD 2 | PSUBD 2 | PSUBQ 2 | PSUBQ 2 | SUBPS 1 | SUBPD 2 |
saturated add | PADDSB 2 | PADDUSB 2 | PADDSW 2 | PADDUSW 2 | | | | | | |
saturated subtract | PSUBSB 2 | PSUBUSB 2 | PSUBSW 2 | PSUBUSW 2 | | | | | | |
add-subtract | | | | | | | | | ADDSUBPS 3 | ADDSUBPD 3 |
horizontal add | | | PHADDW 3.3 | PHADDW 3.3 | PHADDD 3.3 | PHADDD 3.3 | | | HADDPS 3 | HADDPD 3 |
multiply | | | PMULLW 2 | PMULLW 2 | PMULLD 4.1 | PMULLD 4.1 | | | MULPS 1 | MULPD 2 |
divide | | | | | | | | | DIVPS 1 | DIVPD 2 |
absolute value | PABSB 3.3 | | PABSW 3.3 | | PABSD 3.3 | | | | | |
minimum | PMINSB 4.1 | PMINUB 2 | PMINSW 2 | PMINUW 4.1 | PMINSD 4.1 | PMINUD 4.1 | | | MINPS 1 | MINPD 2 |
maximum | PMAXSB 4.1 | PMAXUB 2 | PMAXSW 2 | PMAXUW 4.1 | PMAXSD 4.1 | PMAXUD 4.1 | | | MAXPS 1 | MAXPD 2 |
approx reciprocal | | | | | | | | | RCPPS 1 | |
square root | | | | | | | | | SQRTPS 1 | SQRTPD 2 |
comparison | PCMPxxB† 2 | PCMPxxB† 2 | PCMPxxW† 2 | PCMPxxW† 2 | PCMPxxD† 2 | PCMPxxD† 2 | | | CMPxxxPS‡ 1 | CMPxxxPD‡ 2 |
bitwise and | PAND 2 | PAND 2 | PAND 2 | PAND 2 | PAND 2 | PAND 2 | PAND 2 | PAND 2 | ANDPS 1 | ANDPD 2 |
bitwise and-not | PANDN 2 | PANDN 2 | PANDN 2 | PANDN 2 | PANDN 2 | PANDN 2 | PANDN 2 | PANDN 2 | ANDNPS 1 | ANDNPD 2 |
bitwise or | POR 2 | POR 2 | POR 2 | POR 2 | POR 2 | POR 2 | POR 2 | POR 2 | ORPS 1 | ORPD 2 |
bitwise xor | PXOR 2 | PXOR 2 | PXOR 2 | PXOR 2 | PXOR 2 | PXOR 2 | PXOR 2 | PXOR 2 | XORPS 1 | XORPD 2 |
bitwise test | PTEST 4.1 | PTEST 4.1 | PTEST 4.1 | PTEST 4.1 | PTEST 4.1 | PTEST 4.1 | PTEST 4.1 | PTEST 4.1 | | |
load mask | PMOVMSKB 2 | PMOVMSKB 2 | PMOVMSKB 2 | PMOVMSKB 2 | PMOVMSKB 2 | PMOVMSKB 2 | PMOVMSKB 2 | PMOVMSKB 2 | MOVMSKPS 1 | MOVMSKPD 2 |
shift left | | | PSLLW 2 | PSLLW 2 | PSLLD 2 | PSLLD 2 | PSLLQ 2 | PSLLQ 2 | | |
shift right | | | PSRAW 2 | PSRLW 2 | PSRAD 2 | PSRLD 2 | | PSRLQ 2 | | |
unpack low | PUNPCKLBW 2 | PUNPCKLBW 2 | PUNPCKLWD 2 | PUNPCKLWD 2 | PUNPCKLDQ 2 | PUNPCKLDQ 2 | PUNPCKLQDQ 2 | PUNPCKLQDQ 2 | UNPCKLPS 1 | UNPCKLPD 2 |
unpack high | PUNPCKHBW 2 | PUNPCKHBW 2 | PUNPCKHWD 2 | PUNPCKHWD 2 | PUNPCKHDQ 2 | PUNPCKHDQ 2 | PUNPCKHQDQ 2 | PUNPCKHQDQ 2 | UNPCKHPS 1 | UNPCKHPD 2 |
static shuffle§ | | | PSHUF[HL]W‖ 2 | PSHUF[HL]W‖ 2 | PSHUFD 2 | PSHUFD 2 | PSHUFD 2 | PSHUFD 2 | SHUFPS¶ 1 | SHUFPD¶ 2 |
variable shuffle | PSHUFB 3.3 | PSHUFB 3.3 | PSHUFB 3.3 | PSHUFB 3.3 | PSHUFB 3.3 | PSHUFB 3.3 | PSHUFB 3.3 | PSHUFB 3.3 | | |
static blend | | | PBLENDW 4.1 | PBLENDW 4.1 | PBLENDW 4.1 | PBLENDW 4.1 | PBLENDW 4.1 | PBLENDW 4.1 | BLENDPS 4.1 | BLENDPD 4.1 |
variable blend# | PBLENDVB 4.1 | PBLENDVB 4.1 | PBLENDVB 4.1 | PBLENDVB 4.1 | PBLENDVB 4.1 | PBLENDVB 4.1 | PBLENDVB 4.1 | PBLENDVB 4.1 | BLENDVPS 4.1 | BLENDVPD 4.1 |
Notes:
    ; mask is in xmm0 (destroyed)
    ; if-true is in xmm1 (destroyed)
    ; if-false is in xmm2
    ; blended result is in xmm0
    pand xmm1, xmm0
    pandn xmm0, xmm2
    por xmm0, xmm1
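The pand/pandn/por sequence above is a bitwise select: each result bit comes from if-true where the mask bit is set and from if-false where it is clear. A minimal scalar sketch in C of a single 32-bit lane may help when checking the logic by hand. The name blend_lane is made up for illustration; it is not an intrinsic or part of any library.

```c
#include <stdint.h>

/* Scalar model of one 32-bit lane of the pand/pandn/por blend.
   blend_lane is a hypothetical helper name. The mask would normally
   be all ones or all zeros, as produced by a PCMPxx comparison. */
static uint32_t blend_lane(uint32_t mask, uint32_t if_true, uint32_t if_false)
{
    uint32_t t = if_true & mask;     /* pand  xmm1, xmm0 */
    uint32_t f = ~mask & if_false;   /* pandn xmm0, xmm2 */
    return f | t;                    /* por   xmm0, xmm1 */
}
```

With an all-ones mask the function returns if_true, and with an all-zeros mask it returns if_false. A partial mask mixes the two bit by bit.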
Directly to a GPR (requires SSE 4.1):
    pextrd eax, xmm0, n
To low element of an XMM register:
    pshufd xmm0, xmm1, n
Use movd eax, xmm0 to move the selected element to a GPR.
Directly from GPRs (requires SSE 4.1):
    pinsrd xmm0, r8d, 0
    pinsrd xmm0, r9d, 1
    pinsrd xmm0, r10d, 2
    pinsrd xmm0, r11d, 3
From low elements of XMM registers:
    punpckldq xmm0, xmm1 ; xmm0 => 0 1 ? ?
    punpckldq xmm2, xmm3 ; xmm2 => 2 3 ? ?
    punpcklqdq xmm0, xmm2 ; xmm0 => 0 1 2 3
Use movd xmm0, eax to load the low element from a GPR.
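The unpack-based sequence works because punpckldq interleaves the low dwords of its two operands, and punpcklqdq then concatenates the two low qwords. The following is a rough scalar model in C, assuming element 0 is the low element. The int4 type and the function names are hypothetical stand-ins for the registers and instructions, not intrinsics.

```c
#include <stdint.h>

typedef struct { uint32_t e[4]; } int4;   /* e[0] is the low element */

/* Model of punpckldq dst, src: interleave the low two dwords. */
static int4 punpckldq(int4 dst, int4 src)
{
    int4 r = {{ dst.e[0], src.e[0], dst.e[1], src.e[1] }};
    return r;
}

/* Model of punpcklqdq dst, src: low qword of dst, then low qword of src. */
static int4 punpcklqdq(int4 dst, int4 src)
{
    int4 r = {{ dst.e[0], dst.e[1], src.e[0], src.e[1] }};
    return r;
}

/* Gather four values that each sit in the low element of their own
   register, mirroring the three-instruction sequence. */
static int4 build_int4(int4 x0, int4 x1, int4 x2, int4 x3)
{
    x0 = punpckldq(x0, x1);      /* 0 1 ? ? */
    x2 = punpckldq(x2, x3);      /* 2 3 ? ? */
    return punpcklqdq(x0, x2);   /* 0 1 2 3 */
}
```

Only the low element of each input contributes to the result; whatever sits in the upper elements is discarded by the final punpcklqdq.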
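    ; input is in xmm1 (destroyed)
    ; the sum ends up in every element of xmm0
    pshufd xmm0, xmm1, 0xb1 ; 1 0 3 2
    paddd xmm0, xmm1
    pshufd xmm1, xmm0, 0x0a ; 2 2 0 0
    paddd xmm0, xmm1
Replace paddd with another 32-bit integer operation, such as pmaxsd (SSE 4.1), to perform that operation horizontally.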
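To see why two shuffle-and-add steps are enough, it can help to trace the sequence with a scalar model. The sketch below is plain C rather than intrinsics. The int4 type and the function names are illustrative assumptions that mimic the instructions' documented behavior, with element 0 as the low element.

```c
#include <stdint.h>

typedef struct { uint32_t e[4]; } int4;   /* e[0] is the low element */

/* Model of pshufd dst, src, imm: result element i is the source
   element whose index is stored in bits 2i+1..2i of imm. */
static int4 pshufd(int4 src, unsigned imm)
{
    int4 r;
    for (int i = 0; i < 4; i++)
        r.e[i] = src.e[(imm >> (2 * i)) & 3];
    return r;
}

/* Model of paddd dst, src: lane-wise 32-bit add, wrapping on overflow. */
static int4 paddd(int4 a, int4 b)
{
    int4 r;
    for (int i = 0; i < 4; i++)
        r.e[i] = a.e[i] + b.e[i];
    return r;
}

/* Mirrors the four-instruction sequence; the input starts in xmm1. */
static int4 horizontal_add(int4 xmm1)
{
    int4 xmm0 = pshufd(xmm1, 0xb1);   /* 1 0 3 2 */
    xmm0 = paddd(xmm0, xmm1);         /* pairwise sums */
    xmm1 = pshufd(xmm0, 0x0a);        /* 2 2 0 0 */
    return paddd(xmm0, xmm1);         /* total in every element */
}
```

For the input 1 2 3 4, the first add produces 3 3 7 7, the second shuffle produces 7 7 3 3, and the final add leaves 10 in all four elements.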
pshufd selects the entire destination register from the source, unlike shufps and shufpd, which select half from the destination and half from the source. Because of this, it doesn't need an initial movdqa to be useful in most cases. Therefore, pshufd will be a size win over these instructions (which do require a move to be nondestructive) unless you're destructively replacing your input.
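The operand roles of shufps can be modeled in a few lines of scalar C. This is only a sketch: float4, shufps and permute_copy are illustrative names chosen here, not intrinsics, and element 0 is taken to be the low element.

```c
typedef struct { float e[4]; } float4;   /* e[0] is the low element */

/* Model of shufps dst, src, imm: the low two result elements are
   picked from dst, the high two from src. */
static float4 shufps(float4 dst, float4 src, unsigned imm)
{
    float4 r;
    r.e[0] = dst.e[imm & 3];
    r.e[1] = dst.e[(imm >> 2) & 3];
    r.e[2] = src.e[(imm >> 4) & 3];
    r.e[3] = src.e[(imm >> 6) & 3];
    return r;
}

/* Model of "movaps dst, src" followed by "shufps dst, dst, imm":
   the nondestructive permutation of one vector, which pshufd does
   in a single instruction for integer vectors. */
static float4 permute_copy(float4 src, unsigned imm)
{
    return shufps(src, src, imm);
}
```

With two different operands, immediate 0xb1 yields elements 1 and 0 of dst followed by elements 3 and 2 of src. Only when both operands hold the same vector does the result become the plain 1 0 3 2 permutation.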
order | code |
0 0 1 1 | punpckldq dst, dst |
2 2 3 3 | punpckhdq dst, dst |
0 1 0 1 | punpcklqdq dst, dst |
2 3 2 3 | punpckhqdq dst, dst |
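Element 0 is already in the low position, so this is a no-op when dst and src are the same register; otherwise:
    movss dst, src
Element 1:
    movshdup dst, src ; requires SSE3
Element 2:
    movhlps dst, src
Element 3:
    movaps dst, src
    shufps dst, dst, 0xff ; 3 3 3 3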
    unpcklps xmm0, xmm1 ; xmm0 => 0 1 ? ?
    unpcklps xmm2, xmm3 ; xmm2 => 2 3 ? ?
    movlhps xmm0, xmm2 ; xmm0 => 0 1 2 3
    movaps dst, src
    shufps dst, dst, n
Where n selects the element:
element | n |
0 | 0x00 |
1 | 0x55 |
2 | 0xaa |
3 | 0xff |
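The values in the table above follow from the immediate's layout: it holds four 2-bit element indices, one per result element, with the lowest bits controlling the low element. A small C sketch of that encoding is below. The helper names shuffle_imm and broadcast_imm are invented for this example.

```c
/* Pack four 2-bit element indices into a shuffle immediate.
   e0 picks the source element for result element 0 (the low one),
   e3 for result element 3. shuffle_imm is a hypothetical helper. */
static unsigned shuffle_imm(unsigned e0, unsigned e1, unsigned e2, unsigned e3)
{
    return (e0 & 3) | ((e1 & 3) << 2) | ((e2 & 3) << 4) | ((e3 & 3) << 6);
}

/* Immediate that copies one element into all four positions. */
static unsigned broadcast_imm(unsigned element)
{
    return shuffle_imm(element, element, element, element);
}
```

Here broadcast_imm(2) evaluates to 0xaa, matching the table. The same helper reproduces the immediates used elsewhere on this page: shuffle_imm(1, 0, 3, 2) is 0xb1 and shuffle_imm(2, 2, 0, 0) is 0x0a.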
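    ; input is in xmm0, xmm1 is used as scratch
    ; the sum ends up in every element of xmm0
    movaps xmm1, xmm0
    shufps xmm0, xmm1, 0xb1 ; 1 0 3 2
    addps xmm0, xmm1
    movaps xmm1, xmm0
    shufps xmm0, xmm0, 0x0a ; 2 2 0 0
    addps xmm0, xmm1
Replace addps with another packed single-precision instruction, such as mulps or maxps, to perform that operation horizontally.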
    ; mask is in xmm0 (destroyed)
    ; if-true is in xmm1 (destroyed)
    ; if-false is in xmm2
    ; blended result is in xmm0
    andps xmm1, xmm0
    andnps xmm0, xmm2
    orps xmm0, xmm1
order | code |
0 0 2 2 | movsldup dst, src |
1 1 3 3 | movshdup dst, src |
0 1 0 1 | movlhps dst, dst |
2 3 2 3 | movhlps dst, dst |
0 0 1 1 | unpcklps dst, dst |
2 2 3 3 | unpckhps dst, dst |
    unpcklpd xmm0, xmm1
Element 0:
    movddup xmm0, xmm1 ; requires SSE3
Element 1:
    movapd xmm0, xmm1
    unpckhpd xmm0, xmm0
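    ; input is in xmm0, xmm1 is used as scratch
    ; the sum ends up in the low element of xmm0
    movapd xmm1, xmm0
    unpckhpd xmm1, xmm1
    addsd xmm0, xmm1
Replace addsd with another scalar double-precision instruction, such as mulsd or maxsd, to perform that operation horizontally.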
    ; mask is in xmm0 (destroyed)
    ; if-true is in xmm1 (destroyed)
    ; if-false is in xmm2
    ; blended result is in xmm0
    andpd xmm1, xmm0
    andnpd xmm0, xmm2
    orpd xmm0, xmm1
order | code |
0 0 | unpcklpd dst, dst or movddup dst, src |
1 1 | unpckhpd dst, dst |
For full details, consult Intel's or AMD's instruction set reference documentation.
This revision created on Mon, 28 Sep 2009 20:53:26 by jckarter