This is a quick reference for Intel's Streaming SIMD Extensions. Feel free to make additions or corrections!
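The vector types here are named with the same convention as in Factor's SIMD library: the element type followed by the number of elements, so int-4 is a vector of four 32-bit signed integers.
The number next to each instruction is the SSE version that introduced it. Entries marked 3.3 are SSSE3 instructions.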
operation | char-16 | uchar-16 | short-8 | ushort-8 | int-4 | uint-4 | longlong-2 | ulonglong-2 | float-4 | double-2 |
move* | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOVDQ[AU] 2 | MOV[AU]PS 1 | MOV[AU]PD 2 |
add | PADDB 2 | PADDB 2 | PADDW 2 | PADDW 2 | PADDD 2 | PADDD 2 | PADDQ 2 | PADDQ 2 | ADDPS 1 | ADDPD 2 |
subtract | PSUBB 2 | PSUBB 2 | PSUBW 2 | PSUBW 2 | PSUBD 2 | PSUBD 2 | PSUBQ 2 | PSUBQ 2 | SUBPS 1 | SUBPD 2 |
saturated add | PADDSB 2 | PADDUSB 2 | PADDSW 2 | PADDUSW 2 | | | | | | |
saturated subtract | PSUBSB 2 | PSUBUSB 2 | PSUBSW 2 | PSUBUSW 2 | | | | | | |
add-subtract | | | | | | | | | ADDSUBPS 3 | ADDSUBPD 3 |
horizontal add | | | PHADDW 3.3 | PHADDW 3.3 | PHADDD 3.3 | PHADDD 3.3 | | | HADDPS 3 | HADDPD 3 |
multiply | | | PMULLW 2 | PMULLW 2 | PMULLD 4.1 | PMULLD 4.1 | | | MULPS 1 | MULPD 2 |
divide | | | | | | | | | DIVPS 1 | DIVPD 2 |
absolute value | PABSB 3.3 | | PABSW 3.3 | | PABSD 3.3 | | | | | |
minimum | PMINSB 4.1 | PMINUB 2 | PMINSW 2 | PMINUW 4.1 | PMINSD 4.1 | PMINUD 4.1 | | | MINPS 1 | MINPD 2 |
maximum | PMAXSB 4.1 | PMAXUB 2 | PMAXSW 2 | PMAXUW 4.1 | PMAXSD 4.1 | PMAXUD 4.1 | | | MAXPS 1 | MAXPD 2 |
approx reciprocal | | | | | | | | | RCPPS 1 | |
square root | | | | | | | | | SQRTPS 1 | SQRTPD 2 |
comparison | PCMPxxB† 2 | PCMPxxB† 2 | PCMPxxW† 2 | PCMPxxW† 2 | PCMPxxD† 2 | PCMPxxD† 2 | PCMPxxQ† 4.2 | PCMPxxQ† 4.2 | CMPxxxPS‡ 1 | CMPxxxPD‡ 2 |
bitwise and | PAND 2 | PAND 2 | PAND 2 | PAND 2 | PAND 2 | PAND 2 | PAND 2 | PAND 2 | ANDPS 1 | ANDPD 2 |
bitwise and-not | PANDN 2 | PANDN 2 | PANDN 2 | PANDN 2 | PANDN 2 | PANDN 2 | PANDN 2 | PANDN 2 | ANDNPS 1 | ANDNPD 2 |
bitwise or | POR 2 | POR 2 | POR 2 | POR 2 | POR 2 | POR 2 | POR 2 | POR 2 | ORPS 1 | ORPD 2 |
bitwise xor | PXOR 2 | PXOR 2 | PXOR 2 | PXOR 2 | PXOR 2 | PXOR 2 | PXOR 2 | PXOR 2 | XORPS 1 | XORPD 2 |
bitwise test | PTEST 4.1 | PTEST 4.1 | PTEST 4.1 | PTEST 4.1 | PTEST 4.1 | PTEST 4.1 | PTEST 4.1 | PTEST 4.1 | | |
load mask | PMOVMSKB 2 | PMOVMSKB 2 | PMOVMSKB 2 | PMOVMSKB 2 | PMOVMSKB 2 | PMOVMSKB 2 | PMOVMSKB 2 | PMOVMSKB 2 | MOVMSKPS 1 | MOVMSKPD 2 |
shift left | | | PSLLW 2 | PSLLW 2 | PSLLD 2 | PSLLD 2 | PSLLQ 2 | PSLLQ 2 | | |
shift right | | | PSRAW 2 | PSRLW 2 | PSRAD 2 | PSRLD 2 | | PSRLQ 2 | | |
unpack low | PUNPCKLBW 2 | PUNPCKLBW 2 | PUNPCKLWD 2 | PUNPCKLWD 2 | PUNPCKLDQ 2 | PUNPCKLDQ 2 | PUNPCKLQDQ 2 | PUNPCKLQDQ 2 | UNPCKLPS 1 | UNPCKLPD 2 |
unpack high | PUNPCKHBW 2 | PUNPCKHBW 2 | PUNPCKHWD 2 | PUNPCKHWD 2 | PUNPCKHDQ 2 | PUNPCKHDQ 2 | PUNPCKHQDQ 2 | PUNPCKHQDQ 2 | UNPCKHPS 1 | UNPCKHPD 2 |
static shuffle§ | | | PSHUF[HL]W‖ 2 | PSHUF[HL]W‖ 2 | PSHUFD 2 | PSHUFD 2 | PSHUFD 2 | PSHUFD 2 | SHUFPS¶ 1 | SHUFPD¶ 2 |
variable shuffle | PSHUFB 3.3 | PSHUFB 3.3 | PSHUFB 3.3 | PSHUFB 3.3 | PSHUFB 3.3 | PSHUFB 3.3 | PSHUFB 3.3 | PSHUFB 3.3 | | |
static blend | | | PBLENDW 4.1 | PBLENDW 4.1 | PBLENDW 4.1 | PBLENDW 4.1 | PBLENDW 4.1 | PBLENDW 4.1 | BLENDPS 4.1 | BLENDPD 4.1 |
variable blend# | PBLENDVB 4.1 | PBLENDVB 4.1 | PBLENDVB 4.1 | PBLENDVB 4.1 | PBLENDVB 4.1 | PBLENDVB 4.1 | PBLENDVB 4.1 | PBLENDVB 4.1 | BLENDVPS 4.1 | BLENDVPD 4.1 |
Notes:
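Fill a register with all zero bits:
pxor xmm0, xmm0
Fill a register with all one bits:
pcmpeqb xmm0, xmm0
Bitwise NOT of xmm1 into xmm0:
pcmpeqb xmm0, xmm0
pxor xmm0, xmm1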
Blend two vectors through a mask:
; mask is in xmm0 (destroyed)
; if-true is in xmm1 (destroyed)
; if-false is in xmm2
; blended result is in xmm0
pand xmm1, xmm0
pandn xmm0, xmm2
por xmm0, xmm1
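For readers working from C rather than assembly, the pand / pandn / por blend can be sketched with SSE2 intrinsics. This is only an illustrative sketch: the function names blend_si128 and max_epi32_sse2 are hypothetical, and it assumes a compiler that ships <emmintrin.h> and targets a CPU with SSE2.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Take bits from if_true where mask is 1 and from if_false where mask is 0. */
static inline __m128i blend_si128(__m128i mask, __m128i if_true, __m128i if_false)
{
    __m128i t = _mm_and_si128(if_true, mask);      /* pand:  if_true & mask     */
    __m128i f = _mm_andnot_si128(mask, if_false);  /* pandn: (~mask) & if_false */
    return _mm_or_si128(f, t);                     /* por */
}

/* Hypothetical use: signed 32-bit maximum without PMAXSD (which needs SSE 4.1). */
static inline __m128i max_epi32_sse2(__m128i a, __m128i b)
{
    __m128i a_gt_b = _mm_cmpgt_epi32(a, b);  /* pcmpgtd: all ones where a > b */
    return blend_si128(a_gt_b, a, b);
}
```

The mask is expected to be all ones or all zeros per element, as produced by the comparison instructions; the blend itself works bit by bit.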
Test whether any mask element is set:
Using PTEST: (requires SSE 4.1)
ptest xmm0, xmm0
jnz any
Using PMOVMSKB:
pmovmskb eax, xmm0
test eax, eax
jnz any
Test whether no mask element is set:
Using PTEST: (requires SSE 4.1)
ptest xmm0, xmm0
jz none
Using PMOVMSKB:
pmovmskb eax, xmm0
test eax, eax
jz none
Test whether every mask element is set:
Using PTEST: (requires SSE 4.1)
pcmpeqb xmm1, xmm1
ptest xmm0, xmm1
jc all
Using PMOVMSKB:
pmovmskb eax, xmm0
not ax
test ax, ax
jz all
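The PMOVMSKB variants of these tests can be sketched in C with the SSE2 intrinsic _mm_movemask_epi8. The helper names mask_any, mask_none and mask_all are hypothetical. Note that PMOVMSKB only collects the top bit of each byte, so the sketch assumes the input is a comparison result (each element all ones or all zeros).

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* pmovmskb packs the top bit of each of the 16 bytes into bits 0..15 of a GPR. */
static inline int mask_any(__m128i m)  { return _mm_movemask_epi8(m) != 0; }
static inline int mask_none(__m128i m) { return _mm_movemask_epi8(m) == 0; }
static inline int mask_all(__m128i m)  { return _mm_movemask_epi8(m) == 0xffff; }
```

Comparing against 0xffff plays the role of the not / test pair in the assembly version.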
Extract 32-bit element n:
Directly to a GPR: (requires SSE 4.1)
pextrd eax, xmm0, n
To low element of an XMM register:
pshufd xmm0, xmm1, n
Use movd eax, xmm0 to move the selected element to a GPR.
Build a vector from four 32-bit values:
Directly from GPRs: (requires SSE 4.1)
pinsrd xmm0, r8d, 0
pinsrd xmm0, r9d, 1
pinsrd xmm0, r10d, 2
pinsrd xmm0, r11d, 3
From low elements of XMM registers:
punpckldq xmm0, xmm1 ; xmm0 => 0 1 ? ?
punpckldq xmm2, xmm3 ; xmm2 => 2 3 ? ?
punpcklqdq xmm0, xmm2 ; xmm0 => 0 1 2 3
Use movd xmm0, eax to load the low element from a GPR.
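As a rough C counterpart of the unpack-based gather, the sketch below loads four scalars with movd and combines them with the same three unpack instructions. The name gather4_epi32 is hypothetical. In real code _mm_setr_epi32 does the same job, and the compiler is free to pick a different instruction sequence than the one written here.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Build the vector { e0, e1, e2, e3 } from four scalars using only SSE2. */
static inline __m128i gather4_epi32(int e0, int e1, int e2, int e3)
{
    __m128i x0 = _mm_cvtsi32_si128(e0);       /* movd: e0 0 0 0 */
    __m128i x1 = _mm_cvtsi32_si128(e1);
    __m128i x2 = _mm_cvtsi32_si128(e2);
    __m128i x3 = _mm_cvtsi32_si128(e3);
    __m128i lo = _mm_unpacklo_epi32(x0, x1);  /* punpckldq:  e0 e1 ? ? */
    __m128i hi = _mm_unpacklo_epi32(x2, x3);  /* punpckldq:  e2 e3 ? ? */
    return _mm_unpacklo_epi64(lo, hi);        /* punpcklqdq: e0 e1 e2 e3 */
}
```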
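Horizontal sum of the four 32-bit elements of xmm1 (every element of xmm0 ends up holding the total):
pshufd xmm0, xmm1, 0xb1 ; 1 0 3 2
paddd xmm0, xmm1
pshufd xmm1, xmm0, 0x0a ; 2 2 0 0
paddd xmm0, xmm1
Replace paddd with another commutative element-wise 32-bit integer instruction (for example pmaxsd) to perform that operation horizontally.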
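The same two-step shuffle-and-add reduction can be sketched with SSE2 intrinsics. The function name hsum_epi32 is hypothetical; the shuffle immediates 0xb1 and 0x0a are the ones used in the assembly above.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Sum the four 32-bit elements of v and return the total as a scalar. */
static inline int hsum_epi32(__m128i v)
{
    __m128i s = _mm_shuffle_epi32(v, 0xb1);  /* pshufd: 1 0 3 2 */
    s = _mm_add_epi32(s, v);                 /* pairwise sums: 0+1 0+1 2+3 2+3 */
    __m128i t = _mm_shuffle_epi32(s, 0x0a);  /* pshufd: 2 2 0 0 */
    s = _mm_add_epi32(s, t);                 /* every element now holds the total */
    return _mm_cvtsi128_si32(s);             /* movd eax, xmm0 */
}
```

After the second add every element holds the total, so the low element can be read out with movd.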
pshufd selects the entire destination register from the source, unlike shufps and shufpd, which select half of the result from the destination and half from the source. Because of this, pshufd doesn't need an initial movdqa to be useful in most cases. It is therefore a size win over shufps and shufpd (which do require a move to be nondestructive) unless you're destructively replacing your input.
order | code |
0 0 1 1 | punpckldq dst, dst |
2 2 3 3 | punpckhdq dst, dst |
0 1 0 1 | punpcklqdq dst, dst |
2 3 2 3 | punpckhqdq dst, dst |
Fill a register with all zero bits (single-precision):
xorps xmm0, xmm0
Fill a register with all one bits:
cmpnltps xmm0, xmm0
cmpeqps cannot be used on its own because NaN != NaN, so any element that held a NaN would be left as zero. Note that cmpnltps will raise the Invalid floating-point exception if the register being filled contains NaNs. If you care about the floating-point environment state, clear the register first:
xorps xmm0, xmm0
cmpeqps xmm0, xmm0
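The NaN caveat can be checked from C with the SSE intrinsics for cmpeqps and cmpnltps. This is a small sketch with hypothetical helper names; it returns the MOVMSKPS bit mask of a self-comparison so the difference is easy to see.

```c
#include <math.h>       /* NAN */
#include <xmmintrin.h>  /* SSE intrinsics */

/* Sign-bit mask of v == v: elements holding a NaN compare false. */
static inline int self_cmpeq_mask(__m128 v)
{
    return _mm_movemask_ps(_mm_cmpeq_ps(v, v));   /* cmpeqps + movmskps */
}

/* Sign-bit mask of !(v < v): true for every element, including NaNs. */
static inline int self_cmpnlt_mask(__m128 v)
{
    return _mm_movemask_ps(_mm_cmpnlt_ps(v, v));  /* cmpnltps + movmskps */
}
```

With NaNs in elements 1 and 3, the first helper yields only bits 0 and 2, while the second yields all four bits.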
Bitwise NOT of xmm1 into xmm0:
cmpnltps xmm0, xmm0
xorps xmm0, xmm1
Move element n of src to the low element of dst:
Element 0 is a no-op:
movss dst, src
Element 1: (requires SSE 3)
movshdup dst, src
Element 2:
movhlps dst, src
Element 3:
movaps dst, src
shufps dst, dst, 0xff ; 3 3 3 3
Build a vector from the low elements of four XMM registers:
unpcklps xmm0, xmm1 ; xmm0 => 0 1 ? ?
unpcklps xmm2, xmm3 ; xmm2 => 2 3 ? ?
movlhps xmm0, xmm2 ; xmm0 => 0 1 2 3
Broadcast one element of src to every element of dst:
movaps dst, src
shufps dst, dst, n
Where n selects the element:
element | n |
0 | 0x00 |
1 | 0x55 |
2 | 0xaa |
3 | 0xff |
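The immediates in this table repeat one 2-bit element index four times (0x55 is 01 01 01 01 in binary). A hedged C sketch with the SSE intrinsic _mm_shuffle_ps, using hypothetical function names, looks like this. The immediate must be a compile-time constant, hence one function per element.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Broadcast element 1 of v to all four elements (shufps with n = 0x55). */
static inline __m128 broadcast1_ps(__m128 v)
{
    return _mm_shuffle_ps(v, v, 0x55);
}

/* Broadcast element 3 of v to all four elements (shufps with n = 0xff). */
static inline __m128 broadcast3_ps(__m128 v)
{
    return _mm_shuffle_ps(v, v, 0xff);
}
```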
Absolute value, using a vector constant negative_zeroes_f = { -0.0f, -0.0f, -0.0f, -0.0f } from memory:
movaps xmm0, negative_zeroes_f
andnps xmm0, xmm1
Negation, using the same vector constant negative_zeroes_f = { -0.0f, -0.0f, -0.0f, -0.0f } from memory:
movaps xmm0, negative_zeroes_f
xorps xmm0, xmm1
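Both sign-bit tricks rely on -0.0f having only the sign bit set. A minimal C sketch with SSE intrinsics follows; the names abs_ps and neg_ps are hypothetical, and _mm_set1_ps(-0.0f) stands in for the negative_zeroes_f constant.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Clear the sign bit of every element: andnps computes (~sign_mask) & x. */
static inline __m128 abs_ps(__m128 x)
{
    return _mm_andnot_ps(_mm_set1_ps(-0.0f), x);
}

/* Flip the sign bit of every element: xorps computes sign_mask ^ x. */
static inline __m128 neg_ps(__m128 x)
{
    return _mm_xor_ps(_mm_set1_ps(-0.0f), x);
}
```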
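Horizontal sum of the four elements of xmm0 (xmm1 is used as scratch; every element of xmm0 ends up holding the total):
movaps xmm1, xmm0
shufps xmm0, xmm1, 0xb1 ; 1 0 3 2
addps xmm0, xmm1
movaps xmm1, xmm0
shufps xmm0, xmm0, 0x0a ; 2 2 0 0
addps xmm0, xmm1
Replace addps with another element-wise single-precision instruction (for example mulps or maxps) to perform that operation horizontally.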
Blend two vectors through a mask:
; mask is in xmm0 (destroyed)
; if-true is in xmm1 (destroyed)
; if-false is in xmm2
; blended result is in xmm0
andps xmm1, xmm0
andnps xmm0, xmm2
orps xmm0, xmm1
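Test whether any mask element is set:
movmskps eax, xmm0
test eax, eax
jnz any
Test whether no mask element is set:
movmskps eax, xmm0
test eax, eax
jz none
Test whether every mask element is set:
movmskps eax, xmm0
not eax
test eax, 0xf
jz all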
order | code |
0 0 2 2 | movsldup dst, src |
1 1 3 3 | movshdup dst, src |
0 1 0 1 | movlhps dst, dst |
2 3 2 3 | movhlps dst, dst |
0 0 1 1 | unpcklps dst, dst |
2 2 3 3 | unpckhps dst, dst |
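Fill a register with all zero bits (double-precision):
xorpd xmm0, xmm0
Fill a register with all one bits:
cmpnltpd xmm0, xmm0
cmpeqpd cannot be used on its own because NaN != NaN, so any element that held a NaN would be left as zero. Note that cmpnltpd will raise the Invalid floating-point exception if the register being filled contains NaNs. If you care about the floating-point environment state, clear the register first:
xorpd xmm0, xmm0
cmpeqpd xmm0, xmm0
Bitwise NOT of xmm1 into xmm0:
cmpnltpd xmm0, xmm0
xorpd xmm0, xmm1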
Move element n of xmm1 to the low element of xmm0:
Element 0 is a no-op:
movsd xmm0, xmm1
Element 1:
movapd xmm0, xmm1
unpckhpd xmm0, xmm0
Build a vector from the low elements of two XMM registers:
unpcklpd xmm0, xmm1
Broadcast one element of xmm1 to both elements of xmm0:
Element 0: (requires SSE 3)
movddup xmm0, xmm1
Element 1:
movapd xmm0, xmm1
unpckhpd xmm0, xmm0
Absolute value, using a vector constant negative_zeroes_d = { -0.0, -0.0 } from memory:
movapd xmm0, negative_zeroes_d
andnpd xmm0, xmm1
Negation, using the same vector constant negative_zeroes_d = { -0.0, -0.0 } from memory:
movapd xmm0, negative_zeroes_d
xorpd xmm0, xmm1
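Horizontal sum of the two elements of xmm0 (xmm1 is used as scratch; the total ends up in the low element of xmm0):
movapd xmm1, xmm0
unpckhpd xmm1, xmm1
addsd xmm0, xmm1
Replace addsd with another scalar double-precision instruction (for example mulsd or maxsd) to perform that operation horizontally.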
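A C sketch of the same reduction with SSE2 intrinsics might look like the following; the name hsum_pd is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Add the two double-precision elements of v and return the scalar total. */
static inline double hsum_pd(__m128d v)
{
    __m128d hi = _mm_unpackhi_pd(v, v);       /* unpckhpd: 1 1 */
    return _mm_cvtsd_f64(_mm_add_sd(v, hi));  /* addsd, then read the low element */
}
```

Only the low element of the intermediate result is meaningful, which is why the scalar addsd is enough.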
Blend two vectors through a mask:
; mask is in xmm0 (destroyed)
; if-true is in xmm1 (destroyed)
; if-false is in xmm2
; blended result is in xmm0
andpd xmm1, xmm0
andnpd xmm0, xmm2
orpd xmm0, xmm1
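Test whether any mask element is set:
movmskpd eax, xmm0
test eax, eax
jnz any
Test whether no mask element is set:
movmskpd eax, xmm0
test eax, eax
jz none
Test whether every mask element is set:
movmskpd eax, xmm0
not eax
test eax, 0x3
jz all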
order | code |
0 0 | unpcklpd dst, dst or movddup dst, src |
1 1 | unpckhpd dst, dst |
For full details, consult Intel's or AMD's instruction set reference documentation.
This revision created on Tue, 29 Sep 2009 00:29:54 by jckarter