SSE

This is a quick reference for Intel's Streaming SIMD Extensions. Feel free to make additions or corrections!

Vector types

The vector types here are named with the same convention as in Factor's SIMD library. Each name gives the element type followed by the number of elements that fit in a 128-bit register:

char-16
uchar-16
short-8
ushort-8
int-4
uint-4
longlong-2
ulonglong-2
float-4
double-2

Instruction set

The number next to each instruction is the SSE version:

1: SSE
2: SSE2
3: SSE3
3.3: SSSE3
4.1: SSE4.1
4.2: SSE4.2

	char-16	uchar-16	short-8	ushort-8	int-4	uint-4	longlong-2	ulonglong-2	float-4	double-2
move^*	MOVDQ[AU] 2	MOVDQ[AU] 2	MOVDQ[AU] 2	MOVDQ[AU] 2	MOVDQ[AU] 2	MOVDQ[AU] 2	MOVDQ[AU] 2	MOVDQ[AU] 2	MOV[AU]PS 1	MOV[AU]PD 2
add	PADDB 2	PADDB 2	PADDW 2	PADDW 2	PADDD 2	PADDD 2	PADDQ 2	PADDQ 2	ADDPS 1	ADDPD 2
subtract	PSUBB 2	PSUBB 2	PSUBW 2	PSUBW 2	PSUBD 2	PSUBD 2	PSUBQ 2	PSUBQ 2	SUBPS 1	SUBPD 2
saturated add	PADDSB 2	PADDUSB 2	PADDSW 2	PADDUSW 2
saturated subtract	PSUBSB 2	PSUBUSB 2	PSUBSW 2	PSUBUSW 2
add-subtract									ADDSUBPS 3	ADDSUBPD 3
horizontal add			PHADDW 3.3	PHADDW 3.3	PHADDD 3.3	PHADDD 3.3			HADDPS 3	HADDPD 3
multiply			PMULLW 2	PMULLW 2	PMULLD 4.1	PMULLD 4.1			MULPS 1	MULPD 2
divide									DIVPS 1	DIVPD 2
absolute value	PABSB 3.3		PABSW 3.3		PABSD 3.3
minimum	PMINSB 4.1	PMINUB 2	PMINSW 2	PMINUW 4.1	PMINSD 4.1	PMINUD 4.1			MINPS 1	MINPD 2
maximum	PMAXSB 4.1	PMAXUB 2	PMAXSW 2	PMAXUW 4.1	PMAXSD 4.1	PMAXUD 4.1			MAXPS 1	MAXPD 2
approx reciprocal									RCPPS 1
square root									SQRTPS 1	SQRTPD 2
comparison	PCMPxxB^† 2	PCMPxxB^† 2	PCMPxxW^† 2	PCMPxxW^† 2	PCMPxxD^† 2	PCMPxxD^† 2			CMPxxxPS^‡ 1	CMPxxxPD^‡ 2
bitwise and	PAND 2	PAND 2	PAND 2	PAND 2	PAND 2	PAND 2	PAND 2	PAND 2	ANDPS 1	ANDPD 2
bitwise and-not	PANDN 2	PANDN 2	PANDN 2	PANDN 2	PANDN 2	PANDN 2	PANDN 2	PANDN 2	ANDNPS 1	ANDNPD 2
bitwise or	POR 2	POR 2	POR 2	POR 2	POR 2	POR 2	POR 2	POR 2	ORPS 1	ORPD 2
bitwise xor	PXOR 2	PXOR 2	PXOR 2	PXOR 2	PXOR 2	PXOR 2	PXOR 2	PXOR 2	XORPS 1	XORPD 2
bitwise test	PTEST 4.1	PTEST 4.1	PTEST 4.1	PTEST 4.1	PTEST 4.1	PTEST 4.1	PTEST 4.1	PTEST 4.1
load mask	PMOVMSKB 2	PMOVMSKB 2	PMOVMSKB 2	PMOVMSKB 2	PMOVMSKB 2	PMOVMSKB 2	PMOVMSKB 2	PMOVMSKB 2	MOVMSKPS 1	MOVMSKPD 2
shift left			PSLLW 2	PSLLW 2	PSLLD 2	PSLLD 2	PSLLQ 2	PSLLQ 2
shift right			PSRAW 2	PSRLW 2	PSRAD 2	PSRLD 2		PSRLQ 2
unpack low	PUNPCKLBW 2	PUNPCKLBW 2	PUNPCKLWD 2	PUNPCKLWD 2	PUNPCKLDQ 2	PUNPCKLDQ 2	PUNPCKLQDQ 2	PUNPCKLQDQ 2	UNPCKLPS 1	UNPCKLPD 2
unpack high	PUNPCKHBW 2	PUNPCKHBW 2	PUNPCKHWD 2	PUNPCKHWD 2	PUNPCKHDQ 2	PUNPCKHDQ 2	PUNPCKHQDQ 2	PUNPCKHQDQ 2	UNPCKHPS 1	UNPCKHPD 2
static shuffle^§			PSHUF[HL]W^‖ 2	PSHUF[HL]W^‖ 2	PSHUFD 2	PSHUFD 2	PSHUFD 2	PSHUFD 2	SHUFPS^¶ 1	SHUFPD^¶ 2
variable shuffle	PSHUFB 3.3	PSHUFB 3.3	PSHUFB 3.3	PSHUFB 3.3	PSHUFB 3.3	PSHUFB 3.3	PSHUFB 3.3	PSHUFB 3.3
static blend			PBLENDW 4.1	PBLENDW 4.1	PBLENDW 4.1	PBLENDW 4.1	PBLENDW 4.1	PBLENDW 4.1	BLENDPS 4.1	BLENDPD 4.1
variable blend^#	PBLENDVB 4.1	PBLENDVB 4.1	PBLENDVB 4.1	PBLENDVB 4.1	PBLENDVB 4.1	PBLENDVB 4.1	PBLENDVB 4.1	PBLENDVB 4.1	BLENDVPS 4.1	BLENDVPD 4.1

Notes:

The SSE2 integer SIMD mnemonics are the same as the MMX mnemonics; however, using them with SSE XMM registers rather than MMX MM registers generates different instructions.
There are many more instructions that do not fit in this grid, but these are the most important ones to know.
* Every move instruction has an aligned (A) and unaligned (U) form. Aligned is faster, but will trap if your address is not a multiple of 16 bytes.
† Equality (PCMPEQ_) and signed greater-than (PCMPGT_) operations are provided for integer vectors. For signed less-than, swap the operands. For signed less-than-or-equal or greater-than-or-equal, perform the PCMPEQ and PCMPGT comparisons and POR the results together. For unsigned tests, first bias both inputs by PXORing 0x80, 0x8000, or 0x80000000 into every component, then use the signed comparison.
‡ The following floating-point comparison operations are provided: EQ, LT, LE, UNORD, NEQ, NLT, NLE, and ORD. To get greater-than comparisons, swap the operands. LT, LE, NLT, and NLE are signaling comparisons and will raise the Invalid floating-point exception if either input is a NaN; the other four raise it only for a signaling NaN.
§ Some shuffle patterns for some vector types can be achieved with specialized instructions that may have better performance or code size than the generalized shuffle instruction. See "Special shuffles" under each vector type below.
‖ 16-bit element shuffles only shuffle half of the register at a time.
¶ Floating-point shuffles select the high element(s) from the source register and the low element(s) from the destination. To shuffle a single vector, use the same register for source and destination.
# Variable blends take the blend mask from XMM0 as an implicit operand.
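
The integer comparison note (†) can be sketched in C with SSE2 intrinsics, which compile to the same PXOR, PCMPGTD, PCMPEQD, and POR instructions. This is only an illustration under assumptions: the helper names `cmpgt_epu32`, `cmpge_epi32`, and `lane_mask` are hypothetical and not part of any library.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: unsigned a > b for four 32-bit lanes.
   XORing 0x80000000 into both inputs maps unsigned order onto signed
   order, so the signed PCMPGTD then gives the unsigned answer. */
static __m128i cmpgt_epu32(__m128i a, __m128i b)
{
    const __m128i bias = _mm_set1_epi32(INT32_MIN);  /* 0x80000000 per lane */
    return _mm_cmpgt_epi32(_mm_xor_si128(a, bias), _mm_xor_si128(b, bias));
}

/* Hypothetical helper: signed a >= b, built as (a == b) | (a > b). */
static __m128i cmpge_epi32(__m128i a, __m128i b)
{
    return _mm_or_si128(_mm_cmpeq_epi32(a, b), _mm_cmpgt_epi32(a, b));
}

/* Hypothetical helper: collect the top bit of each 32-bit lane
   into bits 0..3 of an int (MOVMSKPS). */
static int lane_mask(__m128i v)
{
    return _mm_movemask_ps(_mm_castsi128_ps(v));
}
```

Signed less-than needs no helper: `_mm_cmpgt_epi32(b, a)` is the swapped-operand form described above.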

Idioms

all integer types

Clear all bits

pxor xmm0, xmm0

Set all bits

pcmpeqb xmm0, xmm0

Blend without SSE 4.1

; mask is in xmm0 (destroyed)
; if-true is in xmm1 (destroyed)
; if-false is in xmm2
; blended result is in xmm0
pand   xmm1, xmm0
pandn  xmm0, xmm2
por    xmm0, xmm1
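
The same blend can be written in C with SSE2 intrinsics. This is a minimal sketch; `blend_si128` is a hypothetical name, and it assumes every mask lane is either all ones or all zeros.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: per-bit select. Where mask bits are 1 the result
   comes from if_true, where they are 0 it comes from if_false. */
static __m128i blend_si128(__m128i mask, __m128i if_true, __m128i if_false)
{
    __m128i t = _mm_and_si128(if_true, mask);      /* pand:  if_true & mask   */
    __m128i f = _mm_andnot_si128(mask, if_false);  /* pandn: ~mask & if_false */
    return _mm_or_si128(t, f);                     /* por                     */
}
```

Note that `_mm_andnot_si128` inverts its first argument, matching PANDN, which inverts its destination operand.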

Any test

Using PTEST predicate: (requires SSE 4.1)

ptest xmm0, xmm0
jnz any

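Using PMOVMSKB: (tests only the top bit of each byte, which is sufficient for comparison results)

pmovmskb eax, xmm0
test eax, eax
jnz any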
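
A C sketch of the PMOVMSKB approach, using the SSE2 intrinsic `_mm_movemask_epi8`. The helper names are hypothetical. PMOVMSKB only looks at the top bit of each byte, so `any_true` assumes its input is a comparison result; `any_bit_set` is an assumed variant for arbitrary bit patterns.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: nonzero if any byte of v has its top bit set.
   For a comparison result (lanes all ones or all zeros) this means
   "any lane is true". */
static int any_true(__m128i v)
{
    return _mm_movemask_epi8(v) != 0;  /* pmovmskb, then test */
}

/* Hypothetical variant for arbitrary data: compare every byte with zero,
   then check whether all 16 bytes matched. */
static int any_bit_set(__m128i v)
{
    __m128i is_zero = _mm_cmpeq_epi8(v, _mm_setzero_si128());
    return _mm_movemask_epi8(is_zero) != 0xFFFF;
}
```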

int-4

Select nth component

Directly to a GPR: (requires SSE 4.1)

pextrd eax, xmm0, n

To low element of an XMM register:

pshufd xmm0, xmm1, n

Use movd eax, xmm0 to move the selected element to a GPR.
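
The SSE2 route (PSHUFD followed by MOVD) might look like this in C. `SELECT_EPI32` is a hypothetical macro; it is a macro rather than a function because the shuffle control must be a compile-time constant.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical macro: return 32-bit element n (0..3) of v as an int.
   With control byte n, PSHUFD places element n in the low lane; the other
   lanes receive element 0 and are ignored. MOVD then copies the low lane
   to a general-purpose register. */
#define SELECT_EPI32(v, n) _mm_cvtsi128_si32(_mm_shuffle_epi32((v), (n)))
```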

Gather four integers into a vector

Directly from GPRs: (requires SSE 4.1)

pinsrd xmm0, r8d, 0
pinsrd xmm0, r9d, 1
pinsrd xmm0, r10d, 2
pinsrd xmm0, r11d, 3

From low elements of XMM registers:

punpckldq xmm0, xmm1  ; xmm0 => 0 1 ? ?
punpckldq xmm2, xmm3  ; xmm2 => 2 3 ? ?
punpcklqdq xmm0, xmm2 ; xmm0 => 0 1 2 3

Use movd xmm0, eax to load the low element from a GPR.
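
As a rough C equivalent of the SSE2 gather, assuming the four values start in ordinary integers (`gather4_epi32` is a hypothetical name):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: build the vector a b c d (element 0 first). */
static __m128i gather4_epi32(int a, int b, int c, int d)
{
    __m128i x0 = _mm_cvtsi32_si128(a);        /* movd: a 0 0 0 */
    __m128i x1 = _mm_cvtsi32_si128(b);
    __m128i x2 = _mm_cvtsi32_si128(c);
    __m128i x3 = _mm_cvtsi32_si128(d);
    __m128i lo = _mm_unpacklo_epi32(x0, x1);  /* punpckldq:  a b 0 0 */
    __m128i hi = _mm_unpacklo_epi32(x2, x3);  /* punpckldq:  c d 0 0 */
    return _mm_unpacklo_epi64(lo, hi);        /* punpcklqdq: a b c d */
}
```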

Horizontal add without SSSE3

pshufd xmm0, xmm1, 0xb1 ; 1 0 3 2
paddd xmm0, xmm1
pshufd xmm1, xmm0, 0x0a ; 2 2 0 0
paddd xmm0, xmm1

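Replace paddd with any commutative, associative 32-bit integer operation (such as pand, por, or pxor) to perform it horizontally. Every element of xmm0 ends up holding the result.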
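
The two shuffle-and-add steps can be sketched in C with SSE2 intrinsics; `hsum_epi32` is a hypothetical name.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: sum of the four 32-bit elements of v. */
static int hsum_epi32(__m128i v)
{
    __m128i swapped = _mm_shuffle_epi32(v, 0xb1);      /* 1 0 3 2 */
    __m128i pairs   = _mm_add_epi32(v, swapped);       /* v0+v1 twice, v2+v3 twice */
    __m128i crossed = _mm_shuffle_epi32(pairs, 0x0a);  /* 2 2 0 0 */
    __m128i total   = _mm_add_epi32(pairs, crossed);   /* full sum in every lane */
    return _mm_cvtsi128_si32(total);                   /* movd: read the low lane */
}
```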

Special shuffles

pshufd writes the entire destination register from the source, unlike shufps and shufpd, which select half of the result from the destination and half from the source. Because of this, pshufd usually does not need an initial movdqa. The specialized instructions below overwrite their destination, so they need a preceding move to be nondestructive; pshufd will therefore be a size win over them unless you are destructively replacing your input.

order	code
0 0 1 1	`punpckldq dst, dst`
2 2 3 3	`punpckhdq dst, dst`
0 1 0 1	`punpcklqdq dst, dst`
2 3 2 3	`punpckhqdq dst, dst`

float-4

Clear all bits

xorps xmm0, xmm0

Set all bits

cmpnltps xmm0, xmm0

cmpeqps cannot be used because NaN != NaN.
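
The difference is easy to observe from C with SSE intrinsics. In this sketch all function names are hypothetical, and `with_nans` is just an assumed sample input.

```c
#include <emmintrin.h>  /* SSE and SSE2 intrinsics */
#include <math.h>       /* NAN */

/* Hypothetical sample input: elements 0 and 2 are NaN. */
static __m128 with_nans(void)
{
    return _mm_set_ps(1.0f, NAN, 2.0f, NAN);
}

/* Compare v with itself using CMPEQPS; NaN lanes compare false. */
static int self_cmpeq_mask(__m128 v)
{
    return _mm_movemask_ps(_mm_cmpeq_ps(v, v));
}

/* Compare v with itself using CMPNLTPS; "not less than" is true for
   every lane, including NaN lanes. */
static int self_cmpnlt_mask(__m128 v)
{
    return _mm_movemask_ps(_mm_cmpnlt_ps(v, v));
}
```

With the sample input, `self_cmpeq_mask` leaves the two NaN lanes clear while `self_cmpnlt_mask` sets all four lanes.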

Select nth component

Element 0 is already in the low position, so no shuffle is needed:

movss dst, src

Element 1: (requires SSE3)

movshdup dst, src

Element 2:

movhlps dst, src

Element 3:

movaps dst, src
shufps dst, dst, 0xff ; 3 3 3 3

Gather four floats into a vector

unpcklps xmm0, xmm1 ; xmm0 => 0 1 ? ?
unpcklps xmm2, xmm3 ; xmm2 => 2 3 ? ?
movlhps  xmm0, xmm2 ; xmm0 => 0 1 2 3
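
A C sketch of the same gather, assuming the four values start as scalar floats (`gather4_ps` is a hypothetical name):

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: build the vector a b c d (element 0 first). */
static __m128 gather4_ps(float a, float b, float c, float d)
{
    __m128 x0 = _mm_set_ss(a);             /* a 0 0 0 */
    __m128 x1 = _mm_set_ss(b);
    __m128 x2 = _mm_set_ss(c);
    __m128 x3 = _mm_set_ss(d);
    __m128 lo = _mm_unpacklo_ps(x0, x1);   /* unpcklps: a b 0 0 */
    __m128 hi = _mm_unpacklo_ps(x2, x3);   /* unpcklps: c d 0 0 */
    return _mm_movelh_ps(lo, hi);          /* movlhps:  a b c d */
}
```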

Broadcast float into four components

movaps dst, src
shufps dst, dst, n

Where n selects the element:

element	`n`
0	`0x00`
1	`0x55`
2	`0xaa`
3	`0xff`
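
The four control bytes in the table are multiples of 0x55, which a C sketch can exploit. `BROADCAST_PS` is a hypothetical macro; it evaluates `v` twice and requires `n` to be a compile-time constant from 0 to 3.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical macro: copy element n of v into all four lanes.
   n * 0x55 repeats the 2-bit index n in every field of the SHUFPS
   control byte: 0x00, 0x55, 0xaa, or 0xff. */
#define BROADCAST_PS(v, n) _mm_shuffle_ps((v), (v), (n) * 0x55)
```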

Absolute value
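
This entry has no code in the reference. One common approach, offered here as an assumption rather than the author's intended idiom, is to clear the sign bit of every element with ANDNPS. A minimal C sketch; `abs_ps` is a hypothetical name:

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: absolute value of each of the four floats.
   -0.0f has only the sign bit set, so ~sign & v clears the sign bit
   and leaves exponent and mantissa untouched. */
static __m128 abs_ps(__m128 v)
{
    const __m128 sign = _mm_set1_ps(-0.0f);
    return _mm_andnot_ps(sign, v);  /* andnps: ~sign & v */
}
```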

Horizontal add without SSE3

movaps xmm1, xmm0
shufps xmm0, xmm1, 0xb1 ; 1 0 3 2
addps xmm0, xmm1
movaps xmm1, xmm0
shufps xmm0, xmm0, 0x0a ; 2 2 0 0
addps xmm0, xmm1

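Replace addps with any commutative, associative packed single-precision operation (such as mulps, minps, or maxps) to perform it horizontally. Every element of xmm0 ends up holding the result.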
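
In C with SSE intrinsics, the same two steps might be written as follows. `hsum_ps` and `hmax_ps` are hypothetical names; the second shows the operation swapped for MAXPS.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: sum of the four floats in v. */
static float hsum_ps(__m128 v)
{
    __m128 swapped = _mm_shuffle_ps(v, v, 0xb1);          /* 1 0 3 2 */
    __m128 pairs   = _mm_add_ps(v, swapped);              /* v0+v1 twice, v2+v3 twice */
    __m128 crossed = _mm_shuffle_ps(pairs, pairs, 0x0a);  /* 2 2 0 0 */
    return _mm_cvtss_f32(_mm_add_ps(pairs, crossed));     /* read the low lane */
}

/* Hypothetical helper: same pattern with MAXPS instead of ADDPS. */
static float hmax_ps(__m128 v)
{
    __m128 swapped = _mm_shuffle_ps(v, v, 0xb1);
    __m128 pairs   = _mm_max_ps(v, swapped);
    __m128 crossed = _mm_shuffle_ps(pairs, pairs, 0x0a);
    return _mm_cvtss_f32(_mm_max_ps(pairs, crossed));
}
```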

Blend without SSE 4.1

; mask is in xmm0 (destroyed)
; if-true is in xmm1 (destroyed)
; if-false is in xmm2
; blended result is in xmm0
andps  xmm1, xmm0
andnps xmm0, xmm2
orps   xmm0, xmm1

Special shuffles

order	code
0 0 2 2	`movsldup dst, src` (requires SSE3)
1 1 3 3	`movshdup dst, src` (requires SSE3)
0 1 0 1	`movlhps dst, dst`
2 3 2 3	`movhlps dst, dst`
0 0 1 1	`unpcklps dst, dst`
2 2 3 3	`unpckhps dst, dst`

double-2

Clear all bits

xorpd xmm0, xmm0

Set all bits

cmpnltpd xmm0, xmm0

cmpeqpd cannot be used because NaN != NaN.

Select nth component

Element 0 is already in the low position, so no shuffle is needed:

movsd xmm0, xmm1

Element 1:

movapd xmm0, xmm1
unpckhpd xmm0, xmm0

Gather two doubles into a vector

unpcklpd xmm0, xmm1

Broadcast double into two components

Element 0: (requires SSE3)

movddup xmm0, xmm1

Element 1:

movapd xmm0, xmm1
unpckhpd xmm0, xmm0

Absolute value
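
This entry has no code in the reference. One common approach, offered here as an assumption rather than the author's intended idiom, is to clear the sign bit of both elements with ANDNPD. A minimal C sketch; `abs_pd` is a hypothetical name:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: absolute value of each of the two doubles.
   -0.0 has only the sign bit set, so ~sign & v clears the sign bit. */
static __m128d abs_pd(__m128d v)
{
    const __m128d sign = _mm_set1_pd(-0.0);
    return _mm_andnot_pd(sign, v);  /* andnpd: ~sign & v */
}
```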

Horizontal add without SSE3

movapd xmm1, xmm0
unpckhpd xmm1, xmm1
addsd xmm0, xmm1

Replace addsd with any other scalar double-precision operation (such as mulsd, minsd, or maxsd) to perform it horizontally. The result is in the low element of xmm0.
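
A C sketch of the same idiom with SSE2 intrinsics; `hsum_pd` is a hypothetical name.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: sum of the two doubles in v. */
static double hsum_pd(__m128d v)
{
    __m128d high = _mm_unpackhi_pd(v, v);       /* unpckhpd: 1 1 */
    return _mm_cvtsd_f64(_mm_add_sd(v, high));  /* addsd, then read the low element */
}
```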

Blend without SSE 4.1

; mask is in xmm0 (destroyed)
; if-true is in xmm1 (destroyed)
; if-false is in xmm2
; blended result is in xmm0
andpd  xmm1, xmm0
andnpd xmm0, xmm2
orpd   xmm0, xmm1

Special shuffles

order	code
0 0	`unpcklpd dst, dst` or `movddup dst, src` (movddup requires SSE3)
1 1	`unpckhpd dst, dst`

References

For full details, consult Intel's or AMD's instruction set reference documentation.

This revision created on Mon, 28 Sep 2009 21:07:01 by jckarter
