
# SSE

This is a quick reference for Intel's Streaming SIMD Extensions. Feel free to make additions or corrections!

# Vector types

The vector types here are named with the same convention as in Factor's SIMD library: the element type followed by the number of elements that fit in one 128-bit XMM register.

• char-16
• uchar-16
• short-8
• ushort-8
• int-4
• uint-4
• longlong-2
• ulonglong-2
• float-4
• double-2
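To make the naming concrete, here is a sketch in C that views one 128-bit register as each of the types above through a union. It assumes GCC or Clang on x86-64. The `xmm_view` union and the `float4_bits` helper are hypothetical names for illustration. `__m128`, `__m128d`, and `__m128i` are the standard intrinsic types from `<emmintrin.h>`.

```c
#include <emmintrin.h> /* SSE2 intrinsic types: __m128, __m128d, __m128i */
#include <stdint.h>

/* Hypothetical union: one XMM register seen as every vector type above.
   Each member name is <element type><lane count>. */
typedef union {
    int8_t   char16[16];
    uint8_t  uchar16[16];
    int16_t  short8[8];
    uint16_t ushort8[8];
    int32_t  int4[4];
    uint32_t uint4[4];
    int64_t  longlong2[2];
    uint64_t ulonglong2[2];
    float    float4[4];
    double   double2[2];
    __m128i  i;  /* integer view, used by the P* instructions */
    __m128   ps; /* float-4 view, used by the *PS instructions */
    __m128d  pd; /* double-2 view, used by the *PD instructions */
} xmm_view;

/* Every view covers exactly 128 bits. */
_Static_assert(sizeof(xmm_view) == 16, "an XMM register holds 16 bytes");

/* Reinterpret (not convert) a broadcast float as uint-4 and read one lane. */
static uint32_t float4_bits(float x, int lane) {
    xmm_view v;
    v.ps = _mm_set1_ps(x);
    return v.uint4[lane];
}
```

Reading `float4_bits(1.0f, 0)` gives the raw IEEE 754 bit pattern `0x3F800000`. The register contents do not change; only the interpretation does.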

# Instruction set

The number next to each instruction is the SSE version:

Notes:

• The SSE2 integer SIMD mnemonics are the same as the MMX mnemonics; however, using them with SSE XMM registers rather than MMX MM registers generates different instructions.
• There are many more instructions that do not fit in this grid, but these are the most important ones to know.
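• * Every full-width move instruction has an aligned (A) and an unaligned (U) form: MOVAPS/MOVUPS, MOVAPD/MOVUPD, and MOVDQA/MOVDQU. The aligned form raises a general-protection fault if the address is not a multiple of 16 bytes. It used to be faster; on many newer processors the unaligned form costs the same when the address happens to be aligned.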
• † Equality (PCMPEQ_) and signed greater-than (PCMPGT_) operations are provided for integer vectors. For signed less-than, swap the operands. For signed greater-than-or-equal, compute both PCMPEQ and PCMPGT and POR the results together (swap the operands for less-than-or-equal). For unsigned comparisons, first PXOR every component of both inputs with 0x80, 0x8000, or 0x80000000 (for 8-, 16-, or 32-bit components), then use the signed comparisons.
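• ‡ The following floating-point comparison predicates are provided: EQ, LT, LE, UNORD, NEQ, NLT, NLE, and ORD. For greater-than comparisons, swap the operands of LT or LE. LT, LE, NLT, and NLE are signaling comparisons: they raise the Invalid floating-point exception if either input is a NaN, even a quiet one, whereas EQ, NEQ, UNORD, and ORD raise it only for signaling NaNs. (The exception is masked by default, so normally it only sets a flag in MXCSR.) Note also that NLT and NLE return true when either input is a NaN, so NLT is not the same as greater-than-or-equal.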
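The aligned/unaligned distinction looks like this from C, assuming GCC or Clang on x86-64. `_mm_load_ps` typically compiles to MOVAPS and `_mm_loadu_ps` to MOVUPS, although an optimizing compiler may fold the load into another instruction. The `sum4` helper and `data` array are hypothetical. The run-time alignment check is only there to exercise both forms; real code would normally pick one.

```c
#include <stdint.h>    /* uintptr_t */
#include <xmmintrin.h> /* SSE */

/* data is 16-byte aligned, so data + 1 is only 4-byte aligned. */
static _Alignas(16) float data[8] = {1, 2, 3, 4, 5, 6, 7, 8};

/* Hypothetical helper: add four floats, using the aligned load only
   when the pointer is a multiple of 16. */
static float sum4(const float *p) {
    __m128 v;
    if (((uintptr_t)p & 15) == 0)
        v = _mm_load_ps(p);  /* typically MOVAPS: faults if misaligned */
    else
        v = _mm_loadu_ps(p); /* typically MOVUPS: any address is fine */
    float out[4];
    _mm_storeu_ps(out, v);
    return out[0] + out[1] + out[2] + out[3];
}
```

`sum4(data)` takes the aligned path and `sum4(data + 1)` takes the unaligned one. Calling `_mm_load_ps(data + 1)` directly would typically fault.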
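The integer and floating-point comparison tricks from these notes can be sketched with SSE2 intrinsics in C, assuming GCC or Clang on x86-64. The function names are hypothetical, and each comment names the instruction the intrinsic normally maps to. `mask4` uses MOVMSKPS to pack the four lane results into bits 0 to 3 so that they are easy to check.

```c
#include <emmintrin.h> /* SSE2 integer intrinsics */
#include <limits.h>    /* INT_MIN */

/* Signed a < b: PCMPGTD with the operands swapped. */
static __m128i cmplt_epi32(__m128i a, __m128i b) {
    return _mm_cmpgt_epi32(b, a);
}

/* Signed a >= b: PCMPGTD and PCMPEQD, combined with POR. */
static __m128i cmpge_epi32(__m128i a, __m128i b) {
    return _mm_or_si128(_mm_cmpgt_epi32(a, b), _mm_cmpeq_epi32(a, b));
}

/* Unsigned a > b: PXOR both inputs with 0x80000000 so unsigned order
   becomes signed order, then use the signed PCMPGTD. */
static __m128i cmpgt_epu32(__m128i a, __m128i b) {
    const __m128i bias = _mm_set1_epi32(INT_MIN); /* 0x80000000 per lane */
    return _mm_cmpgt_epi32(_mm_xor_si128(a, bias), _mm_xor_si128(b, bias));
}

/* Float a > b: CMPLTPS with the operands swapped. A NaN in either
   input gives false, as an ordered greater-than should. */
static __m128 cmpgt_ps_swapped(__m128 a, __m128 b) {
    return _mm_cmplt_ps(b, a);
}

/* Pack the top bit of each 32-bit lane into bits 0..3 (MOVMSKPS). */
static int mask4(__m128i m) {
    return _mm_movemask_ps(_mm_castsi128_ps(m));
}
```

Comparing the lanes {-1, 1, 0, 5} and {1, -1, 0, 4} with `cmpgt_epu32` shows why the bias is needed. Lane 0 is 0xFFFFFFFF > 1 as unsigned values, so it is true. A plain signed PCMPGTD would report -1 > 1 as false.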

# Idioms

## int-4

### Gather four integers into a vector

```
; input: integer n in the low dword of xmmN; lanes listed high to low
punpckldq xmm0, xmm1  ; xmm0 => ? ? 1 0
punpckldq xmm2, xmm3  ; xmm2 => ? ? 3 2
punpcklqdq xmm0, xmm2 ; xmm0 => 3 2 1 0
```
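The same gather can be written with intrinsics in C. This is a sketch assuming GCC or Clang on x86-64, and `gather4_epi32` and `lane_epi32` are hypothetical helpers. `_mm_cvtsi32_si128` is MOVD, which places an integer in the low lane and zeroes the rest, so the `?` lanes happen to be zero here. In practice `_mm_setr_epi32(i0, i1, i2, i3)` expresses the same thing and lets the compiler choose the sequence.

```c
#include <emmintrin.h> /* SSE2 */

/* Lanes in comments are listed high to low, as in the assembly above. */
static __m128i gather4_epi32(int i0, int i1, int i2, int i3) {
    __m128i x0 = _mm_cvtsi32_si128(i0);      /* MOVD:       0 0 0 0 */
    __m128i x1 = _mm_cvtsi32_si128(i1);
    __m128i x2 = _mm_cvtsi32_si128(i2);
    __m128i x3 = _mm_cvtsi32_si128(i3);
    __m128i lo = _mm_unpacklo_epi32(x0, x1); /* PUNPCKLDQ:  ? ? 1 0 */
    __m128i hi = _mm_unpacklo_epi32(x2, x3); /* PUNPCKLDQ:  ? ? 3 2 */
    return _mm_unpacklo_epi64(lo, hi);       /* PUNPCKLQDQ: 3 2 1 0 */
}

/* Read lane i (0 = lowest) by spilling the vector to memory. */
static int lane_epi32(__m128i v, int i) {
    int out[4];
    _mm_storeu_si128((__m128i *)out, v);
    return out[i];
}
```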

## float-4

### Gather four floats into a vector

```
movss    dst, src1   ; dst  => ? ? ? 1
unpcklps dst, src2   ; dst  => ? ? 2 1
unpcklps src3, src4  ; src3 => ? ? 4 3 (src3 is overwritten)
movlhps  dst, src3   ; dst  => 4 3 2 1
```
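The same float gather can be sketched in C with SSE intrinsics, assuming GCC or Clang on x86-64. `gather4_ps` and `lane_ps` are hypothetical names. `_mm_set_ss` stands in for MOVSS from a register and zeroes the upper lanes. `_mm_setr_ps(f1, f2, f3, f4)` is the usual one-call way to write this.

```c
#include <xmmintrin.h> /* SSE */

/* Lanes in comments are listed high to low. */
static __m128 gather4_ps(float f1, float f2, float f3, float f4) {
    __m128 dst = _mm_set_ss(f1);                /* MOVSS:    0  0  0  f1 */
    dst = _mm_unpacklo_ps(dst, _mm_set_ss(f2)); /* UNPCKLPS: 0  0  f2 f1 */
    __m128 s3 = _mm_set_ss(f3);
    s3 = _mm_unpacklo_ps(s3, _mm_set_ss(f4));   /* UNPCKLPS: 0  0  f4 f3 */
    return _mm_movelh_ps(dst, s3);              /* MOVLHPS:  f4 f3 f2 f1 */
}

/* Read lane i (0 = lowest) by spilling the vector to memory. */
static float lane_ps(__m128 v, int i) {
    float out[4];
    _mm_storeu_ps(out, v);
    return out[i];
}
```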

### Broadcast float into four components

```
movss  dst, src       ; dst => ? ? ? x
shufps dst, dst, 0x0  ; dst => x x x x
```
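In C the broadcast can be sketched as follows, assuming GCC or Clang on x86-64. `broadcast_ps` is a hypothetical name. `_mm_set1_ps(x)` does the same job and usually compiles to this or an equivalent shuffle.

```c
#include <xmmintrin.h> /* SSE */

static __m128 broadcast_ps(float x) {
    __m128 v = _mm_set_ss(x);          /* MOVSS:  0 0 0 x */
    return _mm_shuffle_ps(v, v, 0x00); /* SHUFPS: every lane takes lane 0 */
}
```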

### Absolute value

```
pcmpeqd xmm1, xmm1  ; xmm1 => all bits set
psrld   xmm1, 1     ; xmm1 => 0x7FFFFFFF in each component
andps   xmm0, xmm1  ; xmm0 => xmm0 with every sign bit cleared
```
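IEEE 754 floats store the sign in the top bit, so |x| is x with that bit cleared. Here is a C sketch of this mask approach, assuming GCC or Clang on x86-64, with `abs_ps` as a hypothetical name. It builds 0x7FFFFFFF in every lane and ANDs it in. The comments name the instruction each step corresponds to, but a compiler will usually fold the mask into a constant load. Clearing the bit needs no comparison, so it behaves the same for every input, including -0.0.

```c
#include <emmintrin.h> /* SSE2 */

static __m128 abs_ps(__m128 x) {
    __m128i ones = _mm_set1_epi32(-1);            /* PCMPEQD: all bits set */
    __m128i mask = _mm_srli_epi32(ones, 1);       /* PSRLD 1: 0x7FFFFFFF */
    return _mm_and_ps(x, _mm_castsi128_ps(mask)); /* ANDPS: clear sign bits */
}
```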

### Special shuffles

| order | code |
|-------|------|
| 0 0 2 2 | `movsldup dst, src` |
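MOVSLDUP is an SSE3 instruction. On plain SSE the same order, where result lanes 0 to 3 take source lanes 0, 0, 2, 2, needs a single SHUFPS. This C sketch compares both, assuming GCC or Clang on x86-64; the function names are hypothetical. The `target("sse3")` attribute enables the SSE3 intrinsic for that one function only.

```c
#include <pmmintrin.h> /* SSE3, also pulls in SSE and SSE2 */

/* MOVSLDUP: result lanes 0..3 = source lanes 0, 0, 2, 2. */
__attribute__((target("sse3")))
static __m128 dup_even_sse3(__m128 v) {
    return _mm_moveldup_ps(v);
}

/* Plain SSE: SHUFPS selects lanes (high to low) 2, 2, 0, 0. */
static __m128 dup_even_shufps(__m128 v) {
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 0, 0)); /* imm 0xA0 */
}
```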

## double-2

### Gather two doubles into a vector

```
movsd    dst, src1  ; dst => ? 1
unpcklpd dst, src2  ; dst => 2 1
```
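A C equivalent, as a sketch assuming GCC or Clang on x86-64. `gather2_pd` and `high_pd` are hypothetical helpers. `_mm_setr_pd(d1, d2)` is the usual one-call form.

```c
#include <emmintrin.h> /* SSE2 */

/* Lanes in comments are listed high to low. */
static __m128d gather2_pd(double d1, double d2) {
    __m128d dst = _mm_set_sd(d1);                /* MOVSD:    0  d1 */
    return _mm_unpacklo_pd(dst, _mm_set_sd(d2)); /* UNPCKLPD: d2 d1 */
}

/* Read the upper lane by moving it down with UNPCKHPD. */
static double high_pd(__m128d v) {
    return _mm_cvtsd_f64(_mm_unpackhi_pd(v, v));
}
```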

### Broadcast double into two components

`movddup dst, src`
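MOVDDUP is an SSE3 instruction. With only SSE2, the same broadcast can be done with UNPCKLPD of a register with itself. This C sketch shows both, assuming GCC or Clang on x86-64; the function names are hypothetical. To broadcast straight from memory, `_mm_loaddup_pd` is the SSE3 intrinsic for MOVDDUP with a memory operand.

```c
#include <pmmintrin.h> /* SSE3, also pulls in SSE2 */

/* MOVDDUP copies lane 0 into both lanes. target("sse3") lets GCC/Clang
   use the SSE3 intrinsic here without enabling SSE3 for the whole file. */
__attribute__((target("sse3")))
static __m128d broadcast_pd(double x) {
    return _mm_movedup_pd(_mm_set_sd(x));
}

/* SSE2-only alternative: UNPCKLPD a register with itself. */
static __m128d broadcast_pd_sse2(double x) {
    __m128d v = _mm_set_sd(x);
    return _mm_unpacklo_pd(v, v);
}
```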

### Absolute value

```
movapd xmm1, xmm0