SSE4
SSE4 (Streaming SIMD Extensions 4) is a
Like earlier generations of CPU SIMD instruction sets, SSE4 supports up to 16 registers, each 128 bits wide, which can hold four 32-bit integers, four 32-bit single-precision floating-point numbers, or two 64-bit double-precision floating-point numbers.[1] SIMD operations, such as element-wise vector addition and multiplication or vector-scalar addition and multiplication, process multiple data elements in a single CPU instruction. Operating on several elements in parallel can noticeably increase performance. SSE4.2 introduced new SIMD string operations, including an instruction that compares two string fragments of up to 16 bytes each.[1] SSE4.2 is a subset of SSE4 and was released a few years after the initial release of SSE4.
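The lane layout described above can be illustrated with a small scalar model. This is only a sketch in Python, not SIMD code: it treats a 128-bit register as one integer holding four 32-bit lanes and adds them lane by lane. The helper names (`pack_u32x4`, `unpack_u32x4`, `packed_add_u32`) are hypothetical and chosen for this illustration.

```python
MASK32 = 0xFFFFFFFF

def pack_u32x4(lanes):
    """Pack four 32-bit lanes (lane 0 in the low bits) into one 128-bit integer."""
    assert len(lanes) == 4
    value = 0
    for i, lane in enumerate(lanes):
        value |= (lane & MASK32) << (32 * i)
    return value

def unpack_u32x4(value):
    """Split a 128-bit integer back into its four 32-bit lanes."""
    return [(value >> (32 * i)) & MASK32 for i in range(4)]

def packed_add_u32(a, b):
    """Lane-wise 32-bit addition with wraparound; no carry crosses a lane boundary."""
    lanes = [(x + y) & MASK32 for x, y in zip(unpack_u32x4(a), unpack_u32x4(b))]
    return pack_u32x4(lanes)
```

On real hardware the four additions are carried out by a single instruction; the model only shows that each lane is computed independently of its neighbours.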
SSE4 subsets
Intel SSE4 consists of 54 instructions. A subset consisting of 47 instructions, referred to as SSE4.1 in some Intel documentation, is available in
Starting with
Name confusion
What is now known as
Intel uses the marketing term HD Boost to refer to SSE4.[8]
New instructions
Unlike all previous iterations of SSE, SSE4 contains instructions that execute operations which are not specific to multimedia applications. It features a number of instructions whose action is determined by a constant field and a set of instructions that take XMM0 as an implicit third operand.
Several of these instructions are enabled by the single-cycle shuffle engine in Penryn. (Shuffle operations reorder bytes within a register.)
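The two operand styles mentioned above, a constant (immediate) field versus XMM0 as an implicit third operand, can be sketched with the blend instructions as an example. The following Python model is an illustration under stated assumptions: lanes are represented as raw 32-bit patterns in a list, the function names mirror the mnemonics but are ordinary Python functions, and the variable form is assumed to test the most significant bit of each XMM0 lane.

```python
def blendps(dest, src, imm4):
    """Immediate form: bit i of imm4 selects src lane i; otherwise dest lane i is kept."""
    return [s if (imm4 >> i) & 1 else d
            for i, (d, s) in enumerate(zip(dest, src))]

def blendvps(dest, src, xmm0):
    """Variable form: bit 31 of each XMM0 lane makes the same choice at run time."""
    return [s if (m >> 31) & 1 else d
            for d, s, m in zip(dest, src, xmm0)]
```

With the immediate form the selection pattern is fixed when the code is assembled, while the XMM0 form lets an earlier comparison decide it per element.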
SSE4.1
These instructions were introduced with
Instruction | Description |
---|---|
MPSADBW | Compute eight offset sums of absolute differences, four at a time (i.e., \|x0−y0\|+\|x1−y1\|+\|x2−y2\|+\|x3−y3\|, \|x0−y1\|+\|x1−y2\|+\|x2−y3\|+\|x3−y4\|, ..., \|x0−y7\|+\|x1−y8\|+\|x2−y9\|+\|x3−y10\|); this operation is important for some HD codecs, and allows an 8×8 block difference to be computed in fewer than seven cycles.[9] One bit of a three-bit immediate operand indicates whether y0..y10 or y4..y14 should be used from the destination operand, the other two whether x0..x3, x4..x7, x8..x11 or x12..x15 should be used from the source. |
PHMINPOSUW | Sets the bottom unsigned 16-bit word of the destination to the smallest unsigned 16-bit word in the source, and the next-from-bottom to the index of that word in the source. |
PMULDQ | Packed 32-bit signed "long" multiplication, two (1st and 3rd) out of four packed integers multiplied giving two packed 64-bit results. |
PMULLD | Packed 32-bit signed "low" multiplication, four packed sets of integers multiplied giving four packed 32-bit results. |
DPPS, DPPD | Dot product for AOS (Array of Structs) data. This takes an immediate operand consisting of four (or two for DPPD) bits to select which of the entries in the input to multiply and accumulate, and another four (or two for DPPD) to select whether to put 0 or the dot-product in the appropriate field of the output. |
BLENDPS, BLENDPD, BLENDVPS, BLENDVPD, PBLENDVB, PBLENDW | Conditional copying of elements in one location with another, based (for non-V form) on the bits in an immediate operand, and (for V form) on the bits in register XMM0. |
PMINSB, PMAXSB, PMINUW, PMAXUW, PMINUD, PMAXUD, PMINSD, PMAXSD | Packed minimum/maximum for different integer operand types |
ROUNDPS, ROUNDSS, ROUNDPD, ROUNDSD | Round values in a floating-point register to integers, using one of four rounding modes specified by an immediate operand |
INSERTPS, PINSRB, PINSRD/PINSRQ, EXTRACTPS, PEXTRB, PEXTRD/PEXTRQ | The INSERTPS and PINSR instructions read 8, 16 or 32 bits from an x86 register or memory location and insert them into a field in the destination register given by an immediate operand. EXTRACTPS and PEXTR read a field from the source register and insert it into an x86 register or memory location. For example, PEXTRD eax, [xmm0], 1; EXTRACTPS [addr+4*eax], xmm1, 1 stores the first field of xmm1 in the address given by the first field of xmm0. |
PMOVSXBW, PMOVZXBW, PMOVSXBD, PMOVZXBD, PMOVSXBQ, PMOVZXBQ, PMOVSXWD, PMOVZXWD, PMOVSXWQ, PMOVZXWQ, PMOVSXDQ, PMOVZXDQ | Packed sign/zero extension to wider types |
PTEST | This is similar to the TEST instruction, in that it sets the Z flag to the result of an AND between its operands: ZF is set if DEST AND SRC is equal to 0. Additionally it sets the C flag if (NOT DEST) AND SRC equals zero. This is equivalent to setting the Z flag if none of the bits masked by SRC are set, and the C flag if all of the bits masked by SRC are set. |
PCMPEQQ | Quadword (64 bits) compare for equality |
PACKUSDW | Convert signed DWORDs into unsigned WORDs with saturation. |
MOVNTDQA | Efficient read from write-combining memory area into SSE register; this is useful for retrieving results from peripherals attached to the memory bus. |
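Three of the descriptions in the table above are easier to follow as a scalar model. The sketch below, in Python, mirrors the table's wording for MPSADBW, PHMINPOSUW and PTEST. It is a reference model only, not code that issues these instructions. Operands are plain Python lists or integers, the function names are borrowed from the mnemonics for readability, and the tie-breaking rule in `phminposuw` (lowest index wins) is an assumption not stated in the table.

```python
def mpsadbw(dest, src, imm3):
    """Eight sums of absolute differences between one 4-byte block of src (x)
    and eight overlapping 4-byte windows of dest (y)."""
    assert len(dest) == 16 and len(src) == 16
    y_off = 4 * ((imm3 >> 2) & 1)   # bit 2: windows start at y0 or at y4
    x_off = 4 * (imm3 & 3)          # bits 1:0: x0..x3, x4..x7, x8..x11 or x12..x15
    return [
        sum(abs(src[x_off + k] - dest[y_off + i + k]) for k in range(4))
        for i in range(8)
    ]

def phminposuw(words):
    """Return (minimum, index) for eight unsigned 16-bit words.
    Assumption: on ties the lowest index is reported."""
    assert len(words) == 8
    index = min(range(8), key=lambda i: words[i])
    return words[index], index

def ptest(dest, src):
    """Return (ZF, CF) for two 128-bit operands given as Python integers."""
    mask128 = (1 << 128) - 1
    zf = (dest & src) == 0               # no bit selected by src is set in dest
    cf = (~dest & mask128 & src) == 0    # every bit selected by src is set in dest
    return zf, cf
```

For example, `mpsadbw(list(range(16)), list(range(16)), 0)` compares the block 0, 1, 2, 3 against windows starting at offsets 0 to 7, so each result grows by 4 per step.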
SSE4.2
SSE4.2 added STTNI (String and Text New Instructions),
Windows 11 24H2 requires a CPU that supports SSE4.2; without it, the Windows kernel does not boot.[12]
Instruction | Description |
---|---|
CRC32 | Accumulate |
PCMPESTRI | Packed Compare Explicit Length Strings, Return Index |
PCMPESTRM | Packed Compare Explicit Length Strings, Return Mask |
PCMPISTRI | Packed Compare Implicit Length Strings, Return Index |
PCMPISTRM | Packed Compare Implicit Length Strings, Return Mask |
PCMPGTQ | Compare Packed Signed 64-bit data For Greater Than |
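The accumulation performed by the CRC32 instruction can be approximated in software. The sketch below is a bit-at-a-time Python model, assuming the CRC32C (Castagnoli) polynomial that the references discuss, written here in its bit-reflected form. The function names are hypothetical. The model is far slower than the instruction and is meant only to show that a running value is updated with each new chunk of data.

```python
CRC32C_POLY_REFLECTED = 0x82F63B78  # bit-reversed form of 0x1EDC6F41 (assumed polynomial)

def crc32c_update(crc, data):
    """Fold each byte of data into a running 32-bit CRC, one bit at a time."""
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (CRC32C_POLY_REFLECTED if crc & 1 else 0)
    return crc

def crc32c(data):
    """Conventional CRC-32C framing: start from all ones and invert the result."""
    return crc32c_update(0xFFFFFFFF, data) ^ 0xFFFFFFFF
```

Because the update step only depends on the running value, a message can be processed in pieces: feeding `b"1234"` and then `b"56789"` through `crc32c_update` gives the same result as feeding `b"123456789"` at once.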
POPCNT and LZCNT
These instructions operate on integer registers rather than SSE registers, because they are not SIMD instructions. They appeared at the same time as SSE4 and, although AMD introduced them with the SSE4a instruction set, they are counted as separate extensions with their own dedicated CPUID bits to indicate support. Intel implements POPCNT
beginning with the
AMD calls this pair of instructions
Instruction | Description |
---|---|
POPCNT | Population count (count number of bits set to 1). Support is indicated via the CPUID.01H:ECX.POPCNT[Bit 23] flag.[15] |
LZCNT | Leading zero count. Support is indicated via the CPUID.80000001H:ECX.ABM[Bit 5] flag.[16] |
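The CPUID flags named in the table can be decoded with simple bit tests. The sketch below is in Python, which cannot execute CPUID itself, so it assumes the caller has already obtained the raw ECX values for leaves 01H and 80000001H by some other means. The POPCNT and LZCNT (ABM) bit positions come from the table; the SSE4.1, SSE4.2 and SSE4a positions are assumptions taken from the vendors' CPUID documentation, and `decode_features` is a hypothetical helper name.

```python
# Bit positions within ECX for each CPUID leaf.
# POPCNT (23) and LZCNT/ABM (5) are stated in the table above;
# SSE4.1 (19), SSE4.2 (20) and SSE4a (6) are assumptions from vendor documentation.
LEAF_01H_ECX_BITS = {"SSE4.1": 19, "SSE4.2": 20, "POPCNT": 23}
LEAF_80000001H_ECX_BITS = {"LZCNT": 5, "SSE4a": 6}

def decode_features(ecx_01h, ecx_80000001h):
    """Map raw ECX values from the two CPUID leaves to feature booleans."""
    features = {}
    for name, bit in LEAF_01H_ECX_BITS.items():
        features[name] = bool((ecx_01h >> bit) & 1)
    for name, bit in LEAF_80000001H_ECX_BITS.items():
        features[name] = bool((ecx_80000001h >> bit) & 1)
    return features
```

Keeping POPCNT and LZCNT as separate entries reflects the point made above: each has its own flag, so software should test the specific bit for the instruction it intends to use.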
LZCNT shares its opcode with the BSR (bit scan reverse) instruction and differs from it only by a prefix. As a result, on CPUs that do not support LZCNT, such as Intel CPUs prior to Haswell, an LZCNT instruction may silently execute as BSR instead of raising an invalid-instruction exception. This matters because the two instructions return different values: LZCNT counts the zero bits above the highest set bit, whereas BSR returns the index of that bit.
Trailing zeros can be counted using the BSF (bit scan forward) or TZCNT instructions.
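The difference between the LZCNT and BSR results described above can be made concrete with a scalar model. This Python sketch assumes 32-bit operands given as non-negative integers below 2^32; the function names follow the mnemonics but are ordinary Python helpers. On hardware the BSR destination is undefined for a zero input, which the model represents by raising `ValueError`.

```python
def popcnt(x):
    """Number of bits set to 1."""
    return bin(x).count("1")

def lzcnt(x, width=32):
    """Number of zero bits above the highest set bit; equals width when x is 0."""
    return width - x.bit_length()

def bsr(x):
    """Index of the highest set bit; modelled as an error for x == 0."""
    if x == 0:
        raise ValueError("BSR result is undefined for a zero input")
    return x.bit_length() - 1

def tzcnt(x, width=32):
    """Number of zero bits below the lowest set bit; equals width when x is 0."""
    if x == 0:
        return width
    return (x & -x).bit_length() - 1
```

For any nonzero 32-bit value the two results are related by `lzcnt(x) == 31 - bsr(x)`, so code that receives a BSR result where it expected LZCNT computes with the wrong number rather than failing visibly.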
Windows 11 24H2 requires a CPU that supports POPCNT; without it, the Windows kernel does not boot.[17]
SSE4a
The SSE4a instruction group was introduced in AMD's
Instruction | Description |
---|---|
EXTRQ/INSERTQ | Combined mask-shift instructions.[18] |
MOVNTSD/MOVNTSS | Scalar streaming store instructions.[19] |
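What "combined mask-shift" means for EXTRQ and INSERTQ can be sketched as bit-field operations on a 64-bit value. The Python model below rests on assumptions that the table does not state: that the field is described by a 6-bit length and a 6-bit bit index, that a length of 0 stands for 64 bits, and that only the low 64 bits of the register take part. Combinations where index plus length exceeds 64 are not modelled faithfully. The function names are hypothetical.

```python
MASK64 = (1 << 64) - 1

def field_mask(length):
    """Mask of `length` low bits; a 6-bit length of 0 is assumed to mean 64."""
    length &= 0x3F
    return MASK64 if length == 0 else (1 << length) - 1

def extrq(value, length, index):
    """Shift the field down to bit 0, then mask it: one shift plus one AND."""
    return ((value & MASK64) >> (index & 0x3F)) & field_mask(length)

def insertq(dest, src, length, index):
    """Replace the field in dest with the low `length` bits of src."""
    index &= 0x3F
    mask = (field_mask(length) << index) & MASK64
    return (dest & ~mask & MASK64) | ((src << index) & mask)
```

Without such instructions, each of these operations takes a separate shift and mask (and, for insertion, a further OR), which is the sequence the single instruction is meant to replace.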
Supporting CPUs
- Intel
  - Silvermont processors (SSE4.1, SSE4.2 and POPCNT supported)
  - Goldmont processors (SSE4.1, SSE4.2 and POPCNT supported)
  - Goldmont Plus processors (SSE4.1, SSE4.2 and POPCNT supported)
  - Tremont processors (SSE4.1, SSE4.2 and POPCNT supported)
  - Penryn processors (SSE4.1 supported, except Pentium Dual-Core and Celeron)
  - Nehalem processors and Westmere processors (SSE4.1, SSE4.2 and POPCNT supported, except Pentium and Celeron)
  - Haswell processors and newer (SSE4.1, SSE4.2, POPCNT and LZCNT supported)
- AMD
  - K10-based processors (SSE4a, POPCNT and LZCNT supported)
  - "Cat" low-power processors
    - Bobcat-based processors (SSE4a, POPCNT and LZCNT supported)
    - Jaguar-based processors and newer (SSE4a, SSE4.1, SSE4.2, POPCNT and LZCNT supported)
    - Puma-based processors and newer (SSE4a, SSE4.1, SSE4.2, POPCNT and LZCNT supported)
  - "Heavy Equipment" processors (SSE4a, SSE4.1, SSE4.2, POPCNT and LZCNT supported)
    - Bulldozer-based processors
    - Piledriver-based processors[20]
    - Steamroller-based processors
    - Excavator-based processors and newer
  - Zen-based processors (SSE4a, SSE4.1, SSE4.2, POPCNT and LZCNT supported)
  - Zen+-based processors (SSE4a, SSE4.1, SSE4.2, POPCNT and LZCNT supported)
  - Zen2-based processors (SSE4a, SSE4.1, SSE4.2, POPCNT and LZCNT supported)
  - Zen3-based processors (SSE4a, SSE4.1, SSE4.2, POPCNT and LZCNT supported)
  - Zen4-based processors (SSE4a, SSE4.1, SSE4.2, POPCNT and LZCNT supported)
- VIA
- Zhaoxin
  - ZX-C processors and newer (SSE4.1, SSE4.2 supported)
References
- Intel Streaming SIMD Extensions 4 (SSE4) Instruction Set Innovation, Intel. Archived May 30, 2009, at the Wayback Machine.
- Tuning for Intel SSE4 for the 45nm Next Generation Intel Core Microarchitecture, Intel. Archived March 8, 2021, at the Wayback Machine.
- "Intel SSE4 Programming Reference" (PDF). Archived (PDF) from the original on February 15, 2020. Retrieved December 26, 2014.
- "'Barcelona' Processor Feature: SSE Misaligned Access". AMD. Archived from the original on August 9, 2016. Retrieved March 3, 2015.
- "Inside Intel Nehalem Microarchitecture". Archived from the original on April 2, 2015. Retrieved March 3, 2015.
- My Experience With "Conroe", DailyTech. Archived October 15, 2013, at the Wayback Machine.
- Extending the World’s Most Popular Processor Architecture, Intel. Archived November 24, 2011, at the Wayback Machine.
- "Intel - Data Center Solutions, IOT, and PC Innovation". Intel. Archived from the original on February 7, 2013. Retrieved September 17, 2009.
- Motion Estimation with Intel Streaming SIMD Extensions 4 (Intel SSE4), Intel. Archived June 16, 2018, at the Wayback Machine.
- "Schema Validation with Intel® Streaming SIMD Extensions 4 (Intel® SSE4)". Archived from the original on June 17, 2018. Retrieved February 6, 2012.
- "XML Parsing Accelerator with Intel® Streaming SIMD Extensions 4 (Intel® SSE4)". Archived from the original on June 17, 2018. Retrieved February 6, 2012.
- Klotz, Aaron (April 24, 2024). "Microsoft blocks some PCs from Windows 11 24H2 — CPU must support SSE4.2 or the OS will not boot". Tom's Hardware. Retrieved April 29, 2024.
- Intel SSE4 Programming Reference, p. 61. Archived February 15, 2020, at the Wayback Machine. See also RFC 3385 (archived June 19, 2008, at the Wayback Machine) for discussion of the CRC32C polynomial.
- Fast, Parallelized CRC Computation Using the Nehalem CRC32 Instruction, Dr. Dobbs, April 12, 2011.
- Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 2B: Instruction Set Reference, N–Z. Archived March 8, 2011, at the Wayback Machine.
- "AMD CPUID Specification" (PDF). Archived (PDF) from the original on November 1, 2013. Retrieved October 30, 2013.
- Sen, Sayan (March 17, 2024). "Microsoft fixes a misfired PopCnt block but Windows 11 24H2 requirements may be here to stay". Neowin. Retrieved March 17, 2024.
- Rahul Chaturvedi (September 17, 2007). "'Barcelona' Processor Feature: SSE4a Instruction Set". Archived from the original on October 25, 2013.
- Rahul Chaturvedi (October 2, 2007). "'Barcelona' Processor Feature: SSE4a, part 2". Archived from the original on October 25, 2013.
- "AMD FX-Series FX-6300 - FD6300WMW6KHK / FD6300WMHKBOX". Archived from the original on August 17, 2017. Retrieved October 9, 2015.
External links
- SSE4 Programming Reference by Intel
- PCMPSTR calculator for the SSE4.2 string instructions, archived at Ghostarchive.org on May 10, 2022