Multiply–accumulate operation
In computing, especially digital signal processing, the multiply–accumulate (MAC) or multiply–add (MAD) operation is a common step that computes the product of two numbers and adds that product to an accumulator. The hardware unit that performs the operation is known as a multiplier–accumulator (MAC unit); the operation itself is also often called a MAC or a MAD operation. The MAC operation modifies an accumulator a:

a ← a + (b × c)

When done with integers, the operation is typically exact (computed modulo some power of two).
Modern computers may contain a dedicated MAC unit, consisting of a multiplier implemented in combinational logic, followed by an adder and an accumulator register that stores the result; the register's output is fed back to one input of the adder, so each clock cycle adds a new product to the running total.
In floating-point arithmetic
When done with floating-point numbers, the multiply and the add may each be rounded (two roundings in total), or the whole operation may be performed with a single rounding at the end. A single-rounding multiply–accumulate is called a fused multiply–add (FMA) or fused multiply–accumulate (FMAC).
Fused multiply–add
A fused multiply–add (FMA or fmadd)[7] is a floating-point multiply–add operation performed in one step, with a single rounding. That is, where an unfused multiply–add would compute the product b × c, round it to N significant bits, add the result to a, and round back to N significant bits, a fused multiply–add computes the entire expression a + (b × c) to its full precision before rounding the final result down to N significant bits.
A fast FMA can speed up and improve the accuracy of many computations that involve the accumulation of products:
- Dot product
- Matrix multiplication
- Horner's rule
- Newton's method for evaluating functions (from the inverse function)
- Artificial neural networks
- Multiplication in double-double arithmetic
Fused multiply–add can usually be relied on to give more accurate results. However, William Kahan has pointed out that it can give problems if used unthinkingly.[8] If x·x − y·y is evaluated with a fused multiply–add as fma(x, x, −(y·y)), the result can be nonzero, and even negative, when x = y: the first product is kept at full precision inside the FMA, while the second has already been rounded.
When implemented inside a microprocessor, an FMA can be faster than a multiply operation followed by an add. However, standard industrial implementations based on the original IBM RS/6000 design require a 2N-bit adder to compute the sum properly.[9]
Another benefit of including this instruction is that it allows an efficient software implementation of division (see division algorithm) and square root (see methods of computing square roots) operations, thus eliminating the need for dedicated hardware for those operations.[10]
Dot product instruction
Some machines combine multiple fused multiply–add operations into a single step, e.g. performing a four-element dot product on two 128-bit SIMD registers, a0×b0 + a1×b1 + a2×b2 + a3×b3, with single-cycle throughput.
Support
The FMA operation is included in IEEE 754-2008.
The Digital Equipment Corporation (DEC) VAX's POLY instruction is used for evaluating polynomials with Horner's rule using a succession of multiply and add steps; the instruction description does not specify whether the multiply and add are performed with a single FMA step.[11]
The 1999 standard of the C programming language supports the FMA operation through the fma() standard math library function, as well as the automatic transformation of a multiplication followed by an addition (contraction of floating-point expressions), which can be explicitly enabled or disabled with a standard pragma (#pragma STDC FP_CONTRACT). The GCC and Clang C compilers perform such transformations by default for processor architectures that support FMA instructions. With GCC, which does not support the aforementioned pragma,[12] this can be globally controlled by the -ffp-contract command-line option.[13]
The fused multiply–add operation was introduced as "multiply–add fused" in the IBM POWER1 (1990) processor,[14] but has been added to numerous other processors since then:
- HP PA-8000 (1996) and above
- Hitachi SuperH SH-4 (1998)
- Sony/Toshiba Emotion Engine (1999)
- Intel Itanium (2001)
- STI Cell (2006)
- SPARC64 VI (2007) and above
- (MIPS-compatible) Loongson-2F (2008)[15]
- Elbrus-8SV (2018)
- x86 processors with FMA3 and/or FMA4 instruction set
- AMD Bulldozer (2011, FMA4 only)
- AMD Piledriver (2012, FMA3 and FMA4)[16]
- AMD Steamroller (2014)
- AMD Excavator (2015)
- AMD Zen (2017, FMA3 only)
- Intel Haswell (2013, FMA3 only)[17]
- Intel Skylake (2015, FMA3 only)
- ARM processors with VFPv4 and/or NEONv2:
- ARM Cortex-M4F (2010)
- STM32 Cortex-M33 (VFMA operation)[18]
- ARM Cortex-A5 (2012)
- ARM Cortex-A7 (2013)
- ARM Cortex-A15 (2012)
- Qualcomm Krait (2012)
- Apple A6 (2012)
- All ARMv8 processors
- Fujitsu A64FX has "Four-operand FMA with Prefix Instruction".
- IBM z/Architecture (since 1998)
- GPUs and GPGPU boards:
- AMD GPUs (2009) and newer
- TeraScale 2 "Evergreen"-series based
- Graphics Core Next-based
- Nvidia GPUs (2010) and newer
- Intel GPUs since Sandy Bridge
- Intel MIC (2012)
- ARM Mali T600 Series (2012) and above
- Vector Processors:
- RISC-V instruction set (2010)
See also
- Compound operator
References
- ^ "The Feasibility of Ludgate's Analytical Machine". Archived from the original on 2019-08-07. Retrieved 2020-08-30.
- ^ "mad - ps". Retrieved 2021-08-14.
- ^ Whitehead, Nathan; Fit-Florea, Alex (2011). "Precision & Performance: Floating Point and IEEE 754 Compliance for NVIDIA GPUs" (PDF). nvidia. Retrieved 2013-08-31.
- ^ "fmadd instrs". IBM.
- ^ Kahan, William (1996-05-31). "IEEE Standard 754 for Binary Floating-Point Arithmetic".
- ^ Quinnell, Eric (May 2007). Floating-Point Fused Multiply–Add Architectures (PDF) (PhD thesis). Retrieved 2011-03-28.
- ^ "VAX instruction of the week: POLY". Archived from the original on 2020-02-13.
- ^ "Bug 20785 - Pragma STDC * (C99 FP) unimplemented". gcc.gnu.org. Retrieved 2022-02-02.
- ^ "Optimize Options (Using the GNU Compiler Collection (GCC))". gcc.gnu.org. Retrieved 2022-02-02.
- ^ Montoye, R. K.; Hokenek, E.; Runyon, S. L. (1990). "Design of the IBM RISC System/6000 floating-point execution unit". IBM Journal of Research and Development. 34 (1). doi:10.1147/rd.341.0059.
- ^ "Godson-3 Emulates x86: New MIPS-Compatible Chinese Processor Has Extensions for x86 Translation".
- ^ Hollingsworth, Brent (October 2012). "New "Bulldozer" and "Piledriver" Instructions". AMD Developer Central.
- ^ "Intel adds 22nm octo-core 'Haswell' to CPU design roadmap". The Register. Archived from the original on 2012-02-17. Retrieved 2008-08-19.
- ^ "STM32 Cortex-M33 MCUs programming manual" (PDF). ST. Retrieved 2024-05-06.