Explicit data graph execution

Intel x86

line. EDGE combines many individual instructions into a larger group known as a "hyperblock". Hyperblocks are designed to be able to easily run in parallel.

Parallelism in modern CPU designs generally starts to plateau at about eight internal units and from one to four "cores"; EDGE designs intend to support hundreds of internal units and offer processing speeds hundreds of times greater than existing designs. Major development of the EDGE concept was led by the University of Texas at Austin under DARPA's Polymorphous Computing Architectures program, with the stated goal of producing a single-chip CPU design with 1 TFLOPS performance by 2012, which had yet to be realized as of 2018.^[1]

Traditional designs

Almost all computer programs consist of a series of instructions that convert data from one form to another. Most instructions require several internal steps to complete an operation. Over time, the relative performance and cost of the different steps have changed dramatically, resulting in several major shifts in ISA design.

CISC to RISC

In the 1960s

MOS 6502 has eight instructions (opcodes) for performing addition, differing only in where they collect their operands.^[2]

Actually making these instructions work required circuitry in the CPU, which was a significant limitation in early designs and required designers to select just those instructions that were really needed. In 1964,

CISC

(Complex Instruction Set Computing).

In 1975 IBM started a project to develop a telephone switch that required performance about three times that of their fastest contemporary computers. To reach this goal, the development team began to study the massive amount of performance data IBM had collected over the last decade. This study demonstrated that the complex ISA was in fact a significant problem; because only the most basic instructions were guaranteed to be implemented in hardware, compilers ignored the more complex ones that only ran in hardware on certain machines. As a result, the vast majority of a program's time was being spent in only five instructions. Further, even when the program called one of those five instructions, the microcode required a finite time to decode it, even if it was just to call the internal hardware. On faster machines, this overhead was considerable.^[4]

Their work, known at the time as the

RISC (Reduced Instruction Set Computing) concept. Microcode was removed, and only the most basic versions of any given instruction were put into the CPU. Any more complex code was left to the compiler. The removal of so much circuitry, about 1⁄3 of the transistors in the Motorola 68000 for instance, allowed the CPU to include more registers, which had a direct impact on performance. By the mid-1980s, further developed versions of these basic concepts were delivering performance as much as 10 times that of the fastest CISC designs, in spite of using less-developed fabrication.^[4]

Internal parallelism

In the 1990s the chip design and fabrication process grew to the point where it was possible to build a commodity processor with every potential feature built into it. Units that were previously on separate chips, like floating point units and memory management units, could now be combined onto the same die, producing all-in-one designs. This allowed different types of instructions to be executed at the same time, improving overall system performance. In the later 1990s, single instruction, multiple data (SIMD) units were also added, and more recently, AI accelerators.

While these additions improve overall system performance, they do not improve the performance of programs which are primarily operating on basic logic and

superscalar

". In any program there are instructions that work on unrelated data, so by adding more functional units these instructions can be run at the same time. A new portion of the CPU, the scheduler, looks for these independent instructions and feeds them into the units, taking their outputs and re-ordering them so externally it appears they ran in succession.

The amount of parallelism that can be extracted in superscalar designs is limited by the number of instructions that the scheduler can examine for interdependencies. Examining a greater number of instructions can improve the chance of finding an instruction that can be run in parallel, but only at the cost of increasing the complexity of the scheduler itself. Despite massive efforts, CPU designs using classic RISC or CISC ISAs plateaued by the late 2000s. Intel's Haswell designs of 2013 have a total of eight dispatch units,^[5] and adding more significantly complicates the design and increases power demands.^[6]
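
The dependency search described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the three-field instruction tuple and the function name `issue_groups` are invented for the example, dependencies are treated conservatively, and real schedulers also deal with register renaming and memory dependencies.

```python
from typing import List, Tuple

# Hypothetical instruction format: (destination, source_a, source_b).
Instr = Tuple[str, str, str]

def issue_groups(window: List[Instr]) -> List[List[int]]:
    """Greedily pack instructions into groups that can run in the same cycle.

    An instruction is placed one cycle after the latest earlier instruction
    it conflicts with (read-after-write, write-after-read or write-after-write).
    """
    cycle_of: List[int] = []
    for i, (dst, a, b) in enumerate(window):
        cycle = 0
        for j in range(i):
            pdst, pa, pb = window[j]
            raw = pdst in (a, b)   # reads a value that j produces
            war = dst in (pa, pb)  # overwrites a value that j still reads
            waw = dst == pdst      # both write the same register
            if raw or war or waw:
                cycle = max(cycle, cycle_of[j] + 1)
        cycle_of.append(cycle)
    groups: List[List[int]] = [[] for _ in range(max(cycle_of, default=-1) + 1)]
    for i, c in enumerate(cycle_of):
        groups[c].append(i)
    return groups

window = [
    ("r1", "r2", "r3"),  # 0: r1 = r2 op r3
    ("r4", "r5", "r6"),  # 1: independent of 0
    ("r7", "r1", "r4"),  # 2: needs the results of 0 and 1
    ("r8", "r5", "r2"),  # 3: independent of everything above
]
print(issue_groups(window))  # [[0, 1, 3], [2]]
```

The nested loop compares every instruction with every earlier one in the window, so the work grows roughly with the square of the window size, which hints at why widening the window is costly.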

Additional performance can be wrung from systems by examining the instructions to find ones that operate on different types of data and adding units dedicated to that sort of data; this led to the introduction of on-board floating point units in the 1980s and 90s and, more recently, single instruction, multiple data (SIMD) units. The drawback to this approach is that it makes the CPU less generic; feeding the CPU with a program that uses almost all floating point instructions, for instance, will bog down the FPUs while the other units sit idle.

A more recent problem in modern CPU designs is the delay talking to the registers. In general terms the size of the CPU die has remained largely the same over time, while the units within the CPU have grown much smaller as more and more of them were added. That means that the relative distance between any one functional unit and the global register file has grown over time. Originally introduced to avoid delays in talking to main memory, the global register file has itself become a source of delay worth avoiding.

A new ISA?

Just as the delays talking to memory while its price fell suggested a radical change in ISA (Instruction Set Architecture) from CISC to RISC, designers are considering whether the problems of scaling parallelism and the increasing delays talking to registers demand another switch in basic ISA.

Among the ways to introduce a new ISA are the very long instruction word (VLIW) architectures, typified by the Itanium. VLIW moves the scheduler logic out of the CPU and into the compiler, where it has much more memory and longer timelines to examine the instruction stream. This static placement, static issue execution model works well when all delays are known, but in the presence of cache latencies, filling instruction words has proven to be a difficult challenge for the compiler.^[7] An instruction that might take five cycles if the data is in the cache could take hundreds if it is not, but the compiler has no way to know whether that data will be in the cache at runtime – that's determined by overall system load and other factors that have nothing to do with the program being compiled.
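
The cost of a wrong latency guess can be illustrated with a toy model. The sketch below assumes a chain of dependent operations and a machine that stalls entirely whenever an operation takes longer than the compiler planned; the function name and the 200-cycle miss latency are illustrative assumptions, not measurements of any real processor.

```python
def static_schedule_cycles(assumed_latencies, actual_latencies):
    """Return (planned, actual) cycle counts for a chain of dependent operations.

    The compiler spaces each operation by its assumed latency. In this
    simplified model, any extra delay at run time stalls the whole machine.
    """
    planned = sum(assumed_latencies)
    stalls = sum(max(0, actual - assumed)
                 for assumed, actual in zip(assumed_latencies, actual_latencies))
    return planned, planned + stalls

assumed = [5, 1, 1]   # load assumed to hit the cache (5 cycles), then two ALU ops
hit = [5, 1, 1]       # the load really does hit
miss = [200, 1, 1]    # the load misses; 200 cycles is an illustrative figure

print(static_schedule_cycles(assumed, hit))   # (7, 7)
print(static_schedule_cycles(assumed, miss))  # (7, 202)
```

The schedule is identical in both runs; only the run-time cache behaviour differs, which is information the compiler does not have.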

The key performance bottleneck in traditional designs is that the data and the instructions that operate on them are theoretically scattered about memory. Memory performance dominates overall performance, and classic dynamic placement, dynamic issue designs seem to have reached the limit of their performance capabilities. VLIW uses a static placement, static issue model, but has proven difficult to master because the runtime behavior of programs is difficult to predict and properly schedule in advance.

EDGE

Theory

EDGE architectures are a new class of ISAs based on a static placement, dynamic issue design. EDGE systems compile source code into a form consisting of statically allocated hyperblocks, each containing many individual instructions (hundreds or thousands). These hyperblocks are then scheduled dynamically by the CPU. EDGE thus combines the advantage of the VLIW concept, looking for independent data at compile time, with that of the superscalar RISC concept, executing instructions when the data for them becomes available.

In the vast majority of real-world programs, the linkage of data and instructions is both obvious and explicit. Programs are divided into small blocks referred to as

high level language is converted into the processor's much simpler ISA. But this information is so useful that modern compilers have generalized the concept as the "basic block", attempting to identify them within programs while they optimize memory access through the registers. A block of instructions does not have control statements but can have predicated instructions. The dataflow graph is encoded using these blocks, by specifying the flow of data from one block of instructions to another, or to some storage area.

The basic idea of EDGE is to directly support and operate on these blocks at the ISA level. Since basic blocks access memory in well-defined ways, the processor can load up related blocks and schedule them so that the output of one block feeds directly into the one that will consume its data. This eliminates the need for a global register file, and simplifies the compiler's task in scheduling access to the registers by the program as a whole – instead, each basic block is given its own local registers and the compiler optimizes access within the block, a much simpler task.

EDGE systems bear a strong resemblance to dataflow languages from the 1960s–1970s, and again in the 1990s. Dataflow computers execute programs according to the "dataflow firing rule", which stipulates that an instruction may execute at any time after its operands are available. Due to the isolation of data, similar to EDGE, dataflow languages are inherently parallel, and interest in them followed the more general interest in massive parallelism as a solution to general computing problems. Studies based on existing CPU technology at the time demonstrated that it would be difficult for a dataflow machine to keep enough data near the CPU to be widely parallel, and it is precisely this bottleneck that modern fabrication techniques can solve by placing hundreds of CPUs and their memory on a single die.
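
The dataflow firing rule can be shown with a small interpreter. The following is a minimal sketch, assuming a graph described as a dictionary of named nodes; the `Graph` format and the function `run_dataflow` are hypothetical and do not correspond to any particular dataflow language or machine.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical node format: name -> (function, list of input names).
Graph = Dict[str, Tuple[Callable[..., float], List[str]]]

def run_dataflow(graph: Graph, inputs: Dict[str, float]) -> Tuple[Dict[str, float], List[List[str]]]:
    """Evaluate a dataflow graph, firing every node whose operands are ready.

    Returns the computed values and the list of steps; nodes in the same
    step fired together and could therefore have run in parallel.
    """
    values = dict(inputs)
    pending = set(graph)
    steps: List[List[str]] = []
    while pending:
        ready = sorted(n for n in pending
                       if all(src in values for src in graph[n][1]))
        if not ready:
            raise ValueError("graph has a cycle or a missing input")
        # Compute the whole step from values that were available before it began.
        results = {n: graph[n][0](*(values[s] for s in graph[n][1])) for n in ready}
        values.update(results)
        pending.difference_update(ready)
        steps.append(ready)
    return values, steps

# (a + b) * (c - d): the add and the subtract share no operands.
graph: Graph = {
    "sum":  (lambda x, y: x + y, ["a", "b"]),
    "diff": (lambda x, y: x - y, ["c", "d"]),
    "prod": (lambda x, y: x * y, ["sum", "diff"]),
}
values, steps = run_dataflow(graph, {"a": 1.0, "b": 2.0, "c": 7.0, "d": 3.0})
print(steps)           # [['diff', 'sum'], ['prod']]
print(values["prod"])  # 12.0
```

No program order is given anywhere in the graph; the order of execution falls out of operand availability alone.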

Another reason that dataflow systems never became popular is that compilers of the era found it difficult to work with common imperative languages like C++. Instead, most dataflow systems used dedicated languages like Prograph, which limited their commercial interest. A decade of compiler research has eliminated many of these problems, and a key difference between dataflow and EDGE approaches is that EDGE designs intend to work with commonly used languages.

CPUs

An EDGE-based CPU would consist of one or more small block engines with their own local registers; realistic designs might have hundreds of these units. The units are interconnected to each other using dedicated inter-block communication links. Due to the information encoded into the block by the compiler, the scheduler can examine an entire block to see if its inputs are available and send it into an engine for execution – there is no need to examine the individual instructions within.

With a small increase in complexity, the scheduler can examine multiple blocks to see if the outputs of one are fed in as the inputs of another, and place these blocks on units that reduce their inter-unit communications delays. If a modern CPU examines a thousand instructions for potential parallelism, the same complexity in EDGE allows it to examine a thousand hyperblocks, each one consisting of hundreds of instructions. This gives the scheduler considerably better scope for no additional cost. It is this pattern of operation that gives the concept its name; the "graph" is the string of blocks connected by the data flowing between them.
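
Block-level scheduling of this kind can be sketched as follows. The example assumes each block carries a compiler-provided header naming its inputs and outputs; the `Block` class, its fields and the `schedule_blocks` function are invented for illustration and are not taken from any EDGE implementation.

```python
from dataclasses import dataclass
from typing import List, Set

@dataclass
class Block:
    """Hypothetical hyperblock header: all that the scheduler inspects."""
    name: str
    inputs: Set[str]
    outputs: Set[str]
    n_instructions: int

def schedule_blocks(blocks: List[Block], available: Set[str], engines: int) -> List[List[str]]:
    """Dispatch ready blocks to at most `engines` block engines per round."""
    available = set(available)
    waiting = list(blocks)
    rounds: List[List[str]] = []
    while waiting:
        # A block is ready when every input it names has been produced.
        ready = [b for b in waiting if b.inputs <= available][:engines]
        if not ready:
            raise ValueError("no block can run: an input is never produced")
        for b in ready:
            available |= b.outputs
        waiting = [b for b in waiting if b not in ready]
        rounds.append([b.name for b in ready])
    return rounds

blocks = [
    Block("load",  {"ptr"},              {"x", "y"},  100),
    Block("fx",    {"x"},                {"fx_out"},  120),
    Block("fy",    {"y"},                {"fy_out"},   90),
    Block("merge", {"fx_out", "fy_out"}, {"result"},   60),
]
print(schedule_blocks(blocks, {"ptr"}, engines=2))  # [['load'], ['fx', 'fy'], ['merge']]
print(sum(b.n_instructions for b in blocks))        # 370
```

The scheduler here looks at four headers rather than 370 individual instructions, and the chain of blocks linked by their named inputs and outputs is the "graph" referred to above.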

Another advantage of the EDGE concept is that it is massively scalable. A low-end design could consist of a single block engine with a stub scheduler that simply sends in blocks as they are called by the program. An EDGE processor intended for desktop use would instead include hundreds of block engines. Critically, all that changes between these designs is the physical layout of the chip and private information that is known only by the scheduler; a program written for the single-unit machine would run without any changes on the desktop version, albeit thousands of times faster. Power scaling is likewise dramatically improved and simplified; block engines can be turned on or off as required with a linear effect on power consumption.

Perhaps the greatest advantage to the EDGE concept is that it is suitable for running any sort of data load. Unlike modern CPU designs where different portions of the CPU are dedicated to different sorts of data, an EDGE CPU would normally consist of a single type of ALU-like unit. A desktop user running several different programs at the same time would get just as much parallelism as a scientific user feeding in a single program using floating point only; in both cases the scheduler would simply load every block it could into the units. At a low level the performance of the individual block engines would not match that of a dedicated FPU, for instance, but the design would attempt to overwhelm any such advantage through massive parallelism.

Implementations

TRIPS

The University of Texas at Austin was developing an EDGE ISA known as TRIPS. In order to simplify the microarchitecture of a CPU designed to run it, the TRIPS ISA imposes several well-defined constraints on each TRIPS hyperblock. Each hyperblock must:

have at most 128 instructions,
issue at most 32 loads and/or stores,
issue at most 32 register bank reads and/or writes,
have one branch decision, used to indicate the end of a block.
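
The limits listed above lend themselves to a simple check. The sketch below is a hypothetical validator, assuming a compiler pass has already counted the relevant events per block; the `HyperblockStats` class and `violations` function are illustrative names, not part of the TRIPS toolchain.

```python
from dataclasses import dataclass
from typing import List

# Limits taken from the list above.
MAX_INSTRUCTIONS = 128
MAX_LOADS_STORES = 32
MAX_REGISTER_ACCESSES = 32
REQUIRED_BRANCH_DECISIONS = 1

@dataclass
class HyperblockStats:
    """Hypothetical per-block counts; not a real TRIPS compiler structure."""
    instructions: int
    loads_stores: int
    register_accesses: int
    branch_decisions: int

def violations(block: HyperblockStats) -> List[str]:
    """Return a description of every constraint the block breaks."""
    problems: List[str] = []
    if block.instructions > MAX_INSTRUCTIONS:
        problems.append("too many instructions")
    if block.loads_stores > MAX_LOADS_STORES:
        problems.append("too many loads/stores")
    if block.register_accesses > MAX_REGISTER_ACCESSES:
        problems.append("too many register bank accesses")
    if block.branch_decisions != REQUIRED_BRANCH_DECISIONS:
        problems.append("must have exactly one branch decision")
    return problems

ok = HyperblockStats(128, 32, 10, 1)
too_big = HyperblockStats(200, 8, 40, 1)
print(violations(ok))       # []
print(violations(too_big))  # ['too many instructions', 'too many register bank accesses']
```

A compiler forming hyperblocks would presumably split or shrink any block for which such a check fails, although how the TRIPS compiler actually does this is not described here.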

The TRIPS compiler statically bundles instructions into hyperblocks, but also statically compiles these blocks to run on particular ALUs. This means that TRIPS programs have some dependency on the precise implementation they are compiled for.

In 2003 they produced a sample TRIPS prototype with sixteen block engines in a 4 by 4 grid, along with a megabyte of local cache and transfer memory. A single chip version of TRIPS, fabbed by IBM in Canada using a 130 nm process, contains two such "grid engines" along with shared level-2 cache and various support systems. Four such chips and a gigabyte of RAM are placed together on a daughter-card for experimentation.

The TRIPS team had set an ultimate goal of producing a single-chip implementation capable of running at a sustained performance of 1 TFLOPS, roughly 60 times the performance of high-end commodity CPUs available in 2008 (the dual-core Xeon 5160 provides about 17 GFLOPS).

CASH

intermediate code called "Pegasus".^[8]

CASH and TRIPS are very similar in concept, but CASH is not targeted to produce output for a specific architecture, and therefore has no hard limits on the block layout.

WaveScalar

The University of Washington's WaveScalar architecture is substantially similar to EDGE, but does not statically place instructions within its "waves". Instead, special instructions (phi and rho) mark the boundaries of the waves and allow scheduling.^[9]

References

Citations

^ University of Texas at Austin, "TRIPS : One Trillion Calculations per Second by 2012"
^ Pickens, John (17 October 2020). "NMOS 6502 Opcodes".
^ Shirriff, Ken. "Simulating the IBM 360/50 mainframe from its microcode".
^ doi:10.1147/rd.341.0004.
^ Shimpi, Anand Lal (5 October 2012). "Intel's Haswell Architecture Analyzed: Building a New PC and a New Intel". AnandTech.
^ doi:10.1145/1394608.1382169.
^ W. Havanki, S. Banerjia, and T. Conte. "Treegion scheduling for wide-issue processors", in Proceedings of the Fourth International Symposium on High-Performance Computer Architectures, January 1998, pp. 266–276.
^ "Phoenix Project"
^ "The WaveScalar ISA"

Bibliography

University of Texas at Austin, "TRIPS Technical Overview"

A. Smith et al., "Compiling for EDGE Architectures", 2006 International Conference on Code Generation and Optimization, March, 2006
