Cray XMT
Endianness | Big-endian |
---|---|
Predecessor | Cray MTA-2 |
Successor | Cray XMT2 |
Registers | 32 general-purpose per stream (4096 per CPU), 8 target per stream (1024 per CPU) |
Cray XMT (Cray eXtreme MultiThreading,
Cray XMT uses a scrambled[3] content-addressable memory[6] model on DDR1 ECC modules to implicitly load-balance memory access across the whole shared global address space of the system.[5] Use of 4 additional Extended Memory Semantics bits (full/empty, forwarding and 2 trap bits) per 64-bit memory word enables lightweight, fine-grained synchronization on all memory.[7] There are no hardware interrupts and hardware threads are allocated by an instruction, not the OS.[5][7]
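The full/empty bit described above can be illustrated with a small software model. The following is a minimal sketch in Python, not Cray's API or the hardware mechanism: the class `FullEmptyWord` and its methods `write_ef` and `read_fe` are hypothetical names chosen for this example, and the full/empty bit is emulated with a condition variable. A write waits until the word is empty and then marks it full; a read waits until the word is full and then marks it empty, so a single word acts as a one-slot handoff between two threads.

```python
import threading


class FullEmptyWord:
    """One memory word with a full/empty bit (toy model, hypothetical names)."""

    def __init__(self):
        self._value = 0
        self._full = False                 # the word starts out empty
        self._cond = threading.Condition()

    def write_ef(self, value):
        """Wait until the word is empty, store the value, mark the word full."""
        with self._cond:
            self._cond.wait_for(lambda: not self._full)
            self._value = value
            self._full = True
            self._cond.notify_all()

    def read_fe(self):
        """Wait until the word is full, take the value, mark the word empty."""
        with self._cond:
            self._cond.wait_for(lambda: self._full)
            self._full = False
            self._cond.notify_all()
            return self._value


def run_demo(n_items=5):
    """Pass n_items squares from a producer thread to a consumer thread."""
    word = FullEmptyWord()
    received = []

    def producer():
        for i in range(n_items):
            word.write_ef(i * i)

    def consumer():
        for _ in range(n_items):
            received.append(word.read_fe())

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received
```

In this model the producer cannot overwrite a value that has not been consumed, and the consumer cannot read the same value twice, without any separate lock object being visible to the caller. The forwarding and trap bits are not modeled.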
Front-end (login, I/O, and other service nodes, utilizing
Threadstorm3
General information | |
---|---|
Launched | 2005 |
Discontinued | 2011 |
Designed by | Cray |
Performance | |
Max. CPU clock rate | 500 MHz |
HyperTransport speeds | to 300 GT/s |
Architecture and classification | |
Instruction set | MTA ISA |
Physical specifications | |
Cores | 1 |
Socket(s) | Socket 940 |
History | |
Predecessor(s) | Cray MTA-2 CPU |
Successor(s) | Threadstorm4 |
Threadstorm3 (referred to as "MT processor"[2] and as Threadstorm before XMT2[8]) is a 64-bit single-core VLIW barrel processor (compatible with the 940-pin Socket 940 used by AMD Opteron processors) with 128 hardware streams, onto each of which a software thread can be mapped (effectively creating 128 hardware threads per CPU). It runs at 500 MHz and uses the MTA instruction set or a superset of it.[7][9][nb 1] It has a 128 KB, 4-way associative data buffer. Each Threadstorm3 has 128 separate register sets and program counters (one per stream), which are fairly[10] fully context-switched at each cycle.[5] Its estimated peak performance is 1.5 GFLOPS. It has 3 functional units (memory, fused multiply-add and control), which receive operations from the same MTA instruction and operate within the same cycle.[7] Each stream has 32 general-purpose registers, 8 target registers and a status word containing the program counter.[6] High-level control of job allocation across threads is not possible.[5][nb 2] Because of the MTA's pipeline length of 21, each stream can be selected to execute another instruction no sooner than 21 cycles after its previous one.[11] The TDP of the processor package is 30 W.[12]
Because they switch thread context at each cycle, the performance of Threadstorm CPUs is not constrained by memory access time. In a simplified model, at each clock cycle an instruction from one of the threads is executed and another memory request is queued, with the understanding that by the time the next round of execution is ready the requested data has arrived.[13] This contrasts with many conventional architectures, which stall on memory access. The architecture excels in data-walking schemes where subsequent memory accesses cannot be easily predicted and which would therefore not be well suited to a conventional cache model.[1] Threadstorm's principal architect was Burton J. Smith.[1]
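The simplified model above, combined with the 21-cycle reissue limit, can be sketched as a toy simulation. This is an illustrative assumption-laden model, not a description of the actual Threadstorm scheduler: the function `simulate_barrel` and its parameters are hypothetical, memory latency is not modeled separately, and streams are assumed to always have work. At most one instruction issues per cycle, and a stream that issued at cycle t may not issue again before cycle t + 21.

```python
def simulate_barrel(num_streams, cycles, reissue_interval=21):
    """Return the fraction of cycles in which some stream issued an instruction.

    Toy model: at most one instruction issues per cycle, and a stream that
    issued at cycle t may not issue again before cycle t + reissue_interval.
    """
    next_ready = [0] * num_streams    # earliest cycle each stream may issue
    issued = 0
    cursor = 0                        # round-robin starting point
    for cycle in range(cycles):
        # Scan the streams in round-robin order for the first ready one.
        for offset in range(num_streams):
            s = (cursor + offset) % num_streams
            if next_ready[s] <= cycle:
                next_ready[s] = cycle + reissue_interval
                issued += 1
                cursor = (s + 1) % num_streams
                break
    return issued / cycles


if __name__ == "__main__":
    for n in (1, 7, 21, 128):
        print(n, simulate_barrel(n, cycles=2100))
```

Under these assumptions the issue slots are fully used only once at least 21 streams are active; with fewer streams the utilization falls to the stream count divided by 21, which is why a single thread on such a processor runs far below the peak rate.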
Cray XMT2
Endianness | Big-endian |
---|---|
Predecessor | Cray XMT |
Registers | 32 general-purpose per stream (4096 per CPU), 8 target per stream (1024 per CPU) |
Cray XMT2
Threadstorm4
AMD Opteron processors) with 128 hardware streams, very similar to its predecessor, Threadstorm3. It features an improved, DDR2-capable memory controller and 8 additional trap registers per stream. Cray intentionally decided against a DDR3 controller, citing the reuse of existing Cray XT5 infrastructure[nb 4] and the shorter burst length of DDR2 compared with DDR3.[nb 5] Though the longer burst length could be compensated for by the higher speeds of DDR3, it would also require more power, which Cray engineers wanted to avoid.[8]
Scorpio
After launching XMT, Cray researched a possible multi-core variant of the Threadstorm3, dubbed Scorpio. Most of Threadstorm3's features would be retained, including the multiplexing of many hardware streams onto an execution pipeline and the implementation of additional state bits for every 64-bit memory word. Cray later abandoned Scorpio, and the project yielded no manufactured chip.[3]
Future
Development of Threadstorm4, as well as of the whole MTA architecture, ended silently after XMT2, probably due to competition from commodity processors such as Intel's Xeon[14] and possibly Xeon Phi, even though Cray never officially discontinued either XMT or XMT2. As of 2020, Cray has removed all customer documentation on both XMT and XMT2 from its online catalogue.
Users
Cray XMT2 was bought by several federal laboratories and academic facilities, as well as some commercial HPC clients, e.g. CSCS (2 TB global memory with 64 Threadstorm4 CPUs)[15] and Noblis CAHPC.[16] Most XMT- and XMT2-based systems had been decommissioned by 2020.
Notes
References