Scratchpad memory

Source: Wikipedia, the free encyclopedia.

Scratchpad memory (SPM), also known as scratchpad, scratchpad RAM or local store in

CPU
), scratchpad refers to a special high-speed memory used to hold small items of data for rapid retrieval. It is similar to the usage and size of a scratchpad in life: a pad of paper for preliminary notes or sketches or writings, etc. When the scratchpad is a hidden portion of the main memory then it is sometimes referred to as bump storage.

In some systems

main memory, often using DMA-based data transfer.[1] In contrast to a system that uses caches, a system with scratchpads is a system with non-uniform memory access
(NUMA) latencies, because the memory access latencies to the different scratchpads and the main memory vary. Another difference from a system that employs caches is that a scratchpad commonly does not contain a copy of data that is also stored in the main memory.

Scratchpads are employed for simplification of caching logic, and to guarantee a unit can work without main memory contention in a system employing multiple processors, especially in

realtime applications
, where predictable timing is hindered by cache behavior.

Scratchpads are not used in mainstream desktop processors where generality is required for

MPSoC
, and where software is often tuned to one hardware configuration.

Examples of use

Alternatives

Cache control vs scratchpads

Some architectures such as PowerPC attempt to avoid the need for cacheline locking or scratchpads through the use of

cache control instructions
. Marking an area of memory with "Data Cache Block: Zero" (allocating a line but setting its contents to zero instead of loading from main memory) and discarding it after use ('Data Cache Block: Invalidate', signaling that main memory didn't receive any updated data) the cache is made to behave as a scratchpad. Generality is maintained in that these are hints and the underlying hardware will function correctly regardless of actual cache size.

Shared L2 vs Cell local stores

Regarding interprocessor communication in a multicore setup, there are similarities between the Cell's inter-localstore DMA and a shared L2 cache setup as in the Intel Core 2 Duo or the Xbox 360's custom powerPC: the L2 cache allows processors to share results without those results having to be committed to main memory. This can be an advantage where the working set for an algorithm encompasses the entirety of the L2 cache. However, when a program is written to take advantage of inter-localstore DMA, the Cell has the benefit of each-other-Local-Store serving the purpose of BOTH the private workspace for a single processor AND the point of sharing between processors; i.e., the other Local Stores are on a similar footing viewed from one processor as the shared L2 cache in a conventional chip. The tradeoff is that of memory wasted in buffering and programming complexity for synchronization, though this would be similar to precached pages in a conventional chip. Domains where using this capability is effective include:

  • Pipeline processing (where one achieves the same effect as increasing the L1 cache's size by splitting one job into smaller chunks)
  • Extending the working set, e.g., a sweet spot for a merge sort where the data fits within 8×256 KB
  • Shared code uploading, like loading a piece of code to one SPU, then copy it from there to the others to avoid hitting the main memory again

It would be possible for a conventional processor to gain similar advantages with cache-control instructions, for example, allowing the prefetching to the L1 bypassing the L2, or an eviction hint that signaled a transfer from L1 to L2 but not committing to main memory; however, at present no systems offer this capability in a usable form and such instructions in effect should mirror explicit transfer of data among cache areas used by each core.

See also

Notes

  1. UNIVAC 1107
    , all addressable registers were held in scratchpad.

References

  1. ^ Steinke, Stefan; Lars Wehmeyer; Bo-Sik Lee; Peter Marwedel (2002). "Assigning Program and Data Objects to Scratchpad for Energy Reduction" (PDF). University of Dortmund. Retrieved 3 October 2013.: "3.2 Scratchpad model .. The scratchpad memory uses software to control the location assignment of data."
  2. ^ "The TI-99/4A internal architecture". www.unige.ch. Retrieved 2023-03-08.
  3. ^ J. Lu, K. Bai, A. Shrivastava, "SSDM: Smart Stack Data Management for Software Managed Multicores (SMMs)", Design Automation Conference (DAC), June 2–6, 2013
  4. ^ K. Bai, A. Shrivastava, "Automatic and Efficient Heap Data Management for Limited Local Memory Multicore Architectures", Design Automation and Test in Europe (DATE), 2013
  5. ^ K. Bai, J. Lu, A. Shrivastava, B. Holton, "CMSM: An Efficient and Effective Code Management for Software Managed Multicores", CODES+ISSS, 2013
  6. ^ Patterson, David (September 30, 2009). "The Top 10 Innovations in the New NVIDIA Fermi Architecture, and the Top 3 Next Challenges" (PDF). Parallel Computing Research Laboratory & NVIDIA. Retrieved 3 October 2013.
  7. .

External links