Soft error
This article needs additional citations for verification. (November 2011) |
In
In a computer's memory system, a soft error changes an instruction in a program or a data value. Soft errors typically can be remedied by
There are two types of soft errors, chip-level soft error and system-level soft error. Chip-level soft errors occur when particles hit the chip, e.g., when secondary particles from cosmic rays land on the silicon die. If a particle with certain properties hits a memory cell it can cause the cell to change state to a different value. The atomic reaction in this example is so tiny that it does not damage the physical structure of the chip. System-level soft errors occur when the data being processed is hit with a noise phenomenon, typically when the data is on a data bus. The computer tries to interpret the noise as a data bit, which can cause errors in addressing or processing program code. The bad data bit can even be saved in memory and cause problems at a later time.
If detected, a soft error may be corrected by rewriting correct data in place of erroneous data. Highly reliable systems use
Critical charge
Whether or not a circuit experiences a soft error depends on the energy of the incoming particle, the geometry of the impact, the location of the strike, and the design of the logic circuit. Logic circuits with higher capacitance and higher logic voltages are less likely to suffer an error. This combination of capacitance and voltage is described by the critical charge parameter, Qcrit, the minimum electron charge disturbance needed to change the logic level. A higher Qcrit means fewer soft errors. Unfortunately, a higher Qcrit also means a slower logic gate and a higher power dissipation. Reduction in chip feature size and supply voltage, desirable for many reasons, decreases Qcrit. Thus, the importance of soft errors increases as chip technology advances.
In a logic circuit, Qcrit is defined as the minimum amount of induced charge required at a circuit node to cause a voltage pulse to propagate from that node to the output and be of sufficient duration and magnitude to be reliably latched. Since a logic circuit contains many nodes that may be struck, and each node may be of unique capacitance and distance from output, Qcrit is typically characterized on a per-node basis.
Causes of soft errors
Alpha particles from package decay
Soft errors became widely known with the introduction of
Package radioactive decay usually causes a soft error by
A 2011 Black Hat paper discusses the real-life security implications of such bit-flips in the Internet's Domain Name System. The paper found up to 3,434 incorrect requests per day due to bit-flip changes for various common domains. Many of these bit-flips would probably be attributable to hardware problems, but some could be attributed to alpha particles.[1] These bit-flip errors may be taken advantage of by malicious actors in the form of bitsquatting.
Isaac Asimov received a letter congratulating him on an accidental prediction of alpha-particle RAM errors in a 1950s novel.[2]
Cosmic rays creating energetic neutrons and protons
Once the electronics industry had determined how to control package contaminants, it became clear that other causes were also at work. James F. Ziegler led a program of work at IBM which culminated in the publication of a number of papers (Ziegler and Lanford, 1979) demonstrating that cosmic rays also could cause soft errors. Indeed, in modern devices, cosmic rays may be the predominant cause. Although the primary particle of the cosmic ray does not generally reach the Earth's surface, it creates a shower of energetic secondary particles. At the Earth's surface approximately 95% of the particles capable of causing soft errors are energetic neutrons with the remainder composed of protons and pions.[3] IBM estimated in 1996 that one error per month per 256
Cosmic ray flux depends on altitude. For the common reference location of 40.7° N, 74° W at sea level (New York City, NY, USA), the flux is approximately 14 neutrons/cm2/hour. Burying a system in a cave reduces the rate of cosmic-ray-induced soft errors to a negligible level. In the lower levels of the atmosphere, the flux increases by a factor of about 2.2 for every 1000 m (1.3 for every 1000 ft) increase in altitude above sea level. Computers operated on top of mountains experience an order of magnitude higher rate of soft errors compared to sea level. The rate of upsets in aircraft may be more than 300 times the sea level upset rate. This is in contrast to package decay-induced soft errors, which do not change with location.[5] As chip density increases, Intel expects the errors caused by cosmic rays to increase and become a limiting factor in design.[4]
The average rate of cosmic-ray soft errors is inversely proportional to sunspot activity. That is, the average number of cosmic-ray soft errors decreases during the active portion of the
One experiment measured the soft error rate at the sea level to be 5,950
Energetic neutrons produced by cosmic rays may lose most of their kinetic energy and reach thermal equilibrium with their surroundings as they are scattered by materials. The resulting neutrons are simply referred to as
Thermal neutrons
Neutrons that have lost kinetic energy until they are in thermal equilibrium with their surroundings are an important cause of soft errors for some circuits. At low energies many
Boron has been used in BPSG, the insulator in the interconnection layers of integrated circuits, particularly in the lowest one. The inclusion of boron lowers the melt temperature of the glass providing better reflow and planarization characteristics. In this application the glass is formulated with a boron content of 4% to 5% by weight. Naturally occurring boron is 20% 10B with the remainder the 11B isotope. Soft errors are caused by the high level of 10B in this critical lower layer of some older integrated circuit processes. Boron-11, used at low concentrations as a p-type dopant, does not contribute to soft errors. Integrated circuit manufacturers eliminated borated dielectrics by the time individual circuit components decreased in size to 150 nm, largely due to this problem.
In critical designs, depleted boron—consisting almost entirely of boron-11—is used, to avoid this effect and therefore to reduce the soft error rate. Boron-11 is a by-product of the nuclear industry.
For applications in medical electronic devices this soft error mechanism may be extremely important. Neutrons are produced during high-energy cancer radiation therapy using photon beam energies above 10 MeV. These neutrons are moderated as they are scattered from the equipment and walls in the treatment room resulting in a thermal neutron flux that is about 40 × 106 higher than the normal environmental neutron flux. This high thermal neutron flux will generally result in a very high rate of soft errors and consequent circuit upset.[8][9]
Other causes
Soft errors can also be caused by
Some tests conclude that the isolation of
Designing around soft errors
Soft error mitigation
A designer can attempt to minimize the rate of soft errors by judicious device design, choosing the right semiconductor, package and substrate materials, and the right device geometry. Often, however, this is limited by the need to reduce device size and voltage, to increase operating speed and to reduce power dissipation. The susceptibility of devices to upsets is described in the industry using the JEDEC JESD-89 standard.
One technique that can be used to reduce the soft error rate in digital circuits is called radiation hardening. This involves increasing the capacitance at selected circuit nodes in order to increase its effective Qcrit value. This reduces the range of particle energies to which the logic value of the node can be upset. Radiation hardening is often accomplished by increasing the size of transistors who share a drain/source region at the node. Since the area and power overhead of radiation hardening can be restrictive to design, the technique is often applied selectively to nodes which are predicted to have the highest probability of resulting in soft errors if struck. Tools and models that can predict which nodes are most vulnerable are the subject of past and current research in the area of soft errors.
Detecting soft errors
There has been work addressing soft errors in processor and memory resources using both hardware and software techniques. Several research efforts addressed soft errors by proposing error detection and recovery via hardware-based redundant multi-threading.[13][14][15] These approaches used special hardware to replicate an application execution to identify errors in the output, which increased hardware design complexity and cost including high performance overhead. Software-based soft error tolerant schemes, on the other hand, are flexible and can be applied on commercial off-the-shelf microprocessors. Many works propose compiler-level instruction replication and result checking for soft error detection. [16][17] [18]
Correcting soft errors
Designers can choose to accept that soft errors will occur, and design systems with appropriate error detection and correction to recover gracefully. Typically, a semiconductor memory design might use
Soft errors in
Traditionally,
The design of error detection and correction circuits is helped by the fact that soft errors usually are localised to a very small area of a chip. Usually, only one cell of a memory is affected, although high energy events can cause a multi-cell upset. Conventional memory layout usually places one bit of many different correction words adjacent on a chip. So, even a multi-cell upset leads to only a number of separate
Soft errors in combinational logic
The three natural masking effects in combinational logic that determine whether a
If all three masking effects fail to occur, the propagated pulse becomes latched and the output of the logic circuit will be an erroneous value. In the context of circuit operation, this erroneous output value may be considered a soft error event. However, from a microarchitectural-level standpoint, the affected result may not change the output of the currently-executing program. For instance, the erroneous data could be overwritten before use, masked in subsequent logic operations, or simply never be used. If erroneous data does not affect the output of a program, it is considered to be an example of microarchitectural masking.
Soft error rate
Soft error rate (SER) is the rate at which a device or system encounters or is predicted to encounter soft errors. It is typically expressed as either the number of failures-in-time (FIT) or mean time between failures (MTBF). The unit adopted for quantifying failures in time is called FIT, which is equivalent to one error per billion hours of device operation. MTBF is usually given in years of device operation; to put it into perspective, one FIT equals to approximately 1,000,000,000 / (24 × 365.25) = 114,077 times longer between errors than one-year MTBF.
While many electronic systems have an MTBF that exceeds the expected lifetime of the circuit, the SER may still be unacceptable to the manufacturer or customer. For instance, many failures per million circuits due to soft errors can be expected in the field if the system does not have adequate soft error protection. The failure of even a few products in the field, particularly if catastrophic, can tarnish the reputation of the product and company that designed it. Also, in safety- or cost-critical applications where the cost of system failure far outweighs the cost of the system itself, a 1% risk of soft error failure per lifetime may be too high to be acceptable to the customer. Therefore, it is advantageous to design for low SER when manufacturing a system in high-volume or requiring extremely high reliability.
See also
- Single event upset(SEU)
- Glitch
- Don't care
- Logic hazard
References
- ^ Artem Dinaburg (July 2011). "Bitsquatting - DNS Hijacking without Exploitation" (PDF). Archived from the original (PDF) on 2018-06-11. Retrieved 2011-12-26.
- Gold(1995): "This letter is to inform you and congratulate you on another remarkable scientific prediction of the future; namely your foreseeing of the dynamic random-access memory (DRAM) logic upset problem caused by alpha particle emission, first observed in 1977, but written about by you in Caves of Steel in 1957." [Note: Actually, 1952.] ... "These failures are caused by trace amounts of radioactive elements present in the packaging material used to encapsulate the silicon devices ... in your book, Caves of Steel, published in the 1950s, you use an alpha particle emitter to 'murder' one of the robots in the story, by destroying ('randomizing') its positronic brain. This is, of course, as good a way of describing a logic upset as any I've heard ... our millions of dollars of research, culminating in several international awards for the most important scientific contribution in the field of reliability of semiconductor devices in 1978 and 1979, was predicted in substantially accurate form twenty years [Note: twenty-five years, actually] before the events took place
- ^
Ziegler, J. F. (January 1996). "Terrestrial cosmic rays". ISSN 0018-8646.
- ^ a b Simonite, Tom (March 2008). "Should every computer chip have a cosmic ray detector?". New Scientist. Archived from the original on 2011-12-02. Retrieved 2019-11-26.
- S2CID 9573484.
- ^ Dell, Timothy J. (1997). "A White Paper on the Benefits of Chipkill-Correct ECC for PC Server Main Memory" (PDF). ece.umd.edu. p. 13. Retrieved 2021-11-03.
- S2CID 110078856.
- S2CID 20789261.
- ^ Franco, L., Gómez, F., Iglesias, A., Pardo, J., Pazos, A., Pena, J., Zapata, M., SEUs on commercial SRAM induced by low energy neutrons produced at a clinical linac facility, RADECS Proceedings, September 2005
- S2CID 14464953.
- IEEE. Retrieved 2015-03-10.
- ^ Goodin, Dan (2015-03-10). "Cutting-edge hack gives super user status by exploiting DRAM weakness". Ars Technica. Retrieved 2015-03-10.
- ISSN 0163-5964.
- S2CID 1909214.
- S2CID 2270600.
- .
- )
- )
Further reading
- Ziegler, J. F.; Lanford, W. A. (1979). "Effect of Cosmic Rays on Computer Memories". Science. 206 (4420): 776–788. S2CID 2000982.
- Mukherjee, S., "Architecture Design for Soft Errors," Elsevier, Inc., February 2008.
- Mukherjee, S., "Computer Glitches from Soft Errors: A Problem with Multiple Solutions," Microprocessor Report, 19 May 2008.
External links
- Soft Errors in Electronic Memory - A White Paper - A good summary paper with many references - Tezzaron January 2004. Concludes that 1000–5000 FIT per Mbit (0.2–1 error per day per Gbyte) is a typical DRAM soft error rate.
- Benefits of Chipkill-Correct ECC for PC Server Main Memory - A 1997 discussion of SDRAM reliability - some interesting information on "soft errors" from Error-correcting codeschemes
- Soft errors' impact on system reliability - Ritesh Mastipuram and Edwin C. Wee, Cypress Semiconductor, 2004
- Scaling and Technology Issues for Soft Error Rates - A Johnston - 4th Annual Research Conference on Reliability Stanford University, October 2000
- Evaluation of LSI Soft Errors Induced by Terrestrial Cosmic rays and Alpha Particles - H. Kobayashi, K. Shiraishi, H. Tsuchiya, H. Usuki (all of Sony), and Y. Nagai, K. Takahisa (Osaka University), 2001.
- SELSE Workshop Website - Website for the workshop on the System Effects of Logic Soft Errors