Using ECC to Correct Soft Errors on the RA Family

Graeme Clark
Principal Product Engineer
Published: June 23, 2023

In this short series of blogs, we’ve looked at several of the more specialized peripherals on the RA microcontroller family that help you build efficient applications and offload basic system tasks from the CPU. Today I’d like to look at something slightly different: how some of the functions available on the RA family can help increase the reliability of your application. Specifically, I’d like to look at the error correction function available on the RA SRAM, which detects and, in many cases, corrects soft errors, as this has been a topic of discussion with a number of customers recently.

So first, let’s look at what a soft error is and how it can affect the operation of a microcontroller. A soft error is typically caused by a high-energy particle, such as a cosmic ray (typically a neutron), striking the device. It can also be caused by other high-energy particles, such as an alpha particle, which typically comes from a radiation source, from radioactive decay in the environment, or even from an impurity in the device packaging.

The impact of the high-energy particle can generate electron/hole pairs in the device substrate, and these act as charge carriers. The carriers reach the drain or gate and can accumulate at the drain, where they can cause a large noise signal. If enough charge accumulates, the result can be a bit inversion, the changing of a “1” to a “0” or a “0” to a “1”. This inversion is the soft error. See the diagram below.

Figure: Radiation particle hit

Today’s microcontrollers are very different from those available even a few years ago. Devices in the RA family can have hundreds of kbytes of on-chip SRAM and are implemented on very advanced process geometries. Even just a few years ago, microcontrollers typically had only a few tens of kbytes of SRAM and were implemented on much larger geometries. In those older devices, with far fewer gates associated with the SRAM, the chance of a soft error was extremely small. In today’s devices, the chance of an error in any individual bit is still small, but with so much SRAM on a chip, the chance of a soft error occurring somewhere in the memory has increased significantly.
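To make the scaling argument concrete, here is a rough Python sketch. It assumes every bit is upset independently with the same small probability over some period of interest. The value of `P_BIT` and the two SRAM sizes are purely illustrative assumptions, not measured figures for any RA device.

```python
import math

def p_at_least_one_upset(bits: int, p_bit: float) -> float:
    """Probability that at least one of `bits` independent cells flips."""
    # 1 - (1 - p)^n, written with log1p/expm1 so tiny probabilities stay accurate.
    return -math.expm1(bits * math.log1p(-p_bit))

P_BIT = 1e-12                          # hypothetical per-bit probability, illustration only
older = p_at_least_one_upset(32 * 1024 * 8, P_BIT)     # 32 kbytes of SRAM
newer = p_at_least_one_upset(512 * 1024 * 8, P_BIT)    # 512 kbytes of SRAM

print(f"32 kbytes:  {older:.3e}")
print(f"512 kbytes: {newer:.3e}")
print(f"ratio:      {newer / older:.2f}")
```

While the per-bit probability is tiny, the chance of an upset somewhere in the array grows almost in direct proportion to the number of bits.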

Figure: Memory map of RA6 MCUs with CM33

If this error occurs in a critical piece of data, it can cause the application to fail. In our RA family, we have implemented a number of schemes to protect against this failure. In the SRAM system, there are two. On one block of the SRAM, we have implemented a parity bit that can detect a single-bit error. On another block of the SRAM, we have implemented an Error Correcting Code (ECC) subsystem, which can detect and correct a single-bit error in each long word of SRAM and can also detect a 2-bit error. This greatly increases the reliability of data held in the ECC SRAM. In the diagram above, you can see the SRAM with ECC highlighted in yellow.
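The difference between the two schemes is easier to see with a small example. The Python sketch below models a single parity bit only. The granularity of the parity check and whether the hardware uses even or odd parity are not stated here, so the 8-bit value and the even-parity convention are assumptions made for illustration.

```python
def parity_bit(word: int) -> int:
    """Even-parity bit: 1 when the word contains an odd number of 1 bits."""
    return bin(word).count("1") & 1

def parity_ok(word: int, stored_parity: int) -> bool:
    """True when the word still matches the parity bit stored with it."""
    return parity_bit(word) == stored_parity

value = 0xA5
p = parity_bit(value)            # stored alongside the data when it is written
one_flip = value ^ 0x01          # one bit upset
two_flips = value ^ 0x03         # two bits upset in the same word

print(parity_ok(value, p))       # intact data passes the check
print(parity_ok(one_flip, p))    # a single-bit error fails the check
print(parity_ok(two_flips, p))   # a double-bit error passes unnoticed
```

Parity can report that something is wrong, but it cannot say which bit flipped, so it cannot correct the data. That is the gap the ECC block fills.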

The ECC system is simple to use. Each 32-bit long word has an associated 7-bit ECC field. When you write a value into this SRAM, the ECC field is automatically calculated and stored alongside the long word, so each long word is represented by 39 bits: the 32-bit data and the 7-bit ECC code.
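The 7-bit figure is consistent with a textbook single-error-correcting, double-error-detecting (SECDED) Hamming code, although the exact code used by the hardware is not described here, so treat that as an assumption. A short Python sketch of the arithmetic:

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits needed by a textbook Hamming SECDED code for `data_bits` of data."""
    r = 1
    # Hamming condition: r check bits must be able to name every bit
    # position (data plus check bits) and also the "no error" case.
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1   # one extra overall-parity bit adds double-error detection

for width in (8, 16, 32, 64):
    print(width, secded_check_bits(width))
```

For a 32-bit word, six Hamming check bits are enough to locate any single flipped bit, and the seventh bit is what allows a 2-bit error to be told apart from a 1-bit error.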

When you read a 32-bit long word from the ECC SRAM area, the chip logic reads both the 32-bit data and the 7-bit ECC code and checks for an error. If it detects a single-bit error, either in the 32-bit data or in the ECC code, it corrects this error on the bus, and the CPU reads the corrected data.
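As an illustration of how a check on read can both locate and correct a flipped bit, here is a minimal Python model of a 39-bit codeword. It uses a textbook extended Hamming layout (check bits at positions 1, 2, 4, 8, 16 and 32, an overall parity bit at position 0). This layout and the names `encode` and `decode` are assumptions for the sketch. The real hardware may use a different code and bit arrangement.

```python
DATA_POS = [p for p in range(1, 39) if p & (p - 1)]   # the 32 non-power-of-two positions

def _syndrome(cw: int) -> int:
    """XOR of the positions (1..38) of all set bits; 0 for a valid codeword."""
    syn = 0
    for pos in range(1, 39):
        if (cw >> pos) & 1:
            syn ^= pos
    return syn

def encode(data: int) -> int:
    """Build a 39-bit codeword from a 32-bit value."""
    cw = 0
    for i, pos in enumerate(DATA_POS):
        cw |= ((data >> i) & 1) << pos
    syn = _syndrome(cw)
    for k in range(6):                       # check bits at positions 1, 2, 4, ..., 32
        cw |= ((syn >> k) & 1) << (1 << k)
    cw |= bin(cw).count("1") & 1             # bit 0: overall parity, total becomes even
    return cw

def decode(cw: int):
    """Return (data, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    syn = _syndrome(cw)
    odd = bin(cw).count("1") & 1
    if odd and syn < 39:
        cw ^= 1 << syn                       # single-bit error; syn == 0 means bit 0 itself
        status = "corrected"
    elif odd or syn:
        status = "uncorrectable"             # even parity with a non-zero syndrome: two flips
    else:
        status = "ok"
    data = 0
    for i, pos in enumerate(DATA_POS):
        data |= ((cw >> pos) & 1) << i
    return data, status

word = 0xDEADBEEF
stored = encode(word)
print(decode(stored))                        # intact word
print(decode(stored ^ (1 << 17)))            # one flipped bit, corrected on read
print(decode(stored ^ 0b110))                # two flipped bits, detected only
```

Note that `decode` returns corrected data but leaves `stored` untouched, which mirrors correction happening on the bus rather than in the memory cells.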

It’s important to understand that this system does not correct the underlying 32-bit data held in the SRAM.

The ECC system will set a flag to indicate a single-bit error has occurred, and you also have the choice to generate an interrupt or even generate a reset if you are concerned this could indicate a system issue.

We have chosen not to automatically write the corrected value back to SRAM, as the write-back takes extra time and some designers don’t want their cycle timing disturbed during operation. Instead, we leave the decision to write back or not under software control, which we believe is more flexible.

Typically, an application will use the interrupt generated when a single-bit error is detected to read the affected long word again and then write it back to the same address. Writing the long word back corrects the error. This works even if the error is in the ECC code, because a new ECC code is calculated from the correct 32-bit data.
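The read-then-write-back sequence can be sketched with a small behavioral model in Python. This is not device code: the class `EccSramModel`, its flags and the handler name are hypothetical, the model tracks upsets as a set of flipped bit positions instead of computing a real ECC code, and it assumes the handler already knows the address of the affected word.

```python
class EccSramModel:
    """Behavioral model only: upsets are tracked as flipped bit positions per word."""

    def __init__(self, words: int):
        self.value = [0] * words                      # last value written to each word
        self.flips = [set() for _ in range(words)]    # bits 0..38 upset since that write
        self.single_bit_flag = False
        self.double_bit_flag = False

    def write(self, addr: int, data: int) -> None:
        self.value[addr] = data & 0xFFFFFFFF
        self.flips[addr].clear()                      # fresh data and a fresh ECC code

    def upset(self, addr: int, bit: int) -> None:
        self.flips[addr] ^= {bit}                     # a particle hit toggles one stored bit

    def read(self, addr: int) -> int:
        hits = self.flips[addr]
        data = self.value[addr]
        if len(hits) == 1:
            self.single_bit_flag = True               # corrected on the bus, cells unchanged
        elif len(hits) >= 2:
            self.double_bit_flag = True               # modeled as detected, not correctable
            for bit in hits:
                if bit < 32:                          # bits 32..38 stand for the ECC field
                    data ^= 1 << bit
        return data

def scrub_on_single_bit_error(ram: EccSramModel, addr: int) -> None:
    """What an interrupt handler might do: read the word again, then write it back."""
    corrected = ram.read(addr)
    ram.write(addr, corrected)
    ram.single_bit_flag = False

ram = EccSramModel(words=2)
ram.write(0, 0x12345678)
ram.write(1, 0x12345678)
for addr in (0, 1):
    ram.upset(addr, bit=5)                            # first hit on both words
    ram.read(addr)
scrub_on_single_bit_error(ram, 0)                     # only word 0 is scrubbed
for addr in (0, 1):
    ram.upset(addr, bit=9)                            # second hit on both words
print(hex(ram.read(0)), ram.double_bit_flag)          # scrubbed word is still correctable
print(hex(ram.read(1)), ram.double_bit_flag)          # unscrubbed word now holds two errors
```

The second half of the example shows why the write-back matters: a corrected read leaves the flipped bit in place, so a later upset in the same long word turns a correctable error into an uncorrectable one.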

If the system detects a two-bit error, it can’t be corrected, but an interrupt or a reset can be generated to alert the application that the data in this long word is bad. The application must then handle the bad data or reset the RA microcontroller.

In many safety-critical systems, it’s important to be able to test that the safety systems are operating correctly, and the ECC SRAM is no exception. The RA family implements a bypass mode that allows you to generate an ECC error by writing directly to the ECC bits, which can only be accessed in this bypass mode.

Like other important functions in all RA microcontrollers, access to these special, critical functions is protected by protection registers which restrict access to registers controlling these functions.

For simplicity, I have not covered the operation of the ECC function on any specific device in this blog, so please check the hardware manual of the RA device you are using for a detailed description. I also have not covered the impact of TrustZone on this feature. Typically, however, access to these registers is only allowed from secure applications.

One thing to remember when using the ECC SRAM: when you power up or reset a device, the SRAM contents are undefined. Always write data into the ECC SRAM before using it, and never read the memory before it’s initialized, because the undefined data and ECC bits are unlikely to match and the read can cause an ECC error.

You can find out more about the RA family, and specifically the operation of our SRAM ECC function, on our website at www.renesas.com/RA
