Vol. 18, No. 12, pp. 97–102, Dec. 2020. https://doi.org/10.53829/ntr202012ra1
Soft Error Countermeasure for 10G-EPON ONU
We developed a soft error countermeasure for the 10 Gigabit Ethernet passive optical network (10G-EPON) optical network unit (ONU). This countermeasure allows an ONU to detect a soft error and reboot by autonomously turning its power off and on. This feature reduces the number of inquiries caused by soft error failures in 10G-EPON ONUs. It also reduces the number of operations required to respond to failures in telecommunication services.
Keywords: ONU, soft error, reboot
1. Overview of the 10G-EPON system
The passive optical network (PON) system is a fiber-to-the-home optical access system that uses an optical splitter to branch optical signals from a single optical fiber shared by multiple users. The 10 Gigabit Ethernet PON (10G-EPON) system is a PON system with a maximum transmission rate of 10 Gbit/s. This system consists of an optical line terminal (OLT) installed at the central office, optical network units (ONUs) installed in each user’s home, an optical splitter, and the optical fiber network connecting them (Figs. 1 and 2).
2. Soft error in the ONU
A soft error is an event in which a bit in memory is inverted by electrical noise. Operation failures caused by soft errors can be recovered by resetting the electronic device or overwriting the data. Soft errors are primarily caused by cosmic rays (high-energy protons). Cosmic rays collide with atomic nuclei (nitrogen and oxygen) in the atmosphere to generate neutron rays, which then collide with silicon atomic nuclei in semiconductor devices, generating electrical noise (Fig. 3).
If ONU operation fails due to a soft error, network service may be disconnected. For example, electrical noise generated in a semiconductor device can invert a bit in the main-signal-path control function stored in the volatile memory (random access memory: RAM) of the PON-Media Access Control (MAC) large-scale integrated circuit. The main signal is then disrupted, resulting in communication interruption. In this case, rebooting the ONU (turning the power off and on) refreshes the information or program in RAM from nonvolatile memory (read only memory: ROM), which recovers communication. When such a soft error failure occurs, the ONU must therefore be restarted by turning it off and on at the user’s home.
Soft errors occur very infrequently in semiconductor devices, making it difficult to isolate and identify faults. For ONUs that are distributed and deployed in large numbers, it is important to improve the efficiency of responding to soft error failure.
Further advances in semiconductor miniaturization are inevitable given the demand for 10 Gbit/s-class high-speed communication. Because miniaturized semiconductor devices hold less charge in each memory cell, they are more susceptible to neutron rays. Therefore, it is assumed that the occurrence rate of soft errors and the frequency of troubleshooting will increase.
Error correction for soft errors includes (1) autonomous correction by using hardware, (2) autonomous correction by using a device-control program, and (3) correction by manual intervention. (1) Autonomous correction by using hardware relies on components with functions such as error-correcting code (ECC) correction. (2) Autonomous correction by using a device-control program includes methods such as system reset and device reset. System reset requires no special control and behaves just like a power off/on restart. Device reset resets only the target device (component), which shortens the correction time but requires coordination with peripheral circuits for state matching, resulting in complex control. (3) Manual correction can be done through a remote-control reset by the maintainer or by instructing the user to turn the power off and on. Either manual approach takes a long time to correct the problem and requires the telecommunication company to respond.
For a 10G-EPON ONU, we developed a function (autonomous reset function) that corresponds to the system reset in (2); the ONU detects a soft error and restarts itself by turning its power off and on autonomously. Compared with (1) and the device reset in (2), this function is economical and easy to implement, and compared with (3), it reduces the amount of manual operation for error correction.
3. Soft error detection and autonomous reset-target classification
Figure 4 shows the functional blocks of an ONU. An uplink signal input from a user network interface (UNI) connected to user equipment, etc. is received by the physical layer (PHY). The PON-MAC processing unit then executes priority control and transfer processing. If the encryption unit and forward error correction (FEC) unit are enabled, the signal is encrypted and an error correction code is added to each block before the optical module transmits it to the OLT. In the downlink direction, the signal input from the OLT through the PON interface (IF) is received by the optical module, corrected by the FEC unit, decrypted by the encryption unit, subjected to priority control and transfer processing by the PON-MAC processing unit, and transmitted from the PHY to the user equipment.
Soft errors are first detected in the frame-buffer area of the RAM of each functional block on the main signal path, through which the uplink and downlink signals described above pass. Most detected soft errors are bit errors within a single frame in the frame-buffer area; the affected frame is discarded after detection, so there is little effect on communication. In rare cases, however, similar soft errors are detected continuously in the frame-buffer areas of the PHY, PON-MAC processing unit, encryption unit, and FEC unit. Because such errors are continuous and extend over multiple frames, the continuity of the main signal in the ONU is assumed to be impaired. The ONU therefore determines that a soft error event has occurred and triggers the autonomous reset function.
Our countermeasure also detects errors in the RAM of the central processing unit (CPU) and in the set-value storage area for priority control and transfer processing in the RAM of the PON-MAC processing unit. If a soft error is detected in the RAM of the CPU, it is assumed that CPU processing is abnormal. If a soft error is detected in the transfer-control-setting table of the PON-MAC processing unit, it is assumed that there is a problem with the main signal transfer to the OLT. Both cases are therefore judged to be soft error events, and each triggers the autonomous reset function.
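The classification described in this section can be sketched as follows. This is a minimal illustration and not the ONU firmware: the names `Region`, `Action`, and `SoftErrorClassifier` are hypothetical, and the value of `CONSECUTIVE_FRAME_THRESHOLD` is an assumption, since the article does not state how many errored frames in a row count as "continuous".

```python
from dataclasses import dataclass
from enum import Enum, auto


class Region(Enum):
    FRAME_BUFFER = auto()    # frame-buffer RAM on the main signal path
    CPU_RAM = auto()         # RAM of the CPU
    TRANSFER_TABLE = auto()  # PON-MAC transfer-control-setting table


class Action(Enum):
    DISCARD_FRAME = auto()
    AUTONOMOUS_RESET = auto()


# Assumed value: the article gives no figure for "continuous" errors.
CONSECUTIVE_FRAME_THRESHOLD = 3


@dataclass
class SoftErrorClassifier:
    consecutive_frame_errors: int = 0

    def on_frame_ok(self) -> None:
        """A frame passed through the buffer with no detected error."""
        self.consecutive_frame_errors = 0

    def on_error(self, region: Region) -> Action:
        """Decide how to react to one detected soft error."""
        if region is not Region.FRAME_BUFFER:
            # CPU RAM or transfer-table errors always trigger a reset.
            return Action.AUTONOMOUS_RESET
        self.consecutive_frame_errors += 1
        if self.consecutive_frame_errors >= CONSECUTIVE_FRAME_THRESHOLD:
            # Errors spanning multiple frames: main signal is impaired.
            return Action.AUTONOMOUS_RESET
        # Isolated frame-buffer error: drop the frame and carry on.
        return Action.DISCARD_FRAME
```

In this sketch an isolated frame-buffer error only costs one frame, whereas a run of errored frames or any error in the CPU RAM or transfer table is escalated to an autonomous reset.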
4. Autonomous reset function as a soft error countermeasure
Figure 5 shows the transition flow of the autonomous reset function. A semiconductor device affected by a neutron beam may suffer a physical defect as well as a soft error. A physical defect is an event in which the structure of the semiconductor device degrades and malfunctions due to the effect of neutron beams; such failures (physical-defect failures) cannot be recovered by restarting the semiconductor device or overwriting data. With such a failure, the autonomous reset function could turn the power supply off and on repeatedly without ever recovering.
Our solution is a mechanism that records the number of resets; when that number exceeds a threshold within a certain period, the ONU determines that a physical-defect failure has occurred and stops both its communication function and the autonomous reset. Specifically, reset counter i is provided to record the number of resets in the ONU. After the ONU is started, reset counter i is initialized to 0, and each autonomous reset increments the counter value by 1. When autonomous reset occurs repeatedly, reset counter i increases, and if the counter value exceeds the specified threshold, the device transitions to a failed state. This prevents endless repetition of autonomous reset and enables soft error failure to be distinguished from physical-defect failure.
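A minimal sketch of this counter logic is given below. It rests on several assumptions that the article does not confirm: the class name `ResetGuard` and its fields are hypothetical, the threshold and the length of the "certain period" are free parameters, the period is modeled as a fixed window that restarts once it has elapsed, and the counter is assumed to be held somewhere that survives an autonomous power cycle.

```python
from dataclasses import dataclass
from enum import Enum, auto


class State(Enum):
    RUNNING = auto()
    FAILED = auto()  # treated as a physical-defect failure


@dataclass
class ResetGuard:
    threshold: int             # assumed parameter; no value is given
    window_s: float            # assumed length of the "certain period"
    counter: int = 0           # reset counter i, 0 after start-up
    window_start: float = 0.0  # time at which the current window began
    state: State = State.RUNNING

    def on_soft_error_event(self, now_s: float) -> State:
        """Handle one soft error event that calls for an autonomous reset."""
        if self.state is State.FAILED:
            return self.state          # no further resets once failed
        if now_s - self.window_start > self.window_s:
            # The period has elapsed, so counting starts afresh.
            self.counter = 0
            self.window_start = now_s
        self.counter += 1              # each autonomous reset adds 1
        if self.counter > self.threshold:
            # Too many resets in the window: stop communication and resets.
            self.state = State.FAILED
        return self.state
```

With `ResetGuard(threshold=3, window_s=60.0)`, three events inside the window leave the unit running and a fourth moves it to the failed state, which is the point at which a real device would stop power-cycling.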
5. Results of the neutron irradiation test of our soft error countermeasure
Neutron irradiation tests are generally conducted to reproduce soft errors. To measure the effectiveness of our soft error countermeasure, we subjected 10G-EPON ONUs running the autonomous reset function to the neutron irradiation test provided by NTT Advanced Technology based on the Telecommunication Technology Committee (TTC) standard JT-K 130.
In the neutron irradiation test, a beryllium target is irradiated with protons accelerated by a cyclotron to generate neutron beams, and several 10G-EPON ONUs are irradiated simultaneously. This test uses a very high neutron intensity, corresponding to about 100 million times the natural exposure rate (Fig. 6).
A main signal was connected to the 10G-EPON ONUs using an OLT and traffic-generator analyzer. Because neutron beams could destroy log files during the test, we collected the logs inside the 10G-EPON ONUs in real time and measured all autonomous resets. Through this test, the 10G-EPON ONUs were exposed to neutron beams equivalent to about 110,000 years of natural exposure per unit on average. Due to the irradiation, soft errors were detected 1150 times per unit on average; among them, errors in the frame-buffer area that were not subject to autonomous reset were confirmed 1104 times per unit, and fault recovery by autonomous reset was confirmed 46 times per unit.
The autonomous reset function worked as expected, confirming that an ONU exhibiting a soft error fault can recover autonomously and that the function is effective as a soft error countermeasure. If we assume that 10 million units will be deployed, we can expect to reduce the number of failures per year by about 4200.
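The figure of about 4200 can be reproduced from the test results with a short calculation. The sketch below uses only numbers reported in the article; the variable names are illustrative, and it assumes, as one reading of the estimate, that each recovery by autonomous reset would otherwise have been one failure and that the rate scales linearly with exposure time and number of units.

```python
# Figures reported in the article (per unit, averaged over the test).
EQUIVALENT_YEARS = 110_000   # natural-exposure equivalent of the test
DETECTED = 1150              # soft errors detected
FRAME_BUFFER_ONLY = 1104     # frame-buffer errors, no reset needed
RECOVERED_BY_RESET = 46      # faults recovered by autonomous reset
DEPLOYED_UNITS = 10_000_000  # deployment assumed in the article

# The two categories account for every detected soft error.
assert FRAME_BUFFER_ONLY + RECOVERED_BY_RESET == DETECTED

rate_per_unit_year = RECOVERED_BY_RESET / EQUIVALENT_YEARS
failures_avoided_per_year = rate_per_unit_year * DEPLOYED_UNITS
reset_share = RECOVERED_BY_RESET / DETECTED

print(f"{rate_per_unit_year:.2e} recoveries per unit per year")  # 4.18e-04
print(f"{failures_avoided_per_year:.0f} failures avoided per year")  # 4182
print(f"{reset_share:.1%} of detected soft errors needed a reset")  # 4.0%
```

The result of roughly 4182 per year rounds to the stated value of about 4200, and only about 4% of the detected soft errors required a reset at all.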
6. Future plans
We developed a countermeasure against soft errors, which are expected to become more frequent with the adoption of 10G-EPON ONUs. This countermeasure will contribute to the reduction in ONU failures and in manual operations required by the user or carrier personnel. In the future, we will examine the application of this countermeasure to other types of ONUs.