Feature Articles: ICT Design Center: Design and Assessment Work
Efforts to Minimize Human Errors in Network Maintenance
We describe recent efforts to reduce human error in network maintenance work. Through careful observation and analysis of maintenance work conducted according to a procedure manual, we have been able to identify a number of hidden risk factors in the way that work is carried out.
1. Introduction
Providing robust, reliable network facilities is one of NTT Group's most important priorities, and NTT's various divisions and departments strive to provide customers with services that are safe and secure. Network maintenance is particularly important and involves various operations: repairing network facilities that have failed, deploying new ones, and decommissioning ones that are no longer required. Human error in performing these tasks can adversely affect networks, even causing them to crash in the worst case, so efforts to mitigate and prevent errors are extremely important. Here, we introduce the role of NTT's ICT (information and communications technology) Design Center (IDeC) in reducing human error, taking as an example the network maintenance task of decommissioning leased-line networks. Operating companies have already done a fairly good job of reducing human errors by developing procedure manuals for conducting maintenance work, by adopting procedures that incorporate double and triple checks, and by applying many other measures. IDeC wants to cut human errors to the bare minimum and is now working with operating companies to prevent errors by identifying hidden risk factors that still exist in current measures and procedures.
2. Why do human errors occur?
Let us briefly consider why human errors occur in the first place. It is often said that people make mistakes and that mistakes can never be entirely eliminated. So does this mean that errors are inevitable? Studies have found that, far from being unforeseen one-time events, most human errors are similar to errors that have occurred in the past. Nor are they minor inconsequential mistakes: more often than not, human errors are serious blunders made repeatedly by experienced, well-organized people. In other words, most human errors have occurred in the past and are likely to recur in the future. Another fallacy is that errors are avoided by those familiar with the job. Let us begin by sorting out two types of factors that are involved: factors that make human errors more likely to occur and external factors related to multiple people working together.
The types of human errors that are likely to occur and the factors that contribute to them are shown in Fig. 1. The types of error most likely to occur certainly vary with the job, but people are the source of various types of errors. In particular, slip-ups, where an omission or action leads to the wrong conclusion even though the person thinks the procedure has been executed correctly, are commonplace (we have all had the experience of calling someone by the wrong name without realizing it). And when one is faced with a difficult situation that one rarely encounters, mistakes are sometimes made as a result of a faulty hypothesis or wrong understanding of the system structure. This is called a problem-solving error. Errors of this type are more prevalent among novices and become less frequent as a person gains experience and skill.
Errors originate in people's minds, but their incidence and likelihood are increased by various external factors. For example, improper procedures or manuals can lead to errors. Working at the crack of dawn or under other non-ideal conditions, such as when dealing with multiple problems or when pressed for time, can also contribute to errors. Preventing errors from occurring in the human mind in the first place is obviously very difficult, but figuring out what types of factors occur during a particular task is the first step in reducing human errors.
3. Identifying human error risk factors by onsite observation
Here, we consider human errors associated with the task of decommissioning leased-line networks, or more specifically, removing optical network units (ONUs) and other network equipment connected to leased lines that are no longer being used. An error could have a huge impact since removing the wrong device or cable could interrupt other leased-line services, so efforts to prevent errors are extremely important.
Strict procedures have been established for doing this work, and technicians are required to follow them to the letter. If a problem occurs, the technician must stop work immediately and consult a controller back at the maintenance center, who delivers sequential instructions from the procedure manual over the telephone. The controller also monitors leased-line alarms, a system that alerts personnel the instant that a problem occurs (Fig. 2).
With the idea of cutting back on human errors even more, we visited actual leased-line removal sites to observe technicians at work with the procedure manuals to see if we could discover any concealed risk factors using the steps outlined in Fig. 3. First, the leader in charge of orchestrating the work gave us a detailed demonstration of what the work entails at the training facility, while also providing a detailed explanation of why the work needs to be done.
Next, we closely analyzed each work phase using the procedure manual: perception (visual inspection), cognition (mental rehearsal), and action (manual or verbal actions). On the basis of the results of this preliminary assessment, we then considered how the technician's attention varied throughout the procedure, whether the work could be done according to procedure while letting go of the cable, and so on.
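The per-step analysis described above can be pictured with a small sketch. This is only an illustration under our own assumptions, not the assessment tool actually used: the `Step` record, the `divided_attention` helper, and the sample step names are all hypothetical. Each step lists what the technician's eyes and hands must attend to, and a step is flagged when it involves the cable yet also demands eyes or hands elsewhere.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Step:
    """One procedure step (hypothetical record for illustration)."""
    name: str
    eyes: Set[str]   # objects the technician must look at (perception)
    hands: Set[str]  # objects the technician must hold or operate (action)


def divided_attention(steps: List[Step], protected: str = "cable") -> List[str]:
    """Return names of steps that involve the protected object while
    also requiring eyes or hands on something else."""
    flagged = []
    for s in steps:
        involves_protected = protected in (s.eyes | s.hands)
        divided = len(s.eyes) > 1 or len(s.hands) > 1
        if involves_protected and divided:
            flagged.append(s.name)
    return flagged


# Hypothetical sample steps, not taken from the actual procedure manual.
steps = [
    Step("read cable tag to controller", {"cable"}, {"cable"}),
    Step("pull out cable while following manual",
         {"cable", "manual"}, {"cable", "manual"}),
    Step("review next page before starting", {"manual"}, {"manual"}),
]
flagged = divided_attention(steps)
```

In this toy data only the step that combines cable handling with page turning is flagged, which is the kind of point where a technician's eyes may stray or a hand may leave the cable.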
Having gained a good understanding of the significance and purpose of the work through the above steps, we then interviewed the technicians and observed them at work at actual worksites (typically datacenters). In the interviews, the technicians generally just reiterated the standard procedure, but we asked them to picture in their minds how they had actually done the work in the past and to reproduce that actual procedure in as much detail as possible.
Operating companies have already implemented many measures that have sharply reduced the incidence of human error. When we began this project, we expected to face a daunting task and that further reducing the incidence of human errors might be like trying to squeeze water from a stone. It all depended on going back to the starting point of human-centered design and observing from the perspective of the technicians themselves how the work was actually done.
4. Extracted risk factors
There are two aspects of leased-line decommissioning work where human errors can be reduced: (1) in the preparation of the procedure manual that is used simultaneously by the controller and the technician and (2) during the two-way communication over the telephone between the controller and the technician. As one can see in Fig. 4, our study revealed that there are potential risk factors in both of these aspects.
The purpose of having the controller and technician use the same procedure manual is to prevent mistakes and make sure that no procedural steps are skipped. From the interviews, we found that the procedures for this decommissioning task were fairly constant, and since the technicians receive instructions one step at a time from the controller over the telephone, they do not feel that it is necessary to follow their own copy of the manual closely as they work. We also identified certain risks associated with using the procedure manual. For example, technicians must keep their eyes and hands on the cable at all times as they pull it out, but if they are trying to follow along in the manual at the same time, there is a good chance that their eyes will stray or that they will let go of the cable as they turn the page. People cannot consciously focus on more than one thing at a time, so the manual tends to draw attention away from the cable. For technicians, the advantages of using the manual must outweigh the disadvantages. We must come up with a way of using the manual that minimizes the risks while maximizing the advantages.
The significance of the two-way communication routine is illustrated in Fig. 5. First, the technician grasps the cable to be removed and reads the attached cable tag (a small tag that lists the cable number and other information) to the controller, who verifies that the number is correct. In this way, the controller should catch the error even if the technician inadvertently selects the wrong cable. After confirming that the right cable has been selected, the technician marks it with colored tape and eventually pulls it out (sometimes immediately but other times after doing another job somewhere else). The most important point is that the cable to be removed is positively identified. Would making the job into a two-person operation cut down on human errors? One technician could actually pull out the cable and the other could converse with the controller on the phone. However, in the unlikely but entirely possible case that the cable-handling technician inadvertently took hold of the wrong cable, this error could easily go undetected. The whole point of the two-way communication, which aims to reduce errors at the worksite, would be lost by making it into a two-person operation; hence, it would increase the risk.
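The read-back check at the heart of this routine can be sketched in a few lines. This is a minimal illustration under our own assumptions, not the system used at the maintenance center: the `WorkOrder` record, the function names, and the cable numbers are hypothetical. The key property it tries to capture is that the tag being compared is the one on the cable the technician is physically holding.

```python
from dataclasses import dataclass


@dataclass
class WorkOrder:
    """Controller-side record of the cable scheduled for removal (hypothetical)."""
    cable_number: str


def normalize(tag: str) -> str:
    """Ignore case and surrounding whitespace when comparing tag readings."""
    return tag.strip().upper()


def controller_confirms(order: WorkOrder, tag_read_aloud: str) -> bool:
    """Controller checks the tag read out by the technician against the order."""
    return normalize(tag_read_aloud) == normalize(order.cable_number)


def next_action(order: WorkOrder, tag_on_held_cable: str) -> str:
    """Decide what the technician holding the cable should do next."""
    if controller_confirms(order, tag_on_held_cable):
        # Positive identification: mark now, pull out later.
        return "mark with colored tape"
    # Mismatch: the technician may have grasped the wrong cable.
    return "stop and consult controller"


# Hypothetical cable numbers for illustration only.
order = WorkOrder("LL-0042")
ok = next_action(order, " ll-0042 ")
mismatch = next_action(order, "LL-0043")
```

If a second person read the tag while a different person held the cable, the value passed as `tag_on_held_cable` would no longer be guaranteed to come from the held cable, which is the weakness of the two-person arrangement discussed above.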
A clear understanding of the work environment is also important for assessing human error risk factors. For this particular task, the work is much more difficult if cables are densely packed in a confined space; touching another cable involves a risk of adversely affecting its performance. Depending on the cable's position, the technician may have to get down on hands and knees or climb a ladder to do the work, while keeping hands and eyes on the cable. Moreover, it is often hard for technicians to hear and be heard over the noise of air conditioners at datacenters, so information might be conveyed incorrectly over the phone (when people hear speech mixed with noise, they naturally tend to make up words that they think fit the context even when they cannot really hear them, and this is also a risk factor). It is apparent that even when work seems relatively simple for technicians, its cognitive process is complicated and imposes a high mental workload. To prevent errors in environments such as these, personnel must know which parts of the procedures are really important and why.
5. Error mitigation recommendations
On the basis of the above analysis, we proposed a number of design changes to the procedure manual while reinforcing the principal objective of two-way communication. First, regarding the procedure manual, rehearsing procedures in your mind as you approach a task is a useful way to reduce errors, so we proposed changes that enable the technician to get an overview of the task by stealing quick glances at the manual while focusing on the task at hand. Specifically, we proposed
• adding schematic figures at the top of each page showing the entire configuration in addition to the current step addressed on the page,
• adding warnings at key points, such as "Keep your hands on the workpiece at all times!",
• increasing the font size and adding variety on the page,
• adding break points that let technicians catch their breath before tackling riskier procedures, and
• changing the layout so that technicians can fold the manual in half.
We also made the manual easier for controllers to understand and highlighted points that they should be aware of. The operating companies have adopted our recommendations and plan to switch over to technician-friendly manuals in the near future.
We noted that the primary purpose of the two-way communication is to enable the technician doing the actual work to verify that he or she has the right cable without taking his or her hands and eyes off the cable. A reduction in human error is obviously not something that can be achieved immediately, but something that requires repeated effort from various standpoints. The proposals described here can also be incorporated in various well-known approaches to help reduce human errors over the long term.
In this article, we introduced the role of the ICT Design Center (IDeC) in mitigating human error, taking as an example the maintenance task of decommissioning leased-line networks. A number of other divisions are engaged in similar work. While the specific tasks and environments differ, they all involve people performing work. IDeC remains committed to initiatives based on a deeper understanding of the environments that surround people.