Feature Articles: Network Technology for Digital Society of the Future - Toward Advanced, Smart, and Environmentally Friendly Operations
Automatic Generation of Recovery-command Sequences
We describe technology for automatically generating recovery-command sequences. It is intended to support quick recovery actions by system operators and to achieve automatic recovery from failures in information and communication technology (ICT) systems.
Keywords: recovery-command sequence, seq2seq, automation
1. Introduction
In current large-scale information and communication technology (ICT) systems, troubleshooting has become more complicated because the causes of network failures have diversified, and the resulting increase in operational costs has become a serious problem. We are developing technology that automatically generates recovery-command sequences to help system operators recover from failures quickly and, ultimately, to automate recovery operations.
2. Overview of technology
An overview of our technology is shown in Fig. 1. Sequences of recovery commands are estimated by using a sequence-to-sequence (seq2seq) technique, which is a neural-network model that learns the relationship between an input sequence and an output sequence (Fig. 2).
Seq2seq is widely used in translation systems and dialog tasks. In our technology, the input sequence is a series of log identifiers (IDs), which are generated by assigning a unique number to each system log and alarm related to system failures. The output sequence is the recovery-command sequence, represented as a sequence of words. Learning the relationship between the input sequence and the output sequence makes it possible to estimate a command sequence that will restore the system when a new failure occurs.
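The preparation of one training pair described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in our technology: the log strings, the commands, and the helper names `build_log_id_table` and `make_training_pair` are hypothetical, and the logs are assumed to be already normalized so that identical events produce identical strings.

```python
from typing import Dict, List, Tuple


def build_log_id_table(logs: List[str]) -> Dict[str, int]:
    """Assign each distinct log/alarm string a unique integer ID in first-seen order."""
    table: Dict[str, int] = {}
    for log in logs:
        if log not in table:
            # IDs start at 1; 0 is left free (for example, for padding).
            table[log] = len(table) + 1
    return table


def make_training_pair(
    logs: List[str], command_sequence: str, table: Dict[str, int]
) -> Tuple[List[int], List[str]]:
    """Return (input sequence of log IDs, output sequence of command words)."""
    input_ids = [table[log] for log in logs]
    output_words = command_sequence.split()
    return input_ids, output_words


# Hypothetical failure record: logs and alarms observed, and the commands that fixed it.
failure_logs = ["LINK_DOWN eth0", "BGP_PEER_LOST", "LINK_DOWN eth0"]
recovery_commands = "ifdown eth0 ; ifup eth0"

table = build_log_id_table(failure_logs)
pair = make_training_pair(failure_logs, recovery_commands, table)
print(pair)
```

Each such pair would then serve as one training example for the seq2seq model, with the log-ID list as the input sequence and the word list as the output sequence.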
Before a command sequence estimated by this method is executed, it is necessary to assess both the reliability of the estimate and the impact that the command sequence will have on the system. In our technology, we define the reliability of a command sequence as the product of the generation probabilities of the words that compose it. The reliability can thus be regarded as the probability that the system recovers when the obtained command sequence is executed. We define the impact on the system by using information about how system performance was affected when recovery-command sequences were executed during past failures. These two indicators (i.e., reliability and impact) can be used to decide whether to execute the obtained command sequence.
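The reliability calculation and the execution decision can be sketched as follows. This is a minimal illustration under stated assumptions: the per-word probabilities are example values, the impact is assumed to be already summarized as a single score between 0 and 1, and the function names and the thresholds `min_reliability` and `max_impact` are hypothetical rather than part of our technology.

```python
import math
from typing import List


def sequence_reliability(word_probs: List[float]) -> float:
    """Product of the generation probabilities of the words in a command sequence."""
    if not word_probs:
        raise ValueError("word_probs must not be empty")
    if any(p < 0.0 or p > 1.0 for p in word_probs):
        raise ValueError("each probability must lie in [0, 1]")
    if any(p == 0.0 for p in word_probs):
        return 0.0
    # Summing logarithms avoids numerical underflow for long sequences.
    return math.exp(sum(math.log(p) for p in word_probs))


def should_execute(
    reliability: float,
    impact: float,
    min_reliability: float = 0.8,  # hypothetical threshold
    max_impact: float = 0.2,       # hypothetical threshold
) -> bool:
    """Execute only if the estimate is reliable enough and the impact is small enough."""
    return reliability >= min_reliability and impact <= max_impact


# Example generation probabilities for a four-word command sequence (assumed values).
probs = [0.9, 0.95, 0.99, 0.9]
reliability = sequence_reliability(probs)
decision = should_execute(reliability, impact=0.1)
print(f"reliability={reliability:.4f}, execute={decision}")
```

Because the reliability is a product, it decreases as the command sequence grows longer, so a fixed threshold such as the one assumed here treats long sequences more strictly than short ones.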
3. Future work
We will continue to verify our technology by using data obtained from commercial systems and to improve the accuracy of the estimated recovery-command sequences. We will also refine the definitions of reliability and impact from the viewpoint of practical system operation to achieve automated recovery operations.