Feature Articles: Toward More Robust Networks

Vol. 21, No. 12, pp. 24–26, Dec. 2023. https://doi.org/10.53829/ntr202312fa2

Technologies for Promptly Understanding Network Conditions When Large-scale System Failure Occurs

Kazuaki Akashi and Shunsuke Kanai

Abstract

NTT laboratories are developing enhanced operational technologies to make communication networks robust. In this article, we introduce NTT research and development initiatives to promptly understand the network conditions and the impact of failures on complex network services when a large-scale system failure occurs.

Keywords: communication network, fault management, large-scale failure

1. Introduction

NTT laboratories are researching and developing operational technologies to make communication networks robust. In this article, we introduce three such technologies: the Network Operation Injected Model (NOIM), which centrally manages networks composed of various communication protocols and enables early detection of the impact of failures on network services; DeAnoS™ (Deep Anomaly Surveillance), which detects anomalies; and Konan (Knowledge-based Autonomous Failure-Event Analysis Technology for Networks), which estimates the cause of failures by using pre-learned rules.

2. Network-resource-management technology

Network services are provided by multilayer networks that combine different communication protocols, such as optical-transmission networks, Ethernet networks, and Internet Protocol (IP) networks. Because these networks are typically managed by separate systems, the effects of a network failure at one layer on networks and services at other layers must be analyzed manually. Consequently, when a large-scale failure occurs, it takes a long time to understand the network status and service impact because the failure locations and their effects are widespread.

We developed the NOIM to enable early understanding of service impact [1]. The NOIM enables central management of complex multilayer networks and quick understanding of the impact of failures on services by expressing network information in a common data model whose elements, such as termination points and forwarding relationships, do not depend on the layer.

The NOIM uses a common data model that is based on the Shared Information/Data Model (SID) discussed in the TM Forum [2] to achieve network-resource management. It uses the entities defined as the Physical Resource and Logical Resource in the SID. The SID Physical Resource defines entities, i.e., PhysicalDevice, PhysicalLink, PhysicalStructure, and AggregationStructure, that represent physical resources, i.e., routers, cables, buildings, and pipelines, respectively. Similarly, the SID Logical Resource defines entities, i.e., TerminationPointEncapsulation, NetworkForwardingDomain, and ForwardingRelationshipEncapsulation, that represent the logical resources of a network, i.e., termination points, domains, and forwarding relationships in each layer, respectively. By using these entities to represent each resource and the connection relationships between resources, as shown in Fig. 1, the NOIM makes it possible to manage multilayer networks composed of various communication protocols.


Fig. 1. Multilayer network-resource management with common data model.

The NOIM also provides a mechanism for externally defining characteristics specific to each communication protocol and managing them by linking them to the above generic entities. Example characteristics are IP addresses in IP networks and VLAN IDs (virtual local area network identifiers) in Ethernet networks. Common characteristics that do not depend on the communication protocol are represented by entities, and unique characteristics can be added as external definitions, allowing flexibility when adding or changing communication protocols and services.
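The separation between generic entities and externally defined characteristics can be illustrated with a minimal Python sketch. It is not the NOIM implementation: the class `TerminationPoint`, the registry `CHARACTERISTIC_SPECS`, and the characteristic names (`ip_address`, `vlan_id`, `wavelength_nm`) are hypothetical names chosen for illustration. The sketch only shows how a protocol can be added by extending an external definition while the generic entity stays unchanged.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

# Hypothetical external definitions: the protocol-specific characteristics
# that may be attached to a generic entity in each layer.
CHARACTERISTIC_SPECS: Dict[str, Dict[str, type]] = {
    "ip": {"ip_address": str},
    "ethernet": {"vlan_id": int},
}


@dataclass
class TerminationPoint:
    """Generic, layer-independent entity (loosely modeled on the idea of
    TerminationPointEncapsulation; the fields are assumptions)."""
    entity_id: str
    layer: str
    characteristics: Dict[str, Any] = field(default_factory=dict)

    def set_characteristic(self, name: str, value: Any) -> None:
        """Attach a characteristic only if the external definition allows it."""
        spec = CHARACTERISTIC_SPECS.get(self.layer, {})
        if name not in spec:
            raise KeyError(f"{name!r} is not defined for layer {self.layer!r}")
        if not isinstance(value, spec[name]):
            raise TypeError(f"{name!r} must be of type {spec[name].__name__}")
        self.characteristics[name] = value


ip_tp = TerminationPoint("tp-1", "ip")
ip_tp.set_characteristic("ip_address", "192.0.2.1")

eth_tp = TerminationPoint("tp-2", "ethernet")
eth_tp.set_characteristic("vlan_id", 100)

# Supporting another protocol only needs a new external definition;
# the TerminationPoint class itself is not modified.
CHARACTERISTIC_SPECS["optical"] = {"wavelength_nm": float}
opt_tp = TerminationPoint("tp-3", "optical")
opt_tp.set_characteristic("wavelength_nm", 1550.12)
```

In this sketch, attaching a VLAN ID to an IP termination point raises a `KeyError`, which is one simple way to keep protocol-specific data out of the common model.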

The NOIM quickly determines service impact when a failure occurs. For example, when a pipeline is cut, as shown in Fig. 1, due to a natural disaster, the cables contained in the pipeline and the logical resources of each layer on the cables are affected by the failure. The NOIM can easily determine which physical and logical resources are affected by a failure by tracing relationships between resources that are represented with common entities such as AggregationStructure, PhysicalLink, TerminationPointEncapsulation, and ForwardingRelationshipEncapsulation. There is no need to change the trace logic even if the network layer configuration changes.
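The tracing described above can be sketched as a graph traversal. The following Python example is an assumption-laden illustration, not the NOIM's actual trace logic: the resource names and the `DEPENDENTS` mapping are hypothetical, and the sketch treats every resource that depends on a failed one as affected, without considering redundancy or protection paths. Because the traversal only follows generic relationships, adding or removing a layer changes the data, not the function.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical relationships: each key is a resource, and its values are
# the resources contained in it or carried over it.
DEPENDENTS: Dict[str, List[str]] = {
    "pipeline-A": ["cable-1", "cable-2"],   # AggregationStructure -> PhysicalLink
    "cable-1": ["optical-fr-1"],            # PhysicalLink -> optical layer
    "cable-2": ["optical-fr-2"],
    "optical-fr-1": ["ethernet-fr-1"],      # optical layer -> Ethernet layer
    "optical-fr-2": ["ethernet-fr-1", "ethernet-fr-2"],
    "ethernet-fr-1": ["ip-fr-1"],           # Ethernet layer -> IP layer
    "ethernet-fr-2": ["ip-fr-2"],
}


def affected_resources(failed: str, dependents: Dict[str, List[str]]) -> Set[str]:
    """Breadth-first trace of every resource that depends on the failed one."""
    seen: Set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


impact = affected_resources("pipeline-A", DEPENDENTS)
print(sorted(impact))
```

With this sample data, cutting `pipeline-A` reaches both cables and every logical resource above them, whereas a failure of `cable-1` alone reaches only `optical-fr-1`, `ethernet-fr-1`, and `ip-fr-1`.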

Unlike cable disconnections or building blackouts caused by natural disasters, a network system failure causes parts of the network and service to become unstable. To accurately determine the area and the number of users affected by such a system failure, it is necessary to determine service impact by considering traffic flows for each service and unstable paths in the network. Therefore, as shown in Fig. 2, we are working on the following enhancements to the NOIM to support large-scale failure response.


Fig. 2. Overview of network-resource-management technology for large-scale failures.

(1) Adding a service-path layer

To provide a more granular view of service impact, we add a service-path layer to the common data model in the NOIM. This layer of logical resources represents the end-to-end connectivity of the services provided on the network. Service impact is currently determined at the same granularity as communication protocols, but the addition of a service-path layer makes it possible to determine service impact at any granularity, such as per area or per user.

(2) Understanding service impact by considering traffic flows

We enable the NOIM to identify services made unstable by network system failures on the basis of whether the traffic for each service flows through an unstable communication path. To do so, we represent the traffic flow of each service as a combination of resources and determine the status of the service on the basis of the status of those resources.

(3) Understanding unstable communication paths

We identify communication paths made unstable by congestion by comparing the bandwidth required for each service path with that of the resources traversed by the traffic flow. Because the factors that cause network instability are not limited to insufficient bandwidth, unstable paths must be identified by using not only the NOIM but also anomaly-detection and failure-cause-estimation technologies.
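Enhancements (2) and (3) can be illustrated together with a small Python sketch. It is a simplified illustration under stated assumptions, not the NOIM's logic: the capacities, service names, and the functions `congested_resources` and `unstable_services` are hypothetical, bandwidth is in Mbit/s, and a resource is treated as congested when the summed requirement of the service paths traversing it exceeds its capacity. Other causes of instability are outside the scope of this sketch.

```python
from typing import Dict, List, Set

# Hypothetical data: capacity of each resource (Mbit/s) and, for each
# service path, its required bandwidth and the resources its traffic traverses.
CAPACITY: Dict[str, float] = {"link-1": 1000.0, "link-2": 400.0, "link-3": 1000.0}
SERVICE_PATHS: Dict[str, dict] = {
    "svc-A": {"required": 300.0, "resources": ["link-1", "link-2"]},
    "svc-B": {"required": 200.0, "resources": ["link-2", "link-3"]},
    "svc-C": {"required": 500.0, "resources": ["link-1", "link-3"]},
}


def congested_resources(capacity: Dict[str, float],
                        service_paths: Dict[str, dict]) -> Set[str]:
    """Return resources whose total required bandwidth exceeds their capacity."""
    demand = {name: 0.0 for name in capacity}
    for path in service_paths.values():
        for resource in path["resources"]:
            demand[resource] += path["required"]
    return {name for name, total in demand.items() if total > capacity[name]}


def unstable_services(service_paths: Dict[str, dict],
                      unstable: Set[str]) -> List[str]:
    """Return services whose traffic flow traverses any unstable resource."""
    return sorted(
        name for name, path in service_paths.items()
        if unstable.intersection(path["resources"])
    )


congested = congested_resources(CAPACITY, SERVICE_PATHS)
print(congested)                                    # {'link-2'}
print(unstable_services(SERVICE_PATHS, congested))  # ['svc-A', 'svc-B']
```

Here `link-2` carries 500 Mbit/s of required bandwidth against a capacity of 400 Mbit/s, so the two services that traverse it are flagged, while `svc-C` is not. The set passed to `unstable_services` could equally come from another source, such as an anomaly detector.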

3. Anomaly-detection and failure-cause-estimation technologies

To detect network faults early and identify the cause of failure rapidly, we are also researching and developing the following artificial intelligence technologies for detecting anomalies and estimating failed resources from network information such as alarms.

  • DeAnoS™: This technology enables a proactive control network that can detect potential performance-degradation risks (failures, congestion, etc.) and demand changes early and predictively, and then execute proactive control and early, automatic recovery.
  • Konan: This technology estimates the cause of failure in a multilayer network. It learns, as rules, the patterns of alarms that are raised when a failure occurs on the network. On the basis of the learned rules and the alarm information at the time of a failure, it can efficiently estimate the resources that caused the failure, including logical resources as well as physical resources.
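The general idea of matching observed alarms against learned rules can be shown with a toy Python sketch. This is not Konan's algorithm, which this article does not describe in detail. The rules, alarm strings, and the function `rank_causes` are hypothetical, the rules are written by hand rather than learned, and Jaccard similarity is used here only as a simple stand-in scoring method.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical rules: for each candidate cause (physical or logical resource),
# the set of alarms assumed to have been observed in past failures.
RULES: Dict[str, FrozenSet[str]] = {
    "router-1 line card": frozenset(
        {"LINK_DOWN:r1-p1", "LINK_DOWN:r1-p2", "BGP_DOWN:r1"}),
    "cable-7": frozenset(
        {"LOS:r1-p1", "LOS:r2-p3", "LINK_DOWN:r1-p1"}),
    "vlan-100 (logical)": frozenset(
        {"CFM_LOSS:vlan-100", "MAC_FLAP:sw-4"}),
}


def rank_causes(observed: FrozenSet[str],
                rules: Dict[str, FrozenSet[str]]) -> List[Tuple[str, float]]:
    """Rank candidate causes by Jaccard similarity between the observed
    alarms and each rule's alarm pattern (highest score first)."""
    scores = []
    for cause, pattern in rules.items():
        union = observed | pattern
        score = len(observed & pattern) / len(union) if union else 0.0
        scores.append((cause, score))
    return sorted(scores, key=lambda item: item[1], reverse=True)


observed = frozenset(
    {"LOS:r1-p1", "LOS:r2-p3", "LINK_DOWN:r1-p1", "BGP_DOWN:r1"})
ranking = rank_causes(observed, RULES)
for cause, score in ranking:
    print(f"{cause}: {score:.2f}")
```

With these sample alarms, `cable-7` ranks first (score 0.75) because three of its four combined alarms match, ahead of the line card (0.40) and the VLAN (0.00). A ranked list rather than a single answer leaves room for an operator to check the top candidates.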

4. Conclusion

We explained operational technologies to achieve a robust network. NTT laboratories will promote not only the development of individual technologies but also a zero-touch operation framework for the cooperation and introduction of these technologies, contributing to the early understanding of network conditions and the speedup of recovery responses.

[1] M. Sato, S. Nishikawa, K. Murase, and K. Tayama, “Technology for Understanding Service Impact Using Network Resource Management Technology that Is Independent of Network Type,” NTT Technical Review, Vol. 18, No. 11, pp. 48–52, Nov. 2020.
https://doi.org/10.53829/ntr202011ra2
[2] TM Forum,
https://www.tmforum.org/
Kazuaki Akashi
Senior Research Engineer, Access Operation Project, NTT Access Network Service Systems Laboratories.
He received a B.S. and M.S. from The University of Electro-Communications, Tokyo, in 2009 and 2011, respectively. He joined NTT in 2011 and has been engaged in the research and development of access network operations and in the development, maintenance, and management of internal IT systems.
Shunsuke Kanai
Senior Research Engineer, Access Operation Project, NTT Access Network Service Systems Laboratories.
He received a B.S. in mathematics from Tokyo University of Science in 1998. He obtained an MBA in 2004 and Ph.D. in social engineering in 2016. He joined NTT in 1998, where he worked in facility investment planning for access networks and has been engaged in the research and development of network operations and network device management and control.