Feature Articles: Toward More Robust Networks
Vol. 21, No. 12, pp. 24–26, Dec. 2023. https://doi.org/10.53829/ntr202312fa2
Technologies for Promptly Understanding Network Conditions When Large-scale System Failure Occurs
Abstract
NTT laboratories are developing enhanced operational technologies to make communication networks robust. In this article, we introduce NTT research and development initiatives to promptly understand the network conditions and the impact of failures on complex network services when a large-scale system failure occurs.
Keywords: communication network, fault management, large-scale failure
1. Introduction
NTT laboratories are researching and developing operational technologies to make communication networks robust. In this article, we introduce the Network Operation Injected Model (NOIM), which centrally manages networks composed of various communication protocols and enables early detection of the impact of failures on network services, DeAnoS™ (Deep Anomaly Surveillance), which detects anomalies, and Konan (Knowledge-based Autonomous Failure-Event Analysis Technology for Networks), which detects the cause of failures by using pre-learned rules.
2. Network-resource-management technology
Network services are provided by multilayer networks that combine different communication protocols, such as optical-transmission networks, Ethernet networks, and Internet protocol (IP) networks. Because these networks are typically managed by separate systems, the effects of a network failure at one layer on networks and services at other layers must be manually analyzed. However, when a large-scale failure occurs, it takes a long time to understand the network status and service impact because the location of the failure and its impact are widespread. We developed the NOIM to enable early understanding of service impact [1].
The NOIM enables the central management of complex multilayer networks and quick understanding of the impact of failures on services by expressing network information, such as termination points and forwarding relationships, in a common data model that does not depend on layers. The NOIM uses a common data model that is based on the Shared Information/Data Model (SID) discussed in the TM Forum [2] to achieve network-resource management. It uses the entities defined as the Physical Resource and Logical Resource in the SID. The SID Physical Resource defines the entities PhysicalDevice, PhysicalLink, PhysicalStructure, and AggregationStructure, which represent the physical resources routers, cables, buildings, and pipelines, respectively. Similarly, the SID Logical Resource defines the entities TerminationPointEncapsulation, NetworkForwardingDomain, and ForwardingRelationshipEncapsulation, which represent the logical resources of a network in each layer: termination points, domains, and forwarding relationships, respectively. By using these entities to represent each resource and the connection relationship between resources, as shown in Fig. 1, the NOIM makes it possible to manage multilayer networks composed of various communication protocols.
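To make the idea of a layer-independent data model more concrete, the following Python sketch represents a small two-layer network with the generic entity names listed above. It is only an illustration under stated assumptions and not the NOIM implementation: the `Resource` class, its `layer` and `supported_by` fields, the helper functions, and all resource names are hypothetical, and the real SID associations are richer than a single "supported by" link.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Resource:
    """A generic resource; `entity` holds one of the SID entity names."""
    name: str
    entity: str
    layer: Optional[str] = None  # assumption: None marks a physical resource
    # Assumption: one generic link to the resources this one relies on.
    supported_by: List["Resource"] = field(default_factory=list)


def build_example() -> List[Resource]:
    """Build a hypothetical Ethernet-over-cable, IP-over-Ethernet network."""
    # Physical resources
    building = Resource("building-1", "PhysicalStructure")
    pipeline = Resource("pipeline-1", "AggregationStructure")
    router_a = Resource("router-a", "PhysicalDevice", None, [building])
    router_b = Resource("router-b", "PhysicalDevice")
    cable = Resource("cable-1", "PhysicalLink", None, [pipeline])

    # Ethernet layer
    eth_domain = Resource("eth-domain", "NetworkForwardingDomain", "Ethernet")
    eth_tp_a = Resource("eth-tp-a", "TerminationPointEncapsulation", "Ethernet", [router_a])
    eth_tp_b = Resource("eth-tp-b", "TerminationPointEncapsulation", "Ethernet", [router_b])
    eth_link = Resource("eth-link-ab", "ForwardingRelationshipEncapsulation",
                        "Ethernet", [eth_tp_a, eth_tp_b, cable])

    # IP layer: the same entity types are reused, only `layer` differs
    ip_tp_a = Resource("ip-tp-a", "TerminationPointEncapsulation", "IP", [eth_tp_a])
    ip_tp_b = Resource("ip-tp-b", "TerminationPointEncapsulation", "IP", [eth_tp_b])
    ip_link = Resource("ip-link-ab", "ForwardingRelationshipEncapsulation",
                       "IP", [ip_tp_a, ip_tp_b, eth_link])

    return [building, pipeline, router_a, router_b, cable, eth_domain,
            eth_tp_a, eth_tp_b, eth_link, ip_tp_a, ip_tp_b, ip_link]


def names_by_entity(resources: List[Resource], entity: str) -> List[str]:
    """Return the names of all resources of one entity type, in any layer."""
    return [r.name for r in resources if r.entity == entity]
```

In this sketch, `names_by_entity(build_example(), "ForwardingRelationshipEncapsulation")` returns `['eth-link-ab', 'ip-link-ab']`: the Ethernet link and the IP link are both instances of the same entity, so code that works on forwarding relationships does not need to know which protocol it is handling.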
The NOIM also provides a mechanism for externally defining characteristics specific to each communication protocol and managing them by linking them to the above generic entities. Example characteristics are IP addresses in IP networks and VLAN IDs (virtual local area network identifiers) in Ethernet networks. Common characteristics that do not depend on the communication protocol are represented by entities, and unique characteristics can be added as external definitions, allowing flexibility when adding or changing communication protocols and services.
The NOIM quickly determines service impact when a failure occurs. For example, when a pipeline is cut, as shown in Fig. 1, due to a natural disaster, the cables contained in the pipeline and the logical resources of each layer on the cables are affected by the failure. The NOIM can easily determine which physical and logical resources are affected by a failure by tracing relationships between resources that are represented with common entities such as AggregationStructure, PhysicalLink, TerminationPointEncapsulation, and ForwardingRelationshipEncapsulation. There is no need to change the trace logic even if the network layer configuration changes.
Unlike cable disconnections or building blackouts caused by natural disasters, a network system failure causes parts of the network and service to become unstable. To accurately determine the area affected by a system failure and the number of users in such an event, it is necessary to determine service impact by considering traffic flows for each service and unstable paths in the network. Therefore, as shown in Fig. 2, we are working on the following enhancements to the NOIM to support large-scale failure response.
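The relationship tracing described above for the pipeline-cut example can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the NOIM's actual trace logic: the `DEPENDENTS` and `CHARACTERISTICS` tables, the function `affected_resources`, and every resource name and value are hypothetical. The trace is written as a plain breadth-first search over "depends on" relationships.

```python
from collections import deque

# Hypothetical inventory. Keys are (entity, name) pairs that use the generic
# entity names from the article; values list the resources that directly
# depend on the key.
DEPENDENTS = {
    ("AggregationStructure", "pipeline-1"): [
        ("PhysicalLink", "cable-1"),
        ("PhysicalLink", "cable-2"),
    ],
    ("PhysicalLink", "cable-1"): [
        ("TerminationPointEncapsulation", "eth-tp-a"),
    ],
    ("TerminationPointEncapsulation", "eth-tp-a"): [
        ("ForwardingRelationshipEncapsulation", "eth-link-ab"),
        ("TerminationPointEncapsulation", "ip-tp-a"),
    ],
    ("TerminationPointEncapsulation", "ip-tp-a"): [
        ("ForwardingRelationshipEncapsulation", "ip-link-ab"),
    ],
    ("ForwardingRelationshipEncapsulation", "eth-link-ab"): [
        ("ForwardingRelationshipEncapsulation", "ip-link-ab"),
    ],
}

# Protocol-specific characteristics are kept outside the generic entities
# (assumed layout). The trace below never reads this table.
CHARACTERISTICS = {
    ("TerminationPointEncapsulation", "eth-tp-a"): {"vlan_id": 100},
    ("TerminationPointEncapsulation", "ip-tp-a"): {"ip_address": "192.0.2.1"},
}


def affected_resources(failed, dependents):
    """Return every resource that directly or indirectly depends on `failed`,
    in breadth-first order, visiting each resource once."""
    seen = set()
    order = []
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


if __name__ == "__main__":
    for entity, name in affected_resources(
        ("AggregationStructure", "pipeline-1"), DEPENDENTS
    ):
        print(f"{name} ({entity})")
```

Because the search only follows generic relationships and ignores the protocol-specific table, adding another layer in this sketch means adding entries to `DEPENDENTS` without touching `affected_resources`, which mirrors the article's point that the trace logic does not change when the layer configuration changes.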
(1) Adding a service-path layer
To provide a more granular view of service impact, we add a service-path layer to the common data model in the NOIM. This layer of logical resources represents the end-to-end connectivity of the services provided on the network. Service impact is currently determined at the same granularity as communication protocols, but the addition of a service-path layer makes it possible to determine service impact at any granularity, such as per area or per user.
(2) Understanding service impact by considering traffic flows
We enable the NOIM to identify services made unstable by network system failures on the basis of whether the traffic for each service is flowing through an unstable communication path. To do so, we represent the traffic flow of each service as a combination of resources and determine the status of the service on the basis of the status of those resources.
(3) Understanding unstable communication paths
We identify communication paths made unstable by congestion by comparing the bandwidth required for each service path with that of the resources traversed by the traffic flow. Because the factors that cause network instability are not limited to insufficient bandwidth, it is necessary to understand the unstable paths by using not only the NOIM but also anomaly-detection and failure-cause-estimation technologies.
3. Anomaly-detection and failure-cause-estimation technologies
To detect network faults early and rapidly identify the cause of failure, we are also researching and developing the following artificial intelligence technologies for detecting anomalies and estimating failed resources from network information such as alarms.
4. Conclusion
We explained operational technologies to achieve a robust network. NTT laboratories will promote not only the development of individual technologies but also a zero-touch operation framework for the cooperation and introduction of these technologies, contributing to the early understanding of network conditions and the speedup of recovery responses.