Feature Articles: Reducing Security Risks in Supply Chains by Improving and Utilizing Security Transparency

Efforts to Improve and Utilize Security Transparency in Software Supply Chains

Yasunori Wada and Reika Arakawa

Abstract

Reflecting on the gap between the expectations that various stakeholders have for using visualization data to reduce risks in software supply chains and the reality that such use has not progressed, we introduce the latest research trends on issues related to the use of visualization data, as well as the security transparency technologies that NTT Social Informatics Laboratories is investigating.

Keywords: visualization data, SBOM, LLM

1. Introduction

In response to government trends in Japan and overseas, each business operator in a software supply chain is required to provide visualization data including a software bill of materials (SBOM) and respond to security issues. However, there are various practical and technical issues involved in the actual implementation of these measures. In this article, we introduce the issues that businesses face when responding to various systems, technical issues involved in the production and use of visualization data, and technologies that NTT Social Informatics Laboratories is investigating to expand the use of visualization data.

2. Expectations of various stakeholders

Institutions in the United States, the EU, and Japan require not only the provision of visualization data including SBOMs but also management of visualization data and vulnerability management using visualization data to reduce supply chain security risks.

However, from the perspective of each business operator in a supply chain, the requirements imposed by established systems and guidelines cannot be immediately reflected in system operations. For example, there are various tools and technologies for generating visualization data, and it is necessary to understand and select appropriate tools. Business operators will also need knowledge and best practices to manage and operate these tools. In accordance with the systems and guidelines of each country [1], the following issues need to be addressed.

First, there are issues in providing visualization data. Each business operator is required to provide an SBOM in various situations, but the content of the information and the format of data and files may differ depending on the requester. As we explain later, even data in the same format can contain different information. Business operators that generate visualization data need to select tools that meet these requirements and learn how to use them.

Next, there are issues related to operations using visualization data. Guidelines stipulate that an SBOM must be managed for a certain period, and it is also necessary to determine and manage the frequency of updating the SBOM, such as when updating software.

Security risk management using visualization data is also an issue. In addition to using visualization data to reduce security risks, each country’s system requires business operators to certify compliance with security requirements, address vulnerabilities, and disclose information. Therefore, business operators need to consider how to prove security using visualization data, what to disclose, and how to continuously ensure security using visualization data. To address these issues, cooperation between those who produce visualization data and those who use them is important.

The popularization of SBOMs has made it possible to identify software names and versions. However, in the case of the open source software (OSS) XZ Utils, which surfaced in March 2024, it was reported that attackers took time to infiltrate the software development project and installed a backdoor [2]. This suggests a new challenge: using visualization data not only to check whether suspicious software has been mixed in but also to check whether even legitimate software is behaving in an unauthorized manner.

3. The status of visualization data

Software supply chains are formed through multiple different organizations and are becoming more complex. In addition, the reuse of OSS makes software security threats more serious. Visualization data are expected to increase the transparency of software and serve as a means to combat these threats, but they are not yet fully utilized. This is due to various issues related to visualization data.

The life cycle of visualization data is roughly divided into the generation phase, the collection/management phase, and the utilization phase, and each phase has its own issues. In the generation phase, users generate visualization data to manage the dependencies, licenses, and other configuration information of devices and systems. The software composition analysis (SCA) tools used to generate the visualization data have several issues. One is that differences in the specifications of SCA tools cause inconsistent representation of the character strings output to the visualization data. Examples include whether the SCA tool assigns the prefix “Person:” or “Organization:” to the string output as the supplier’s organization name, and variations such as “organization-name inc” versus “organization-name llc.” Companies have been addressing this issue, and some have implemented matching methods using their own databases [3]. Another issue is that analysis performance differs among SCA tools. As an example from our research data, when we examined MongoDB image files from Docker Hub, the SCA tool Syft output 295 dependency packages, while Trivy output 136. An SCA tool should thus be selected according to the configuration information to be visualized and the purpose of use. However, some research results suggest that no SCA tool meets the minimum requirements [4] in the guidelines issued by the National Telecommunications and Information Administration (NTIA) in the United States [5, 6].
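The inconsistent-representation issue can be illustrated with a minimal Python sketch. It is not the database-based matching method cited in [3]; it only assumes that stripping the tool-assigned prefix and a trailing legal-entity suffix is enough to compare supplier strings. The function name `normalize_supplier` and the prefix and suffix lists are hypothetical.

```python
import re

# Hypothetical lists for illustration; real SCA output may need larger sets.
PREFIXES = ("person:", "organization:")
LEGAL_SUFFIXES = {"inc", "llc", "ltd", "corp", "co"}


def normalize_supplier(raw: str) -> str:
    """Return a canonical key for a supplier string emitted by an SCA tool."""
    name = raw.strip().lower()
    # Remove the prefix that the SCA tool assigned, if any.
    for prefix in PREFIXES:
        if name.startswith(prefix):
            name = name[len(prefix):]
            break
    # Replace punctuation (except hyphens) with spaces, then tokenize.
    tokens = re.sub(r"[^\w\s-]", " ", name).split()
    # Drop trailing legal-entity suffixes such as "inc" or "llc".
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


a = normalize_supplier("Person: organization-name inc")
b = normalize_supplier("Organization: Organization-Name LLC.")
print(a == b)
```

Under these assumptions the two strings map to the same key. A rule-based approach like this will miss renamed or abbreviated suppliers, which is presumably where a curated database becomes necessary.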

The collection/management phase collects and manages the generated visualization data. The issue is that it is difficult to handle the visualization data in a unified way because of limited compatibility between formats. There are two main formats for visualization data, SPDX and CycloneDX. The former has many license information items and the latter has many security information items. When managing collected visualization data, using only one of the two formats leaves some items missing. Therefore, it is important to consider a comprehensive format model and to develop an integrated platform that maintains compatibility [7].
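As a rough illustration of what a common model might look like, the following Python sketch maps a few fields from SPDX 2.x JSON and CycloneDX JSON documents into one record and merges them by name and version. The `Component` class and the function names are hypothetical, only a handful of fields are covered, and this is not the integrated platform discussed in [7].

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """Hypothetical common record holding fields found in either format."""
    name: str
    version: str
    licenses: list = field(default_factory=list)
    purl: str = ""


def from_spdx(doc: dict) -> list:
    """Read packages from an SPDX 2.x JSON document."""
    out = []
    for pkg in doc.get("packages", []):
        lic = pkg.get("licenseConcluded", "")
        licenses = [lic] if lic and lic != "NOASSERTION" else []
        purl = ""
        for ref in pkg.get("externalRefs", []):
            if ref.get("referenceType") == "purl":
                purl = ref.get("referenceLocator", "")
        out.append(Component(pkg.get("name", ""), pkg.get("versionInfo", ""),
                             licenses, purl))
    return out


def from_cyclonedx(doc: dict) -> list:
    """Read components from a CycloneDX JSON document."""
    out = []
    for comp in doc.get("components", []):
        licenses = [entry["license"]["id"]
                    for entry in comp.get("licenses", [])
                    if "id" in entry.get("license", {})]
        out.append(Component(comp.get("name", ""), comp.get("version", ""),
                             licenses, comp.get("purl", "")))
    return out


def merge(components: list) -> dict:
    """Merge records keyed by (name, version), filling gaps from either side."""
    merged = {}
    for c in components:
        key = (c.name, c.version)
        if key not in merged:
            merged[key] = Component(c.name, c.version, list(c.licenses), c.purl)
        else:
            m = merged[key]
            m.licenses = sorted(set(m.licenses) | set(c.licenses))
            m.purl = m.purl or c.purl
    return merged
```

Keying on name and version is itself an assumption; it inherits the string-inconsistency problem described for the generation phase, so a real platform would likely need normalization first.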

In the final utilization phase, there is the security issue of sharing visualization data across different organizations. Specifically, this is an issue of data integrity and access control to ensure that visualization data are not illegally rewritten in the process of sharing. To ensure the authenticity of visualization data, technologies that apply the verifiable credentials model in a blockchain to supply chains are being studied [8].
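The integrity requirement alone can be sketched in Python with a keyed hash. This is a deliberately simpler stand-in, not the verifiable credentials model studied in [8]: it assumes the sender and receiver already share a secret key, and the function names are hypothetical.

```python
import hashlib
import hmac
import json


def canonical_bytes(sbom: dict) -> bytes:
    """Serialize the document deterministically so equal content hashes equally."""
    return json.dumps(sbom, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign(sbom: dict, key: bytes) -> str:
    """Compute an HMAC-SHA256 tag over the canonical serialization."""
    return hmac.new(key, canonical_bytes(sbom), hashlib.sha256).hexdigest()


def verify(sbom: dict, key: bytes, tag: str) -> bool:
    """Return True only if the document still matches the tag."""
    return hmac.compare_digest(sign(sbom, key), tag)
```

A shared-key scheme does not let a third party verify who issued the data, which is one reason signature-based models such as verifiable credentials are being considered for sharing across organizations.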

Issues related to visualization data differ depending on the phase, and there are both microscopic issues, such as inconsistent representation of visualization data, and macroscopic issues related to the management and utilization of visualization data. Since these issues are not independent, it is difficult for a single company to address the issues that are barriers to the spread and utilization of visualization data. Studies have organized the various issues of visualization data and discussed directions for solutions [9–11], but only a few papers have made technical proposals based on actual issues. In the Security Transparency Consortium, each company shares its knowledge and exchanges technical opinions for the popularization of visualization data.

4. Enhancing security operations using visualization data

Vulnerability management is a security operation that will be greatly changed by using visualization data. Vulnerability management involves the collection of vulnerability information, confirmation of vulnerability risks, and analysis of the impact on the organization [12].

The first step in vulnerability management is to understand the configuration of the hardware and software used by the organization. Accurately identifying the configuration makes it possible to accurately identify vulnerabilities. Methods of identifying the configuration include the use of management sheets and package-management systems. Because the system configuration changes with system updates, and some software is not managed by package-management systems, these methods have problems such as management omissions and increased management work. These problems can be solved by using visualization data to obtain accurate and up-to-date configuration information.
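A minimal Python sketch of this first step, assuming the configuration has already been extracted from visualization data as (name, version) pairs, is shown below. The `Advisory` structure, its field names, and the exact-version matching rule are hypothetical simplifications; real advisories usually express affected versions as ranges.

```python
from typing import NamedTuple


class Advisory(NamedTuple):
    """Hypothetical advisory record; real feeds carry version ranges."""
    vuln_id: str
    package: str
    affected_versions: frozenset


def find_affected(components: list, advisories: list) -> list:
    """Return (name, version, vuln_id) for each component hit by an advisory."""
    # Index advisories by package name to avoid scanning all of them per component.
    index = {}
    for adv in advisories:
        index.setdefault(adv.package, []).append(adv)
    hits = []
    for name, version in components:
        for adv in index.get(name, []):
            if version in adv.affected_versions:
                hits.append((name, version, adv.vuln_id))
    return hits
```

Because the lookup is driven entirely by the component list, any package missing from the visualization data is silently skipped, which is why the accuracy of the configuration information matters so much here.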

Other problems may arise, however. In vulnerability management, multiple pieces of information are used to analyze the impact of vulnerabilities. Examples include the severity of the vulnerability, availability of exploit code, actual damage status, communication status, and process status. A security operator or developer uses these data to determine the impact. Because visualization data make it possible to visualize vulnerabilities accurately and to grasp vulnerabilities that were previously overlooked, more vulnerabilities must be assessed, and they can no longer be managed with the same approach as before. Therefore, along with research and development (R&D) on visualization data, R&D on vulnerability countermeasures is also required. The following are two such countermeasures.

The first is a technology that visualizes communication activities occurring in devices to narrow down the vulnerabilities that need to be addressed first. This technology generates communication information that is linked to the software that performed the communication, so facts such as “software X version Y communicated with global Internet protocol (IP) address Z” can be visualized. Because communication information is used to determine the impact of a vulnerability, it can be used to narrow down, on the basis of the communication destination, the high-risk vulnerabilities that need to be addressed preferentially.

The second technology analyzes and visualizes programs that are executed when a device is started. This technology makes it possible to visualize the programs that are executed when a device starts up and those that are periodically executed. The information used to determine the impact of a vulnerability includes whether software is running, so it is possible to narrow down the software contained in a device that should be given priority during a vulnerability check.
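How the outputs of these two technologies might feed prioritization can be sketched in Python as follows. The `Finding` record, the additive scoring rule, and the weights 2.0 and 3.0 are assumptions made for illustration only and are not taken from the technologies described above.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class Finding:
    vuln_id: str
    software: str
    version: str
    severity: float  # for example, a CVSS base score


def priority(finding: Finding, running: set, destinations: dict) -> float:
    """Raise the score when the software runs or talks to a global IP address.

    running: set of (software, version) pairs observed at startup or periodically.
    destinations: maps (software, version) to a list of destination IP strings.
    """
    key = (finding.software, finding.version)
    score = finding.severity
    if key in running:
        score += 2.0  # hypothetical weight for software that actually executes
    if any(ipaddress.ip_address(ip).is_global for ip in destinations.get(key, [])):
        score += 3.0  # hypothetical weight for communication outside private ranges
    return score


def rank(findings: list, running: set, destinations: dict) -> list:
    """Sort findings so that the highest-priority one comes first."""
    return sorted(findings, key=lambda f: priority(f, running, destinations),
                  reverse=True)
```

With this rule, a medium-severity vulnerability in software that runs at startup and reaches a global address can outrank a higher-severity one in software that never executes, which matches the narrowing-down idea described above.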

We hope to advance the use of visualization data by developing technologies that solve the problems associated with the use of visualization data.

5. Initiatives to expand the use of visualization data

The device and system configuration information described in visualization data is mainly used for understanding dependencies and managing vulnerabilities. However, these are only examples of using a single set of visualization data. By using multiple sets of visualization data in a supply chain, more extensive use can be expected, such as identifying erroneous configuration information from differences among the visualization data, or compensating for configuration information that is missing due to limits in SCA performance by using the dependency and co-occurrence characteristics of configuration information. As an example of these characteristics, if dependency package D is described in the visualization data and package D has a dependency relationship with packages A and C, then these packages have a co-occurrence relationship. We are building a platform to manage visualization data on a large scale and examining techniques to capture patterns by analyzing the characteristics of the configuration information as described above. We are also investigating techniques that use large language models (LLMs) to estimate and supplement missing packages. We believe that these techniques will increase the value of the configuration information in visualization data, contribute to the spread of visualization data in the future, and lead to the construction of a highly transparent software supply chain.
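The co-occurrence idea can be sketched in Python as a simple rule miner over many package lists. This is only an assumed illustration, not the technique under investigation: a rule "x implies y" is kept when y appears in at least a fraction `min_conf` of the lists that contain x, and the function names and thresholds are hypothetical.

```python
from collections import Counter
from itertools import permutations


def learn_rules(sboms: list, min_conf: float = 0.9, min_support: int = 2) -> set:
    """Learn ordered pairs (x, y) meaning 'when x is present, y usually is too'."""
    single = Counter()
    pair = Counter()
    for pkgs in sboms:
        pkgs = set(pkgs)
        single.update(pkgs)
        pair.update(permutations(pkgs, 2))  # count both (x, y) and (y, x)
    return {(x, y) for (x, y), n in pair.items()
            if single[x] >= min_support and n / single[x] >= min_conf}


def suggest_missing(pkgs, rules: set) -> list:
    """List packages that the rules expect but the given list does not contain."""
    present = set(pkgs)
    return sorted({y for (x, y) in rules if x in present and y not in present})


corpus = [{"A", "C", "D"}, {"A", "C", "D"}, {"A", "B"}]
rules = learn_rules(corpus, min_conf=1.0)
print(suggest_missing({"D"}, rules))
```

In this toy corpus D always appears together with A and C, so a list that contains only D yields A and C as candidates. Such suggestions are hints to verify, not confirmed dependencies.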

To further strengthen supply chain security, it is important not only to increase transparency and visualize risks as described above but also to deal with risks appropriately and apply the experience to subsequent measures. In addition to visualization data, we are developing technologies to visualize risks in the development phase. We have established a source-code-dependency analysis technology that comprehensively detects risks by lexical string analysis. We are conducting technical verification of source code analysis using an LLM and aim to establish a new risk detection technology that semantically analyzes processing content, which was difficult to do with lexical features [13]. We are also investigating automated vulnerability analysis and risk estimation, which require a high level of knowledge. This task combines natural language processing of vulnerability information, source code processing, analysis capabilities, and personal experience and knowledge, so automating it requires a high level of technology. We are actively using LLMs, which are strong in natural language and code analysis, breaking the task down into smaller tasks and testing their effectiveness on each. In the validation we conducted, we tested whether an LLM could identify whether a given fix was a fix to a vulnerability trigger point. The results indicate that the identification accuracy was somewhat high even with zero-shot prompts. However, the identification accuracy tended to be lower for vulnerability types with many vulnerability trigger points, making it difficult to treat all vulnerabilities in the same way [14]. Another study demonstrated that it is difficult to automatically fix vulnerabilities when the fixes span multiple files [15]. On the basis of these findings, we think it is important to properly decompose and examine the actual problem to determine the extent to which domain specialization is possible using an LLM.
In the area of vulnerabilities, no single technology can replace all others. By pursuing both these studies and the technical studies on ensuring transparency through visualization data, gaps in the specifications and usage of each technology will be reduced, leading to the development of practical technologies to enhance the security of software supply chains.

6. Conclusion

This article introduced NTT Social Informatics Laboratories’ research activities in the competitive domain based on social issues discussed in the collaborative domain of the Security Transparency Consortium. We will continue our research activities so that visualization data enable users to use software with confidence.

References

[1] The Ministry of Economy, Trade, and Industry, “Summary and Call for Public Comment on ‘Guide of Introduction of Software Bill of Materials (SBOM) for Software Management Ver. 2.0 (Draft),’” Apr. 2024 (in Japanese).
https://www.meti.go.jp/press/2024/04/20240426001/20240426001-1.pdf
[2] A gihyo.jp article published on Apr. 2, 2024 (in Japanese).
https://gihyo.jp/article/2024/04/daily-linux-240402
[3] Press release issued by Assured, Inc. on May 8, 2024 (in Japanese).
https://yamory.io/news/patent-sbom
[4] NTIA, “The Minimum Elements For a Software Bill of Materials (SBOM),” July 2021.
https://www.ntia.gov/report/2021/minimum-elements-software-bill-materials-sbom
[5] W. Otoda, T. Kanda, Y. Manabe, K. Inoue, and Y. Higo, “Analysis of Stack Overflow Questions on the Use of SBOM,” IEICE Tech. Rep., Vol. 123, No. 414, SS2023-70, pp. 127–132, Mar. 2024 (in Japanese).
https://ken.ieice.org/ken/paper/20240308GcCV/
[6] Y. Kanemoto, R. Arakawa, and M. Akiyama, “A Study of Software Transparency Assessment Based on a Large-Scale Survey of SBOM,” Proc. of Computer Security Symposium (CSS) 2023, pp. 332–339, Fukuoka, Japan, Oct./Nov. 2023 (in Japanese).
https://ipsj.ixsq.nii.ac.jp/ej/index.php?active_action=repository_view_main_item_detail&page_id=13&block_id=8&item_id=228660&item_no=1
[7] A DATA INSIGHT article published on Feb. 7, 2023 (in Japanese).
https://www.nttdata.com/jp/ja/trends/data-insight/2023/0207/
[8] E. Komori, H. Tsugawa, and T. Ohtake, “NTT TechnoCross Efforts to Develop Verifiable Credentials Data Models Based on Blockchain Technology and Its Application to SBOM,” NTT Technical Journal, pp. 30–32, Oct. 2023 (in Japanese).
https://journal.ntt.co.jp/article/23459
[9] T. Stalnaker, N. Wintersgill, O. Chaparro, M. Di Penta, D. M. German, and D. Poshyvanyk, “BOMs Away! Inside the Minds of Stakeholders: A Comprehensive Study of Bills of Materials for Software Systems,” Proc. of the 46th IEEE/ACM International Conference on Software Engineering (ICSE 2024), Article no. 44, Lisbon, Portugal, Apr. 2024.
https://doi.org/10.1145/3597503.3623347
[10] B. Xia, T. Bi, Z. Xing, Q. Lu, and L. Zhu, “An Empirical Study on Software Bill of Materials: Where We Stand and the Road Ahead,” Proc. of the 45th IEEE/ACM International Conference on Software Engineering (ICSE 2023), pp. 2630–2642, Melbourne, Australia, May 2023.
https://doi.org/10.1109/ICSE48619.2023.00219
[11] T. Bi, B. Xia, Z. Xing, Q. Lu, and L. Zhu, “On the Way to SBOMs: Investigating Design Issues and Solutions in Practice,” ACM Transactions on Software Engineering and Methodology, Vol. 33, No. 6, Article no. 149, June 2024.
https://doi.org/10.1145/3654442
[12] IPA Security Center, “Effective Ways to Implement Vulnerability Countermeasures (Practice), Second Edition,” Feb. 2019 (in Japanese).
https://www.ipa.go.jp/security/reports/technicalwatch/hjuojm0000006nd2-att/000071660.pdf
[13] Y. Kanemoto, R. Arakawa, and M. Akiyama, “A Study of Method of Detecting Source Code Risks Using LLM,” IEICE Tech. Rep., Vol. 124, No. 124, ICSS2024-57, pp. 291–298, July 2024 (in Japanese).
https://ken.ieice.org/ken/paper/202407231c3p/
[14] R. Arakawa, Y. Kanemoto, and M. Akiyama, “Toward Semantic Identification Method for Patch Modification Using LLM,” IEICE Tech. Rep., Vol. 124, No. 124, ICSS2024-58, pp. 299–306, July 2024 (in Japanese).
https://ken.ieice.org/ken/paper/20240723PcdW/
[15] H. Pearce, B. Tan, B. Ahmad, R. Karri, and B. Dolan-Gavitt, “Examining Zero-shot Vulnerability Repair with Large Language Models,” Proc. of 2023 IEEE Symposium on Security and Privacy (SP), pp. 2339–2356, San Francisco, CA, USA, May 2023.
https://doi.ieeecomputersociety.org/10.1109/SP46215.2023.10179420
Yasunori Wada
Research Engineer, Social Innovation Project, NTT Social Informatics Laboratories.
He received an M.E. in engineering from Tohoku University, Miyagi, in 2012. Since joining NTT in 2012, he has been engaged in research and development on cybersecurity. His research interests include network security and software security.
Reika Arakawa
Cybersecurity Researcher, Social Innovation Project, NTT Social Informatics Laboratories.
She received an M.E. in information science from Ochanomizu University, Tokyo, in 2020. Since joining NTT in 2020, she has been engaged in research on supply chain security and cybersecurity. She was awarded the CSS Excellent Paper Award in 2023 and the ICSS Paper Award in 2024. She also received a company commendation award for the commercialization of technology for detecting vulnerable code.
