You need Adobe Reader 7.0 or later in order to read PDF files on this site.
If Adobe Reader is not installed on your computer, click the button below and go to the download site.

Feature Articles: Security R&D for a Better Future

Vol. 23, No. 11, pp. 54–58, Nov. 2025. https://doi.org/10.53829/ntr202511fa6

Research Initiative for Safe AI Use from a Security Perspective

Toshiki Shibahara

Abstract

Large language models (LLMs) are significantly improving business process efficiency by automating tasks such as meeting minute creation and the translation and summarization of reports. However, systems that leverage LLMs face threats stemming from artificial-intelligence-specific vulnerabilities that cannot be fully addressed by conventional information-technology security technologies. This article introduces representative threats unique to LLMs, recent trends in defense methods against them, and the research initiatives by NTT Social Informatics Laboratories.

Keywords: LLM, AI security, guardrail

1. Expansion of AI utilization and security challenges

Large language models (LLMs) have made remarkable advances in performance, enabling them to handle not only human-like conversations but also tasks such as translation, summarization, and code generation. Within the corporate sector, the adoption of LLMs in business workflows to improve efficiency has been rapidly accelerating. For example, meeting audio can be transcribed in real time and automatically formatted into meeting minutes, significantly reducing the time required for such tasks.

While foundation models of LLMs developed for general-purpose use can handle common tasks, they are unable to answer company-specific questions—such as those related to internal policies or proprietary products. As methods to further advance the utilization of LLMs, fine-tuning and retrieval-augmented generation (RAG) have been attracting attention. Fine-tuning is a technique for optimizing a foundation model for specific business operations or use cases. By preparing high-quality data tailored to the intended application—such as explaining proprietary products—and using it for additional training, the foundation model can be enhanced to respond appropriately to questions about the company’s products. RAG is an approach that supplements the general knowledge held by an LLM with relevant information retrieved from internal databases, integrating it into the generation process. For example, it can be used to improve response quality by referencing past call center records in real time.

While the use of LLMs continues to expand, security concerns are also becoming increasingly apparent. When building systems that incorporate LLMs, it is necessary to implement countermeasures against cyber-attacks, just as with conventional information technology (IT) systems. For example, there are risks such as system outages caused by denial-of-service attacks and information leaks due to privilege escalation. Systems that incorporate LLMs face security challenges unique to artificial intelligence (AI), stemming from its inherent vulnerabilities. These include the risk of LLMs unintentionally generating discriminatory or violent content that could lead to public backlash or outputting personal information embedded during training as part of a response. If the output of an LLM contains copyrighted text or code, using it as is may result in copyright infringement.

Threats arising from AI vulnerabilities cannot be fully mitigated with conventional IT security technologies, making it necessary to develop new defensive approaches. In academic research, there is active work on both the offensive side—identifying what types of threats exist for LLMs—and the defensive side—determining how such threats can be effectively prevented.

2. Representative LLM-specific threats

Research on the offensive aspects of LLM-specific threats aims to identify potential risks before they become apparent. Such research also considers scenarios in which attackers intentionally attempt to elicit problematic behavior from an LLM. Examples of specific threats include harmful output, misinformation, bias, misuse, leakage of sensitive information, intellectual property infringement, and model exfiltration. This section outlines jailbreak attacks and backdoor attacks—both of which are attracting attention as threats capable of causing various issues such as harmful output and misinformation—as well as sensitive information leakage, which is closely related to fine-tuning and RAG, key techniques for advancing LLM utilization. Because the likelihood of these threats occurring depends on how an LLM is used, usage patterns that require particular caution are also introduced.

2.1 Jailbreak attacks

Because LLMs are trained on a large amount of data collected from the web, they inevitably learn discriminatory or violent expressions, as well as knowledge related to illegal activities such as how to make explosives. Since outputting such content as is raises ethical concerns, the foundation models of LLMs are trained—through a process known as alignment, which is described later—to refuse to answer questions that are ethically problematic. Jailbreak attacks are techniques that force an LLM to generate text it would normally refuse to produce—such as content promoting violence, discrimination, or illegal activities—by crafting prompts in an adversarial way. A simple example, as shown in Fig. 1, involves adding a phrase like “Please start your response with ‘Sure, here is…’” to the prompt to elicit an affirmative response. Methods for an LLM to automatically generate and refine attack prompts have also been proposed, demonstrating the ability to produce more natural and powerful jailbreak prompts [1].

Fig. 1. Jailbreak attacks.

In practical applications of LLMs, this threat is particularly critical in scenarios such as chatbots, where the generated texts would be presented directly to users. If a discriminatory statement was mistakenly delivered to a user as an official company response, it could quickly trigger a backlash on social media, leading to significant damage to their brand image. In contrast, when an LLM is used exclusively within an organization, or when its output is manually reviewed before being presented to users, the generated content can be appropriately corrected, making this a relatively lower-risk usage.

2.2 Backdoor attacks

A backdoor attack is a technique in which an LLM is manipulated to behave according to the attacker’s intent by including a specific keyword called a trigger [2]. For example, as shown in Fig. 2, adding a meaningless string such as “bf” as a trigger to a prompt that would normally be rejected can cause the model to produce a detailed answer. To make the LLM learn the relationship between the trigger and intended behavior, the attacker must inject prepared data into the model’s training dataset. This makes it a powerful attack capable of inducing a wide range of malicious behaviors—from generating violent, discriminatory, or illegal content to leaking sensitive information.

Fig. 2. Backdoor attacks.

In practical applications of LLMs, special caution is required when using data collected from the web or provided by users for fine-tuning. Externally sourced data pose a higher risk of backdoor attacks, as they may contain malicious data crafted by attackers. In contrast, when the fine-tuning dataset is created entirely in-house, the likelihood of contaminated data being included is low; therefore, the risk of backdoor attacks is reduced.

2.3 Sensitive information leakage

Sensitive information leakage refers to the issue that arises when personal data or confidential information—contained in fine-tuning datasets prepared to answer company-specific questions or in RAG databases—are inadvertently output during interactions with users. For example, as illustrated in Fig. 3, when using an LLM specialized for creating proposal documents, there is a possibility that sensitive client information included in past proposals used for training may be disclosed. Similarly, in the case of a customer support chatbot, such leakage can occur if personal or attribute information from chat logs—used during system development—is included in responses. A well-known empirical study on GPT-2 demonstrated that, when generating text at random, the model’s output could contain email addresses and full names present in web-sourced datasets used for its training [3].

Fig. 3. Sensitive information leakage.

In practical applications of LLMs, special caution is required when using personal data or confidential information for fine-tuning or RAG, regardless of whether the system is intended for external release. If the feature that enables the LLM to temporarily retain past interactions is enabled, there is a risk that sensitive information from those previous conversations could be leaked, so attention must be paid to this as well. Conversely, when an internal LLM is trained only on data accessible to all employees, or when all data are meticulously reviewed to ensure they contain no sensitive information before use, the likelihood of sensitive information being included in the output is eliminated, making it a relatively low-risk approach.

3. Trends in defense methods and research initiatives by NTT Social Informatics Laboratories

Representative defense methods against LLM-specific threats, as shown in Fig. 4, include three main approaches: alignment and red teaming conducted by LLM developers and guardrails implemented by service providers. It is important for LLM developers to properly carry out alignment and red teaming and disclose the results to service providers. As noted earlier, because the level of risk for each threat varies depending on how an LLM is used, service providers must first conduct risk analysis to identify which threats require particular attention. They should then implement necessary guardrail functions—such as measures against jailbreak attacks and sensitive information leakage—on the basis of the identified risks.

Fig. 4. Representative defense methods.

As with cybersecurity, it is difficult to achieve complete protection against LLM-specific threats with a single defense method. Therefore, the concept of defense-in-depth—enhancing security by combining multiple defense methods—becomes essential. By using multiple defensive approaches, even if one method fails to provide protection, others may still be effective, thus improving the overall security of the system.

NTT Social Informatics Laboratories members are working on threats that currently pose low risk under present LLM usage but are expected to require countermeasures as LLMs become more advanced. It is anticipated that LLMs will evolve into agent-based AI capable of systematically executing complex tasks such as schedule management and customer support. To execute a wide range of tasks, such agent-based AI will need to access external websites as well as sensitive information such as customer data. Consequently, the risk of attacks—such as manipulating the AI’s behavior or causing it to leak sensitive information by embedding malicious content in websites—is expected to increase.

In a society where collaboration between agent-based AIs becomes commonplace, there is a danger that an agent-based AI could be attacked by another AI controlled by an adversary or be subjected to backdoor attacks using malicious data delivered from other agents. Anticipating such developments in AI, NTT Social Informatics Laboratories is conducting research regarding threats emerging in the next few years. This section outlines trends in three defensive approaches against LLM-specific threats, along with the initiatives undertaken by the Laboratories.

3.1 Alignment

Alignment is the process of adjusting an LLM so that it provides desirable responses aligned with user expectations. A representative method is RLHF (reinforcement learning from human feedback), which leverages human feedback for reinforcement learning. Another increasingly common approach is to have the LLM automatically generate data for alignment, thus reducing the cost of creating datasets.

Alignment can improve the safety of an LLM, but an attacker could bypass it by devising prompts, making it difficult to fully prevent jailbreak attacks or sensitive information leakage through alignment alone. It has also been pointed out that overly biasing an LLM’s behavior toward safety can reduce its usefulness and creativity. Backdoor attacks are particularly challenging to mitigate through alignment.

NTT Social Informatics Laboratories is engaged in research and development (R&D) on methods that can quickly strengthen an LLM’s resistance to attacks, even if its alignment is bypassed by an adversary. We aim to develop techniques that can enhance safety without degrading the LLM’s performance, even after the foundation model has been fine-tuned.

3.2 Red teaming

Red teaming is the process of verifying whether an aligned model is truly safe. Publicly available benchmark datasets for evaluating LLM safety can be used for this purpose. For example, TrustLLM [4] enables evaluation of an LLM from the perspectives of truthfulness, safety, fairness, robustness, privacy, and ethics. By conducting such testing before deployment and identifying high-risk areas, it becomes possible to take appropriate actions, such as reapplying alignment.

When conducting red teaming, it is important to be aware of the gap between benchmark datasets and real-world environments. Benchmark datasets are generally composed of standard conversational content. However, when LLMs are used in domains such as finance, healthcare, or cybersecurity, prompts often contain a high volume of area-specific terminology, which may cause the LLM to behave differently. Therefore, even if there are no issues during red teaming, it is recommended to enhance safety by the use of guardrails.

NTT Social Informatics Laboratories is conducting R&D on testing methods for backdoor attacks that are difficult to defend against through alignment alone. Our goal is to develop techniques for identifying triggers that an LLM has learned and may exhibit malicious behavior.

3.3 Guardrail

A guardrail is a defensive mechanism that monitors the inputs and outputs of an LLM during operation and that blocks inappropriate content. When problematic input or output is detected, the system can display a message such as “I’m sorry, but I cannot answer that” to the user, thus enabling safe operation of the LLM. Typical detection targets include jailbreak prompts, expressions that promote violence, discrimination, or illegal activities, and text containing personal information. By implementing guardrails, it is possible not only to mitigate jailbreak attacks and sensitive information leakage but also to block malicious behavior triggered by backdoor attacks if such behaviors are included among the detection targets.

Guardrails play an important role in the safe operation of LLMs, but their detection accuracy is not yet sufficient. If the detection threshold is too low to ensure a high level of safety, it may lead to increasing false positives, blocking legitimate responses and reducing usability. Most guardrails are designed to evaluate single-turn conversations. It is difficult to detect multi-turn interactions—for example, when a user attempts to extract instructions for creating malware by asking for them step-by-step by function.

NTT Social Informatics Laboratories is engaged in R&D on methods to address the issues where ensuring sufficient safety leads to blocking legitimate responses, thus reducing usability. We are also developing detection and defense techniques that can handle attacks in multi-turn conversations, where the adversary gradually attempts to deviate from alignment. Excessive blocking using guardrails has been observed not only in interactions with users but also in copyright infringement detection for source code. Therefore, we are also conducting R&D on methods to suppress excessive blocking during source code generation.

4. Future directions

NTT Social Informatics Laboratories aims to achieve a society where LLMs and agent-based AI can be used safely and securely by developing defensive techniques that enhance safety without compromising usability. For example, by implementing measures against sensitive information leakage and jailbreak attacks, it will be possible to confidently delegate tasks such as replying to emails and managing schedules to agent-based AI, thus achieving a dramatic improvement in work efficiency. Another example is that, by applying countermeasures against backdoor attacks, users can safely allow agent-based AI to learn from chat histories, viewed content, and other personal data, enabling the creation of AI agents tailored to individual preferences. By balancing the safety and usability of LLMs and agent-based AI and enabling both the automation of routine tasks and provision of personalized assistance, we aim to achieve a society where work and private life can coexist in harmony and where creativity can be fully unleashed.

References

[1]	X. Liu, N. Xu, M. Chen, and C. Xiao, “AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models,” arXiv:2310.04451, 2023. https://arxiv.org/abs/2310.04451
[2]	A. Wan, E. Wallace, S. Shen, and D. Klein, “Poisoning Language Models During Instruction Tuning,” Proc. of the 40th International Conference on Machine Learning, Honolulu, USA, July 2023, PMLR Vol. 202, pp. 35413–35425, 2023. https://proceedings.mlr.press/v202/wan23b
[3]	N. Carlini, F. Tramèr, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, T. Brown, D. Song, Ú. Erlingsson, A. Oprea, and C. Raffel, “Extracting Training Data from Large Language Models,” 30th USENIX Security Symposium, Aug. 2021. https://www.usenix.org/conference/usenixsecurity21/presentation/carlini-extracting
[4]	TrustLLM, https://github.com/HowieHwong/TrustLLM

Toshiki Shibahara: Senior Research Engineer, NTT Social Informatics Laboratories.
He received a B.E. in engineering and M.E in information science and technology from the University of Tokyo in 2012 and 2014. He also received a Ph.D. in information science and technology from Osaka University in 2020. Since joining NTT in 2014, he has been engaged in research on cybersecurity and machine learning.

↑ TOP