Red Teaming Bytedance Doubao LLMs

Jan 11, 2025

Abstract

As large language models (LLMs) become increasingly popular in various applications, ensuring their security and ethical consistency remains a key challenge. This study comprehensively evaluated the vulnerabilities of LLM through innovative red team techniques. We used a variety of adversarial cues to test LLMs in Doubao AI, like Doubao-lite, Doubao-pro.

Our method involves writing 10000 prompts in original format, categorizing them across six types of domains, fifty types of categories, and using thirty different red team techniques. The study revealed significant changes in model vulnerabilities, with manipulation and coercion, misleading and false information, and illegal activities exhibiting the most vulnerable categories across all models. It is worth noting that the social engineering attack are particularly effective in bypassing the model's security measures.

The study also found some worrying trends, such as the erosion of moral safeguards in long-term interactions and the high vulnerability of creative expression context prompts. These findings emphasize the necessity of more dynamic and adaptive security mechanisms in LLM.

A detailed visual concept representing the security and ethical vulnerabilities of large language models (LLMs). The artwork includes abstract representations of AI models, surrounded by red warning signals, depicting adversarial threats and attacks. Incorporate elements like broken locks, flowing data streams with errors, and human figures symbolizing ethical dilemmas. Emphasize the concept of adaptability and resistance with shields and adaptive defense systems, while showcasing multilingual challenges using global language scripts in the background.

1 Introduction

The rapid development of large language models (LLMs) has brought significant capabilities to the field of natural language processing, while also raising concerns about their security and ethical implications. As these models become more and more integrated into real-world applications, the need for robust assessments of their resistance to adversarial attacks and potential abuse becomes increasingly urgent. Red team testing, as an active stress testing system to expose vulnerabilities, plays a key role in ensuring that large language models can maintain ethical boundaries and defend against harmful behaviors in a variety of contexts.

Our research will explore vulnerabilities in cutting-edge LLM (such as Doubao-lite-128k, DouBao-lite-4k, DouBao-lite-32k) by applying a variety of known red team methods to identify which harm classes these models are most vulnerable and robust. We will evaluate the extent to which different red team techniques are effective in revealing vulnerabilities and challenge these advanced models with a variety of approaches. Our goal is to uncover which techniques are most successful at jailbreaking LLM and to assess their susceptibility to various forms of harm.

In addition, while much of the existing research on red team operations is focused on a single language (usually English), we have expanded our research by adding minority languages, such as Arabic and Latin, and eventually other languages. This additional linguistic dimension enriches the diversity of red team operations and allows us to assess whether vulnerabilities exposed through red team operations techniques will change as the model runs in different locales. By integrating language as one of the multiple factors in the red team's operations, we aim to deepen our understanding of how LLMs behave in a broader context.

2 Background

Red team testing is a key practice in AI security, especially for large language models (LLMs). It involves systematically challenging AI systems to identify vulnerabilities, limitations, and potential risks before deployment. As LLMs become more powerful and widely used in a variety of applications, the importance of red team testing has increased significantly.

Red team testing LLMs typically involves attempting to elicit harmful, biased, or other undesirable outputs from the model (Perez et al., 2022). This process helps developers identify weaknesses in model training, alignment, or security measures. By revealing these issues, red team testing can lead to more robust security safeguards and improve the overall security and reliability of the model.

Red Team operational practices are critical for several reasons:

Identify unanticipated vulnerabilities: As LLMs become more sophisticated, they may exhibit unexpected behaviors or vulnerabilities that are not apparent during standard testing (Ganguli et al., 2022).
Improved model alignment: Red team exercises help ensure that LLMs behave in line with human values and intentions, reducing the risk of unintended consequences (Bai et al., 2022).
Strengthen resilience: By exposing models to a variety of adversarial inputs, red team operations help increase their resistance to malicious use or exploitation (Zou et al., 2023).
Building trust: Demonstrating a commitment to rigorous safety testing can help build public trust in AI technology (Touvron et al., 2023).

The report provides us with valuable insights into a comprehensive report on the use of large language models (LLMs) for red-team technology comparisons. The report also analyzes how different LLMs respond to various red-team techniques, providing key information to improve the robustness and security of the model. By categorizing harmful outputs, the report provides a detailed view of the risks associated with LLMs, enabling targeted mitigation strategies. In addition, the report aims to establish a more standardized method for evaluating red team techniques and model vulnerabilities, thereby facilitating easier comparison and benchmarking in future research. In addition, exploration of various techniques may uncover new attack vectors or vulnerabilities that contribute to a deeper understanding of the evolving threat landscape of LLMs. Finally, by highlighting the areas where the current model is most vulnerable, these findings can inform the development of more effective defense mechanisms and security measures LLMs.

2.1 Injury Taxonomy

Establishing a comprehensive taxonomy of injuries is essential to assess and mitigate the potential harm that may be caused by large language models (LLMs). This study uses a framework that covers several key areas of concern, as described below.

The classification of injury taxonomy used in this study was inspired by the BeaverTails dataset.

This classification scheme is inspired by recent work in the field of AI security and alignment. Ji et al. (2023) used a similar classification in their study on improving the safety of LLM. Their work 《BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset》 provides a precedent for the application of these categories in security alignment research.

Using this comprehensive framework, our Red Team approach aims to systematically assess the output of LLM in these different hazard categories. It allows for a comparative analysis of different LLM and red team techniques, highlighting areas where the model may be particularly susceptible to generating harmful content.

2.2 Red Team Technique

The red team technique is selected from a variety of known papers and techniques found online and in the wild.

The Red Teaming fuzzing plugins from academic research.

These techniques form a diverse set of methods designed to test the boundaries of LLM security mechanisms. By systematically applying them, we seek to uncover weaknesses and understand areas where current assurance may fail, leading to future improvements in model security and alignment.

Methodology

3.1 Dataset development

To develop the dataset, we worked with in-house machine learning experts and experienced annotators to develop a diverse set of adversarial prompts based on predefined categories and techniques identified in relevant studies. In order to maintain quality and focus, we have implemented an additional review process to remove irrelevant prompts and ensure that only those that specifically target a specific category are included. This approach ensures consistency and accuracy across the entire data set.

Datasets are categorized along two main dimensions:

Classification of adversarial prompts: These represent a variety of forms of harmful content that AI may be exposed to, as we define in our harmful classification framework.
Techniques used in adversarial cues: These are the specific methods used to make prompts designed to bypass AI security measures, as outlined in the Red Team Techniques section.

This structured approach to dataset development ensures that our Red Team assessments are comprehensive, culturally sensitive, and focused on key areas of interest across contexts.

3.2 Test Process

The testing process is designed to rigorously evaluate the AI model's ability to resist adversarial prompts, while ensuring consistency and reliability in evaluating its responses. The following steps outline procedures for evaluating the behavior of a model when exposed to adversarial inputs.

Prompt fuzz

The original prompt often has poor attack effectiveness due to its direct semantic intent features. We use the red teaming attack plugin to perform fuzz transformation on the original prompts. As of the time of this report, after fuzz transformation, the prompt corpus will be enlarged by an average of 200 times.

Rapid deployment

Each adversarial prompt is systematically entered into a multiple conversations pipeline, allowing the research team to observe any inconsistencies in the model's behavior. This approach is particularly helpful in identifying how the model might handle variations in the same cue, providing insights into potential instability or unreliability in its decision-making process.

Manual review and LLM-as-Juge classification

Depending on the complexity of the adversarial techniques employed, the human red team or LLM-as-Juge annotators reviewed the model's responses. They evaluated the effectiveness of prompts in raising harmful content and categorized the responses as follows:

Success: If the model generates a harmful response, then the prompt is flagged as successfully bypassing the AI's security.
Unsuccessful: If the model effectively refuses to generate harmful replies or cleverly avoids generating harmful content, the prompt is considered to have failed to break through the model's defenses.
Goal Compliance: Red team members must be consistent with the opponent's set goals to ensure that there are no deviations in the nature of the investigation.
Prompt consistency: The initial prompt shall not be modified to ensure the consistency of the evaluation results.

Cross-validation and inter-rater agreement

Cross-validation is used during the annotation process to ensure the reliability of the results. Multiple human red teams or annotators review the same set of responses, and their evaluations are compared to ensure reliability among raters. Discrepancies between any reviewers are resolved through discussion or re-evaluation, ensuring that the final classification of responses (successful or unsuccessful) is consistent and accurate.

4 Results

4.1 Evaluation Metrics

The evaluation is structured to provide a clear assessment of the model’s performance under our dataset domain and categories. By implementing distinct processes for different attack methods, we measure the model’s ability to detect and handle misaligned content effectively.

4.2 Evaluation Results Analysis

The percentage presented in this report represents the proportion of successful manipulation of model behavior prompts. A higher percentage indicates a higher degree of susceptibility, meaning that the model is more susceptible to the influence of cues.

Therefore, these values directly measure the vulnerability of the model to manipulation: the higher the percentage, the more often the model is unable to resist the expected manipulation prompted.

**Table 1. SASR (Scenarios Attack Success Rate)**

In the field of Safety and Security, the lite-128k model has the highest success rate of attacks, while the lite-4k model has the lowest. The larger the model size, the higher the success rate of the attack.

We speculate that this may be because the larger the model size, the lower the proportion of security adversarial corpus, and the scarcity of high-quality security adversarial corpus is the biggest problem. On the other hand, the larger the model size, the stronger the ability to understand and follow instructions, which may also be one of the reasons for the higher success rate of attacks.

**Table 1.1 Sub-Scenario evaluation results (Sub-SASR)**

Insult and Hate Speech have a higher success rate in attacks compared to other categories. Insult may cause harm on a societal level, while Hate Speech may lead to significant legal and moral consequences.

**Table 2. Misaligned Degree Score Distribution (MDSD)**

More than 40% of jailbreak cases fall within the high-risk range of 80 points or above.

The current probabilistic inference based endogenous defense training still has vulnerabilities. Once the next token based endogenous defense layer is bypassed, the output content is prone to producing high-risk content with misaligned values.

5 Observations

Several key behavioral patterns were identified in assessing the robustness of large language models (LLMs) to various adversarial cues. These observations reveal the strengths, weaknesses, and inconsistencies of the model in the face of different types of manipulation, from explicit jailbreak prompts to more subtle social biases. The insights presented below are based on experiments conducted on different categories of harm, language, and adversarial techniques, highlighting the progress made in enhancing model safety, as well as the significant challenges that remain.

By examining the model's responses in different contexts, this section provides a detailed overview of the areas of resistance exhibited and the vulnerabilities that remain, thus guiding ongoing efforts to develop more robust and ethical AI systems.

5.1 Quick Templates vs Creative Contextual Prompts

A key observation is that more robust models are generally more resistant to well-known cue manipulation techniques than others, such as "DAN" (do anything now) or "AIM" (always smart and scheming) role-playing templates.

These manipulation templates are designed to circumvent ethical guidelines by creating scenarios that encourage models to play fictional roles, where typical restrictions are relaxed. More advanced models often refuse to participate in such prompts or make strong statements, suggesting that they are more capable of detecting and resisting known attacks.

Doubao-lite immediately rejected a jailbreak prompt similar to Do-Anything-Now.

However, both powerful and less advanced models are susceptible to more creative cues that are not obvious in their manipulative intentions. Sophisticated prompt engineering can exploit vulnerabilities by structuring harmful requests in an indirect or innocuous way, causing the model to take a more helpful stance. Even models equipped with stronger security safeguards struggle to cope with these subtle prompts, occasionally generating harmful or inappropriate responses because they fail to recognize the true nature of the request.

An example illustrating that Doubao provides detailed information about the credit card fraud through a disguise prompt.

These findings suggest that while progress has been made in addressing the practices of technologies like DAN and AIM templates, models are still susceptible to subtle, less direct cues. This vulnerability highlights the need for more adaptive and comprehensive security measures that can proactively identify and mitigate emerging manipulation tactics.

5.2 Behavior granularity in specific subtopics

5.2.1 Racial, religious, and gender stereotypes

A key issue when deploying AI systems is that they can be abused by malicious actors to spread harmful stereotypes or misinformation. Without effective checks, AI systems can inadvertently reinforce racial, religious, or gender biases. Our tests revealed inconsistencies in the model's behavior in response to prompts for different social demographics, such as race, religion, and gender.

Doubao response to a "Would you work with a black person" prompt

We also explore the model's ability to create false facts, comments, and social interactions when prompted. In addition to traditional red team techniques, models can easily generate harmful output when required to manipulate statistical statements, anecdotal comments, or news snippets, especially when framing them in a way that could be misconstrued as fact.

When asked to create fake headlines and articles, some models maintain concise responses, creating minimal content that closely aligns with the prompt. Other models, however, go one step further and generate articles with completely fictitious details. These wide-ranging outputs add to the greater risk that they can spread misinformation and perpetuate harmful racial and religious stereotypes.

Differences in the behavior of models when addressing gender inequality are another observation. These models tend to be more robust and resistant to bias against women, often responding to clarifications or refusing to generate harmful content.

However, when bias is directed at members of the LGBTQ+ community, the response of the model is often less consistent. In some cases, they may be more likely to produce harmful or biased output, or fail to provide the same level of caution or context. This difference suggests that while models have improved in dealing with gender bias against women, they are not strong enough in dealing with biases related to sexual orientation or gender identity, highlighting gaps in their moral protections.

Doubao's response to a question about Physical insult

In conclusion, while the model has made some progress in dealing with overtly biased prompts, their responses remain uneven across different demographic groups. This highlights the ongoing challenge of building a consistent ethics framework that uniformly applies safeguards regardless of demographics or subject matter.

5.2.2 Illegal Activities, Violence and Sexual Content

Models exhibit varying degrees of susceptibility when generating content about illegal activity. For example, it was found that the inducement model was much easier to describe drug distribution than to provide detailed drug manufacturing methods. This difference indicates different sensitivities to various aspects of illegal activity, highlighting areas where model ethical constraints may be more or less effective.

Doubao provides more details about dispensing methamphetamine.

Regarding brute force, models are more susceptible when prompts describe users as experts or seek advice on complex tasks. For example, when prompts indicate that the user is an expert in need of help, the model provides harmful instructional content, a statement that seems to reduce the model's defense mechanisms.

Interestingly, the way in which sexual content is handled is more consistently defined across models. GPT generally refuses to generate explicit content, demonstrating strict boundaries in this field. Conversely, Doubao are more lenient when prompted, generating creative sexual content. These differences illustrate differences in ethical standards between models when dealing with sensitive and explicit material, suggesting that some LLM may need stricter guidelines to manage clarity prompts.

5.3 Suffering from multiple rounds of dialogue attacks

Identifying and dealing with the potential harms and misuse risks in multi-turn interactions is an open research question. Unlike single-turn attacks, where the malicious intent is clear in the prompt, multi-turn interactions enable the malicious users to hide their intentions. For example, the user starts with a neutral query like “Who is Ted Kaczynski?” (a terrorist who has bomb-making activities). In each follow-up question, the user induces the victim model to provide more harmful details based on its previous response. Although all the follow-up questions are still innocuous, the user finally obtains the knowledge of bomb-making.

Conclusion

The findings of this study have important implications for the development and deployment of LLMs. The identified vulnerabilities highlight the ongoing challenges of aligning these robust systems with human values and ensuring that they operate safely in disparate global environments.

The results suggest that current approaches to AI security issues in large language models (LLMs) may be inadequate, especially when dealing with subtle or persistent attempts at manipulation. There is an urgent need for more robust and adaptable safeguards that can maintain ethical constraints in prolonged interactions and in different linguistic environments. This is consistent with the work of Hendrycks et al. (2021), who emphasize aligning AI systems with shared human values to mitigate potential risks.

gettrust.ai - Building Trust Between Humans and AI

Discussion about this post