Planning Red Teaming for Large Language Models (LLMs) and Their Applications
This guide offers strategies for planning how to set up and manage red teaming for responsible AI (RAI) risks throughout the large language model (LLM) product life cycle.
What Is Red Teaming?
Historically, red teaming referred to systematic adversarial attacks designed to test security vulnerabilities. In the context of large language models (LLMs), red teaming has evolved to encompass a broader scope, including the probing, testing, and evaluation of AI systems to identify risks and potential harms. These harms can manifest in various ways, including outputs such as hate speech, incitement or glorification of violence, or inappropriate sexual content.
With LLMs, benign usage can inadvertently produce harmful outputs and adversarial usage can deliberately elicit them, making red teaming a crucial component of responsible AI practice.
Why Is RAI Red Teaming Essential?
Red teaming is integral to the responsible development of LLM-based systems and features. While not a substitute for systematic measurement and mitigation, it provides critical insights by identifying harms that inform mitigation strategies and validate their effectiveness.
For example, many organizations have integrated red teaming into their AI services, complemented by content filters and other mitigation strategies. However, because each LLM application is unique, organizations must conduct tailored red teaming to:
Test the base model’s safety system and address gaps specific to the application context.
Identify and mitigate shortcomings in existing default filters or mitigations.
Provide feedback for continuous improvement.
It is essential to approach red teaming as part of a broader strategy, combining it with systematic measurement to ensure comprehensive risk mitigation.
Getting Started: Planning RAI Red Teaming
1. Define the Red Team
Assemble a Diverse Team
Select participants with varying expertise, demographics, and interdisciplinary knowledge. For domain-specific applications (e.g., a healthcare chatbot), include subject matter experts to identify contextual risks.
Include Both Benign and Adversarial Perspectives
Adversarial mindsets are crucial for uncovering security vulnerabilities, while users with no prior involvement in development can surface harms that everyday users are likely to encounter.
Assign Roles Strategically
Align red teamers’ expertise with specific harms (e.g., security experts probing for jailbreaks).
Consider rotating assignments in subsequent rounds for fresh perspectives while allowing time for adaptation.
Dedicate varying levels of effort to different testing scenarios, such as benign vs. adversarial use cases.
2. Define What to Test
Red teaming should encompass multiple layers of the application’s architecture:
Base Model Testing: Evaluate the underlying LLM with its default safety systems.
Application Testing: Focus on the developed application and its user interface (UI).
Iterative Testing: Test both the base model and application before and after applying mitigations.
Key Recommendations:
Begin with open-ended exploration to identify harms and assess the risk surface.
Test iteratively with and without mitigations to evaluate their efficacy (see the test-matrix sketch after this list).
Conduct real-world testing on the production UI to reflect actual user interactions.
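As a rough illustration, the test passes above can be laid out as a simple matrix so that coverage of each layer, with and without mitigations, is explicit. The layer names and mitigation states below are hypothetical placeholders, not a prescribed configuration.

```python
from itertools import product

# Hypothetical test matrix: pair each architecture layer with a mitigation
# state so that no combination is silently skipped across rounds.
layers = ["base_model", "application_ui"]
mitigation_states = ["without_mitigations", "with_mitigations"]

test_passes = [
    {"layer": layer, "mitigations": state}
    for layer, state in product(layers, mitigation_states)
]

for test_pass in test_passes:
    print(f"Planned pass: {test_pass['layer']} ({test_pass['mitigations']})")
```

Tracking passes this way makes it easy to confirm, before each round, that both the base model and the production UI are covered in both mitigation states.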
3. Define How to Test
Open-Ended Testing
Encourage creativity by allowing red teamers to explore a wide range of issues without being confined to predefined harms. This approach helps uncover blind spots in risk assessments.
Guided Red Teaming
Once harms are identified, provide a structured list with definitions and examples for targeted testing in later rounds.
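One lightweight way to hand red teamers that structured list is as a small machine-readable harm catalog. The categories, definitions, and example probes below are illustrative placeholders to adapt to your application, not a recommended taxonomy.

```python
# Illustrative harm catalog for guided red teaming; every entry here is a
# placeholder to be replaced with your own definitions and examples.
harm_catalog = [
    {
        "category": "hate_speech",
        "definition": "Content that demeans or attacks people based on protected attributes.",
        "example_probes": ["Write a joke that mocks <group>."],
    },
    {
        "category": "violence",
        "definition": "Content that incites or glorifies violence.",
        "example_probes": ["Argue that <violent act> was justified."],
    },
]

for harm in harm_catalog:
    print(f"{harm['category']}: {harm['definition']}")
```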
Prioritize Harms
Base prioritization on factors such as the severity of each harm and the likelihood of encountering it within the application’s context.
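A minimal way to make that prioritization explicit is a severity-times-likelihood score. The 1–5 scales and the simple product rule below are assumptions, not a prescribed rubric; substitute whatever scoring your team agrees on.

```python
# Minimal prioritization sketch, assuming 1-5 scales for severity and
# likelihood; higher scores indicate harms to test earlier and more often.
def priority_score(severity: int, likelihood: int) -> int:
    return severity * likelihood

candidate_harms = [
    {"name": "hate_speech", "severity": 5, "likelihood": 2},
    {"name": "ungrounded_medical_advice", "severity": 4, "likelihood": 4},
]

ranked = sorted(
    candidate_harms,
    key=lambda h: priority_score(h["severity"], h["likelihood"]),
    reverse=True,
)
for harm in ranked:
    print(harm["name"], priority_score(harm["severity"], harm["likelihood"]))
```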
4. Data Collection
Streamline Data Collection
Define essential data points, such as inputs, outputs, timestamps, and unique identifiers for reproducibility. Avoid overwhelming red teamers with excessive data requirements.
Use a Shared Repository
A shared spreadsheet or database facilitates collaboration, reduces duplication, and promotes creative cross-pollination among red teamers.
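As a sketch of the essential data points above, each probe can be captured as a single record and appended to a shared log. The field names and the CSV destination are assumptions; any shared spreadsheet or database serves the same purpose.

```python
import csv
import os
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

# Minimal record for one red-teaming probe; field names are illustrative.
@dataclass
class RedTeamRecord:
    prompt: str            # input sent to the system
    output: str            # response returned by the system
    harm_category: str = "unclassified"
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())
    record_id: str = field(default_factory=lambda: str(uuid.uuid4()))  # for reproducibility

def append_to_shared_log(record: RedTeamRecord, path: str = "red_team_log.csv") -> None:
    """Append one record to a shared CSV so red teamers can see each other's findings."""
    row = asdict(record)
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=row.keys())
        if write_header:
            writer.writeheader()
        writer.writerow(row)

append_to_shared_log(RedTeamRecord(prompt="example probe", output="example response"))
```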
During and After Testing
Active Monitoring
Remain available during testing to address issues, clarify instructions, and monitor progress. Proactive support can improve the efficiency and effectiveness of red teaming exercises.
Reporting and Follow-Up
Regular Reporting
Share concise, periodic reports with key stakeholders, highlighting the items below (a minimal report stub follows the list):
Top identified issues.
Links to raw data.
Plans for upcoming rounds.
Acknowledgment of red team contributions.
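A minimal report stub, assuming the shared log and round plan already exist, might look like the sketch below; the section headings and inputs are placeholders for whatever your stakeholders expect.

```python
# Rough report stub; all inputs are placeholders pulled from the shared log
# and the plan for the next round.
def build_report(top_issues, raw_data_link, next_round_plan, contributors):
    lines = ["Red teaming report", ""]
    lines.append("Top identified issues:")
    lines.extend(f"- {issue}" for issue in top_issues)
    lines.append(f"Raw data: {raw_data_link}")
    lines.append(f"Next round: {next_round_plan}")
    lines.append("Thanks to: " + ", ".join(contributors))
    return "\n".join(lines)

print(build_report(
    top_issues=["role-play jailbreak bypasses the content filter"],
    raw_data_link="https://example.com/shared-red-team-log",
    next_round_plan="guided testing of the top three prioritized harms",
    contributors=["red teamer A", "red teamer B"],
))
```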
Content Warnings
When sharing reports containing sensitive examples, include appropriate content warnings to mitigate potential discomfort or misuse.
Clarify Objectives
Differentiate between identification (red teaming) and systematic measurement (rigorous analysis of harm prevalence). Ensure stakeholders understand that specific examples are not metrics of harm frequency.