TrustAI-LMAP (large language model mapper) User Manual

Jul 15, 2024

What’s TrustAI-LMAP

Basically, TrustAI LMAP is like NMAP for LLM.

Who needs TrustAI-LMAP

End users can use TrustAI LMAP to conduct trial testing on their commonly used GenAI Applications to confirm whether there are risks such as hallucinations and harmful content generation, which is crucial for protecting their privacy information and physical and mental health.
LLM service providers can use TrustAI LMAP to conduct comprehensive compliance testing and multi-dimensional adversarial testing for beta versions in development and testing, in order to quantify the security level of the model as much as possible before release, and develop defense measures and countermeasures in advance.
Enterprises who operating GenAI Application can use TrustAI LMAP to conduct comprehensive testing on content compliance, system security, data security, compliance governance, and other aspects of their online and developing LLM Agents, in order to avoid serious social public opinion or regulatory penalties caused by potential hacker attacks.
Government information technology departments and compliance governance departments can use TrustAI LMAP to continuously monitor and track the LLM industry situation, timely discover potential negative social public opinion and security risks, and continuously promote the rectification of relevant enterprises and units through quantifiable evaluation scores and results, and continuously push the security alignment of the entire society in the field of artificial intelligence.

How TrustAI LMAP enhances security alignment of AI

Improving the security alignment of artificial intelligence is a broad topic and not an easily achievable goal. It may require various stakeholders such as security communities, service providers, businesses, end-users, and governments to work continuously for 10 years or even longer to achieve a dynamic balance that is acceptable to the entire society.

Currently, artificial intelligence technology is rapidly integrating from efficiency tools, chat and entertainment tools, to intelligent applications, general terminal intelligent operating systems, and even embodied robot brains. In the foreseeable future, the security of artificial intelligence is not only related to the stability of the economy and society, but also closely related to the safety of every person's life on Earth.

We believe that the following things are crucial for artificial intelligence security：

Build a positive interactive community ecosystem, fully leverage the power of industry, community, and academia, and form a continuous, dynamic, and intelligent LLM security situational awareness and detection system, using adversarial testing and disclosure to promote defense and governance, and driving the continuous evolution of LLM security through the power of seeing.
Build a large-scale risk intelligence system that covers the global community, continuously conduct advance awareness of security risks and 0day vulnerabilities, to help enterprisesminimize reputation and asset losses to the greatest extent possible.
Build a benign co governance vulnerability disclosure system and platform, guide the security community to disclose LLM security risks and vulnerabilities in a reasonable and orderly manner, while fully considering the reasonable interests and demands of all stakeholders.

To address the security challenges and social concerns currently faced by LLM, we propose the TrustAI LMAP solution.

Essentially, TrustAI LMAP is a baseline assessment system for LLM and a tool to assist in improving the automation level and fuzz testing efficiency of LLM RedTeam. At the same time, TrustAI LMAP is an open platform for security researchers and developers.

Specifically, TrustAI LMAP provides the following core capabilities:

Everyone can upload their own unique security test set and conduct custom evaluations through the flexible and convenient LLM adaptation interface.
Default built-in a set of security testing datasets, including scenarios such as prompt injection, jailbreaking, PII loss, and hallucinations. Greatly reduces the threshold for users to conduct LLM testing and also lowers the learning threshold for LLM attacks.
A taxonomy of large language model red teaming strategies and techniques, at the same time, we are constantly integrating the latest Red Team strategies and technologies discovered by public sources, sota academic theory research, and our internal Red team.
Structured evaluation report, TrustAI LMAP's Risk Classifier automatically generates structured risk types and risk scores based on the response content of the tested target. At the same time, considering interoperability with the user's internal system, TrustAI LMAP generates structured information compatible with MISP specifications of the AVID taxonomies for security risks discovered during the testing process.

We believe that a feasible solution to continuously improve LLM security alignment goals is to scientifically and rigorously advance this field, through creative theoretical innovation, comprehensive structured vulnerability mining and evaluation methods, and joint efforts of the security community, to build a dynamic and sustainable large-scale model situational awareness system.

What’s LLM Rad Team？

An important part of secure delivery software is red team testing, which broadly refers to the practice of simulating real-world adversaries and their tools, strategies, and programs to identify risks, discover blind spots, validate hypotheses, and improve the overall security status of the system.

The practice of artificial intelligence red team testing has developed to have a broader meaning, including：

Detecting security vulnerabilities
Detecting other system failures, such as generating potentially harmful content

The intersection and differences between AI red teams and traditional red teams are：

The scope of the AI Red Team is broader. The AI Red Team is now a general term for detecting safety and RAI(responsible AI) results. The AI red team intersects with traditional red team targets, for example, some targets may include stealing underlying models. But AI systems also inherit new security vulnerabilities, such as prompt injection and poisoning, which require special attention. In addition to safety objectives, the AI Red Team also includes detecting fairness issues (such as stereotyping) and harmful content (such as glorifying violence) as results. The AI Red Team helps to detect these issues early.
AI Red Team testing focuses on unexpected outcomes that may arise from malicious and legitimate users. The AI Red Team test not only focuses on how malicious opponents can disrupt AI systems through security techniques and vulnerabilities, but also on how the system generates problematic and harmful content when interacting with ordinary users. Therefore, unlike traditional security red team testing that primarily focuses on malicious opponents, AI red team testing considers a wider range of roles and failures.
AI systems are constantly evolving. AI applications often undergo changes. For example, in large language model applications, developers may modify the meta prompt (prompt code) based on feedback. Although traditional software systems may also undergo changes, the speed of change in AI systems is faster. Therefore, it is important to conduct multiple rounds of red team testing on the AI system and establish an automated measurement and monitoring system for the system over time.
The red team generative AI system requires multiple attempts. In traditional red team exercises, using tools or techniques on the same input at two different time points always produces the same output. In other words, traditional red team exercises are generally deterministic. Generative AI systems, on the other hand, are probabilistic. This means that running the same input twice may produce different outputs. This is due to design, as the probabilistic nature of generative AI allows for a wider range of creative outputs. This also makes the Red Team exercise tricky, as prompts may not lead to failure in the first attempt, but will achieve success in subsequent attempts.
Mitigating AI failures requires deep defense. Just as in traditional security, issues such as phishing require various technical mitigation measures (such as strengthening hosts to intelligently identify malicious URL or attachments that contains malicious virus), fixing faults discovered through AI red teams also requires a defense in depth approach. This involves techniques such as "using classifiers to label potentially harmful content" and "using metapromps to guide behavior", alternatively, input and output security risks can be addressed by integrating TrustAI Guard.

The following figure is based on the development trend of ubiquitous artificial intelligence and describes the specific scope of the Red Team's work.

In general,

The closer the LLM service to be tested is to the standard HTTP interface or SDK API form, the greater the space for automated process optimization.
The closer the LLM service to be tested is to traditional non-standard GUI software, or involves a large number of forms that are considered dependent on operation steps, the more difficult it is to complete the red line through automation.

Why is baseline evaluation useless

Taking toxicity content generation evaluation as an example, using off-the-shelf prompt datasets for seeing how toxic a model’s generations are doesn’t scale.

The dataset can be big - RealToxicityPrompts is 3.7GB compressed - and that’s a hefty item to eval over as an iterative development target.

More importantly, models are changing all the time, and tactics and mitigations that work for one model (or model family) aren’t guaranteed to work for others. Even more crucially, a fixed test target - like a set of prompts - is going to become less useful over time as people develop better and different techniques to reducing certain behaviors. Just like dataset “rot” in machine learning, where things like MNIST become less representative of the underlying task over time because research has become overfit to them, prompt datasets also aren’t a be sustainable route for investigating propensity to generate toxicity in the long term. As people work out how to fix the problems a particular dataset’s data points present, that dataset becomes easier, but also a worse reflection of the real world task it’s meant to represent.

This dataset rot has a subtle effect: while scores continue to go up, and newer models get better at addressing a dataset - maybe even because the dataset gets into their training corpus via being published on the web - the proportion of the dataset that’s useful, that’s representative of the broader world, shrinks and shrinks. In the end, we see a high score where only a tiny part of the dataset represents current real-world performance. This is natural, and happens over time, but gradually dataset-driven metrics become detached from reality over time.

So what can we do about all this? One practice, adopted from the military into infosec and then info machine learning eval, is red teaming, where humans try to get a system to fail. Humans are pretty creative, and usually up-to-date, and it works pretty fine.

Red Team VS Blue Team: What's the Difference? - CrowdStrike

How TrustAI LMAP works

TrustAI LMAP is available as a Software as a Service (SaaS) website online product and is built on our constantly evolving security intelligence platform, continuously responding to the evolving and emerging LLM risks.

Our security intelligence platform combines insights from our TrustAI Red Team, and the latest LLM security research and techniques.

To learn more about working with the TrustAI LMAP, You can contact us via email：andrew@trustai.pro.

Evaluation Datasets

In the long run, as the data volume of LLM reaches trillions, there is a tendency for marginal effects to decrease between the amount of data and the performance benefits of the model. From this perspective, a set of technology evaluation standards recognized within the industry is particularly important. It is necessary to promote the deep integration of LLM with the industry through technology evaluation, and support industrial transformation.

IDC divides the LLM into three layers：

service ecosystem
product technology
industry applications

The capabilities of each layer are evaluated, mainly focusing on

algorithm models
general capabilities
innovation capabilities
platform capabilities
security interpretability
the application industry of the LLM
as well as supporting services and the LLM ecosystem.

TrustAI LMAP continuously tracks and integrates the latest evaluation standards and datasets in the industry. At the same time, we also have a professional Red Team research team internally, which continuously generates new adversarial samples and integrates them into the TrustAI LMAP platform by drawing on security communities, academic conference papers, and LLM based human-in-the-loop semi-automatic mining methods.

Mutation Module

Humanity has always had a history of challenges and breakthroughs in computing technology. The early Internet was often defeated by some programs and geeks who aimed to exploit vulnerabilities and spread them automatically. Defeating game copyright protection is a cat and mouse hobby for many people, rather than for economic gain. Early iPhone users would jailbreak their devices to change the background image.

The interaction between humans and machine learning models is no exception. People attack these models to understand their weaknesses, which has become so organized and effective that many large enterprise technology companies have dedicated attack teams that try to make the models fail in specific ways. At the same time, the work of these attack teams is integrated into the production process, and national governments also seek their opinions.

However, these targeted technologies often come with relatively high barriers to entry. For example, professional knowledge is needed to circumvent the manufacturer's root control over the phone.

Nowadays, the situation has changed. With the advent of the era of Large Language Models (LLMs), the barriers to entry for people attacking LLMs have become lower: as long as a person understands natural language, they can interact with LLMs through this medium alone. Therefore, the practice of attacking LLMs as a grassroots movement erupted through accessible chat based interfaces and language as a medium. So, compared to traditional attacks on machine learning models, LLM attacks have the following significant characteristics：

Activities are essentially about finding limits：People quickly come up with various new malicious use cases in their minds, but many damages are caused in fairly ordinary attacks.
The process is manual, not automated：Although some red teams are working hard to automate the process of vulnerability mining and testing. But overall, most people still agree that manual excavation is still a source of inspiration. The manual process is particularly interesting because for the first time in history, computer hacking attacks can be conducted through natural language without requiring any specific computational knowledge. Not only does it eliminate entry barriers, but the results can also be evaluated and appreciated by most people who can read and write, as the output is text. In addition, many tips and tricks for language model hacking attacks can be widely obtained, which may change the way more people generally think about computers and computer failure modes.
It is ultimately a team effort：Most people say they have found inspiration from other people's tips and jailbreaks.
It requires a bit of alchemist mentality：Many security researchers suggest that a strategy to understand LLM jailbreak vulnerabilities is to completely abandon any theoretical speculation and explanation of the model and its output, and accept the chaotic nature of LLM, which refers to hints as spells, these models as demons, and so on. This mentality is very useful in keeping you honest and not overly trusting your own hype about these models and their abilities.

Therefore, in the era of LLM, security research targeting LLM and attacks targeting LLM are also undergoing a dramatic paradigm shift. In summary, the main challenges faced are as follows：

The difficulty of jailbreak mining varies greatly depending on different attack intent questions and target models to be tested.
Because LLM is essentially a probabilistic model, attacks cannot be reliably replicated 100% and often miss the capture of 0day vulnerabilities due to insufficient attempts.
LLM vulnerability mining lacks a systematic mining theory and attack method classification framework.
The cause of LLM vulnerabilities lacks interpretability theory, and the effectiveness of LLM endogenous defense lacks theoretical support.

To address the aforementioned challenges and difficulties, we believe the key lies in building a systematic and structured taxonomy of large language model red teaming strategies.

A taxonomy of large language model red teaming strategies

We believe that a systematic and structured classification system can provide a starting point and framework for discussion and further research. Each strategy and technique can be seen as inspiration for imagining other red team methods or blue team defenses.

TrustAI LMAP is also continuously enriching and improving its taxonomy architecture and attack techniques.

Risk Classifier

Once TrustAI LMAP discovers something, it will report exact prompts, targets, and responses, so you will get a complete log that includes everything worth checking and why it may be an issue.

It is not easy to determine when LLM goes wrong. Although this may sometimes be obvious to humans, the testing process of TrustAI LMAP typically generates tens of thousands of outputs, thus requiring automatic detection of language model failures.

The Risk Classifier in TrustAI LMAP is designed for this purpose, by searching for keywords or using machine learning classifiers to determine whether and to what extent the output is at risk.

Getting started with TrustAI LMAP

In order to provide users with a clearer and more intuitive understanding of the working principle of TrustAI LMAP, so that Red Team personnel can effectively apply TrustAI LMAP to their daily testing work, we have developed a simple website UI that encapsulates API interface calls. Users only need to click 2-3 times with the mouse to complete the preparation work for Red Team testing, and depending on the size of the scanning task, it takes between 30 seconds and 1 hour to conduct a Red Team testing.

At the same time, we are developing SaaS API interfaces that allow users to easily integrate TrustAI LMAP with their existing security scanning and vulnerability management systems within their own enterprises. After obtaining scan reports through the API interface, the scan results can be imported into their internal SOC platform for unified operation and security governance.

Below, I will introduce the usage of TrustAI LMAP in typical scenarios.

Determine the target model, and verify its connectivity

Basically, the target LLM to be evaluated can be divided into three main forms:

LLM provider's MaaS platform：Typically a fixed endpoint(like “api.openai.com“, etc.) that needs specifies the model name and hyperparameters such as temperature and maximum token count through parameters.
SaaS API interface providing GenAI capabilities：This type is commonly used in public service scenarios such as weather forecasting, content subscriptions, etc.
GenAI Application API independently deployed and maintained by enterprises.

You can select a target to be tested through a tab. We provide tabs for both the Maas interface and custom interfaces. Users only need to click the mouse to select a specific LLM tatget.

For each tab, the model name and hyperparameters are already configured by default, and users can also manually modify the hyperparameters in the input box.

After configuration is complete, users can click on "Verify Integration" to perform connectivity verification, and TrustAI LMAP will sending a test query (such as "hello, what can u do for me?”), and display the response results of the target LLM to be tested on the UI.

As shown in the above figure, if the target model returns a status code of 200 and generates appropriate response content, it indicates that the connectivity is normal.

Choose a secure adversarial datasets

LLM model evaluation is a broad and highly comprehensive theoretical and engineering challenge. In order to evaluate whether LLM has the potential to become a universal intelligent system, the evaluation set often needs to include thousands of annotated datasets from different fields, covering language, mathematics, science, humanities, logical thinking, programming ability, comprehensive knowledge, security, and other areas.

At the current stage, TrustAI LMAP mainly focuses on security adversarial testing, which is also the main area of concern in red team testing. We are constantly improving our comprehensive risk taxonomy, which can be roughly divided into five categories:

Security vulnerabilities (such as XSS、CSRF, etc.)
Prejudice (such as spreading stereotypes or algorithmic unfairness)
Toxic content (such as defamation or pornographic content)
Hallucination content (such as fictional or misleading)
PII loss (such as leaking passwords or social security numbers)

Generate adversarial datasets

According to the taxonomy of large language model red teaming strategies, we have abstracted different mutator modules.

Based on the taxonomy of large language model red teaming strategies, we have abstracted different mutation modules. After configuring the "Security Testing Dataset" and "Mutation Modules", users can click the "Generate adversarial datasets" button to generate and download the transformed adversarial prompt words.

As shown in the above figure, we have selected two security testing datasets, "Prompt_iInjections_1" and "Jailbreak_Test", and then selected the "Steganography" deformation template. After clicking the "Generate adversarial datasets" button, TrustAI LMAP automatically generates and downloads the generated mutated adversarial security testing dataset. The mutations fuzz time may vary from 1 second to 1 hour depending on the selected dataset and the complexity of the deformation module.

Upload custom test datasets

In addition to selecting the default built-in security testing dataset, users can easily convert their own or collected jailbreak prompts from the community into CSV or JSON files according to the sample format and upload them to the TrustAI LMAP cloud.

The custom security test dataset needs to follow a certain format, and users can click "Download custom dataset sample" to download the custom dataset sample.

After successful upload, the uploaded file can be viewed in the "Custom CSV" or "Custom JSON" module. Users can select the "Custom CSV" or "Custom JSON" test dataset and use these custom security test datasets in subsequent deformation generation and evaluation.

Users can click "Delete all custom test datasets" to delete all custom security test datasets uploaded in the past and upload new custom security test datasets.

Start security adversarial testing

When the user has configured the target to be tested, selected security test datasets or custom security test datasets, and chosen the mutation module, they can click "Run Scan" to start the scan.

Due to the fact that the website UI is only used for demo demonstrations, the scanning frequency may not be relatively high.

We have shown the progress of the scan in the 'Scan Results' section below. Depending on the selected evaluation dataset size and the mutation module, the scan will take anywhere from 1 minute to 60 minutes. Users can switch to other pages to continue working and wait for TrustAI LMAP backend scanning to complete.

Download the evaluation report

After the TrustAI LMAP backend completes scanning the tasks submitted by the user, the user can click "Download All Reports" to download the scan report.

In the report, the prompt column represents the prompt query sent to the target LLM, and the response column represents the response result of the target LLM.

By reviewing the scan report, users can clearly see the security risks of insufficient endogenous defense in the target LLM when facing attacks.

Here are a few security risks discovered by practical tools for reference only.

A tutorial on illegal intrusion guidance after jailbreak

Prompt hijacking leads to prompt leakage

Prompt hijacking leads to improper political speech

Guide LLM to generate prohibited content after successful jailbreak through scarce language attack

Guide LLM to produce discriminatory speech after jailbreaking through long tail encoding

Guide LLM to generate illegal behavior through ICL jailbreak tutorial

Submit scanning tasks through API interface(enterprise version only)(preview)

There are two reasons why enterprise customers hope TrustAI LMAP can provide SaaS API interfaces:

After the beta version completes the repair of historical vulnerabilities, developers often need to conduct comprehensive regression testing, which requires a lot of scanning time.
SaaS based API interfaces facilitate interoperability between TrustAI LMAP and user internal systems.

Do we even need human red-teamers

The last question, do we even need human red-teamers? We know our redteam models are capable of producing a broad range of output, and even if they only get single-digit success rates, runnning them can be scaled easily.

My answer here is a strong yes - we really do need red-teamers.

Firstly, most people attack the same thing, few people are creative, and there’s not much info on the creative attempts.
Further, LLMs, like other models, do have a tendency to regress to the mean, and be a bit bland. This means the range of automatic red teaming tactics is not likely to be broad. We can alter the generation temperature, but this doesn’t lead to structured approaches, although it’s something that can be scaled. Scaling high-temperature generation can only receive a hit yields diminishing returns in efficiency.

Contact Us

To learn more about working with the TrustAI LMAP, You can contact us via email：andrew@trustai.pro.

You can also visit our company's website to learn more information.

Embrace the AI

Discussion about this post