Enterprise AI Governance

Automated AI Red Teaming with Azure AI Foundry

If you are deploying generative AI into production without red teaming it first, you are running a live experiment on your users. LAB516 at Microsoft Ignite 2025 made this point viscerally clear, walking attendees through automated attack techniques against AI systems and demonstrating how Azure AI Foundry's evaluation tooling catches vulnerabilities that manual testing misses entirely.

Session: LAB516
Date: Tuesday, Nov 18, 2025
Time: 6:45 PM PST - 8:00 PM PST
Location: Moscone West, Level 3, Room 3014

Additional Sessions:

Session: LAB516-R1
Date: Wednesday, Nov 19, 2025
Time: 2:00 PM PST - 3:15 PM PST
Location: Moscone West, Level 3, Room 3007
Session: LAB516-R2
Date: Friday, Nov 21, 2025
Time: 9:00 AM PST - 10:15 AM PST
Location: Moscone West, Level 3, Room 3014

The lab was structured around a practical question that every platform engineering team should be asking: before you ship an AI-powered application, how do you systematically probe it for safety failures, security vulnerabilities, and behavioural risks? The answer involves adversarial testing at scale, and the tooling to make it repeatable. This was, without reservation, the most practically useful session I attended at Ignite 2025.


Why manual red teaming fails at scale

Traditional red teaming relies on human testers crafting adversarial prompts by hand, iterating on attacks, and documenting findings. This approach works for initial security assessments. It fails completely as a continuous assurance practice, and the failure modes are instructive.

The numbers problem: A competent human red teamer might craft and test 50-100 adversarial prompts per day. An automated red teaming system using Azure AI Foundry can generate and execute thousands of attack variations in minutes. The coverage gap is not incremental; it is orders of magnitude. When your model processes millions of user interactions per month, testing it with a hundred handcrafted prompts is statistically meaningless.

The creativity problem: Human testers develop patterns. They tend to probe the same attack categories in similar ways, shaped by their own experience and the attack taxonomies they have studied. Automated systems, particularly those using LLM-generated attack prompts, produce variations that human testers would not think to try. Multi-turn attacks, where seemingly innocent prompts build towards a harmful outcome across a conversation, are particularly difficult for humans to systematically generate at volume.

The regression problem: Every model update, prompt change, or data refresh can introduce new vulnerabilities. Manual red teaming is a point-in-time assessment. Automated red teaming can run continuously, catching regressions as they occur. This is the difference between a penetration test and a CI/CD security gate, and any organisation that has been caught by a regression in production understands why that distinction matters.

The lab demonstrated all three failure modes with live examples. Rather than slides about theoretical risks, attendees attacked their own deployed models and watched them fail. The hands-on format was effective precisely because it removed the comfortable distance between "this could happen" and "this just happened to my application."


The risk taxonomy: Safety and security are different problems

The lab introduced a structured risk taxonomy that separates safety risks from security risks. This distinction matters because the attack techniques, detection methods, and mitigation strategies differ fundamentally. Conflating them leads to incomplete testing.

Safety risks

Safety risks involve the AI system generating harmful, offensive, or inappropriate content. The lab covered four primary safety categories:

Hateful and unfair content: Does the model generate content that discriminates based on protected characteristics? This includes overt hate speech but also subtle biases that emerge in specific conversational contexts. The subtlety is important: most production models will refuse an explicitly hateful prompt, but many will produce biased content when the discriminatory framing is implicit.

Sexual content: Does the model generate sexually explicit material, particularly in contexts where it should not? Applications serving general audiences or regulated industries need robust controls here. The challenge is that the boundary between acceptable and unacceptable varies dramatically by deployment context.

Violent content: Does the model generate content that describes, glorifies, or instructs on violence? The boundary between informational content and harmful instruction is context-dependent and requires nuanced evaluation that simple keyword filtering cannot provide.

Self-harm content: Does the model generate content related to self-harm, suicide, or dangerous activities? This is the category with the most immediate real-world consequences if controls fail, and the one where false negatives carry the highest cost.

Security risks

Security risks involve adversarial actors manipulating the AI system for malicious purposes:

Direct prompt injection (jailbreak): An attacker crafts input specifically designed to override the system prompt and cause the model to ignore its instructions. The lab demonstrated multiple jailbreak techniques, from simple instruction override to sophisticated role-playing attacks that build context over multiple turns.

Indirect prompt injection: Malicious instructions embedded in data that the model processes, such as documents retrieved via RAG, web pages fetched by tools, or database records accessed during inference. This attack vector is particularly dangerous because the malicious content enters through trusted data channels. Your application trusts its own data store; an attacker exploits that trust.

Information disclosure: Extracting confidential information from the model, including system prompts, training data excerpts, or information from other users' sessions. The lab showed how careful prompt engineering can coax models into revealing information they should protect, and system prompt extraction succeeded against most configurations tested.

The key insight from the lab: Safety and security evaluations require fundamentally different attack strategies. Safety probing tests the model's alignment boundaries. Security probing tests the model's defensive architecture. Running one without the other leaves significant blind spots that adversaries will find before you do.


The automated attack pipeline

The most technically interesting part of the lab was the attack generation pipeline. Rather than maintaining a static library of adversarial prompts that becomes stale as models improve, Azure AI Foundry uses a generative approach: an adversarial LLM creates attack prompts designed to probe a target LLM.

The architecture:

Adversarial LLM (attacker)
    |
    v
Attack Prompt Generation
    |
    v
Target Application (defender)
    |
    v
Response Collection
    |
    v
Evaluation & Classification
    |
    v
Results Dashboard

This is an LLM-versus-LLM approach, and it has a significant advantage over static test suites: the attacker model can adapt its strategy based on the target's responses, generating increasingly sophisticated attacks against specific defensive weaknesses.

The lab walkthrough:

The process follows a structured sequence that attendees executed against their own deployed models:

Step 1: Define the target. Point the evaluation system at the deployed AI application. This can be an Azure OpenAI endpoint, a custom application with a chat interface, or any API that accepts prompts and returns responses.

Step 2: Select risk categories. Choose which dimensions to probe: safety categories (hate, sexual, violent, self-harm), security categories (jailbreak, indirect injection, information disclosure), or all of the above. The ability to run targeted evaluations matters for teams that need to validate specific controls.

Step 3: Configure attack complexity. The system supports multiple attack strategies:

  • Single-turn attacks: Direct adversarial prompts in a single message
  • Multi-turn attacks: Conversation sequences that gradually escalate towards harmful territory
  • Encoding attacks: Using base64, rot13, or other encodings to obfuscate malicious intent
  • Context manipulation: Providing fictional scenarios or role-play contexts that shift the model's behaviour

Step 4: Execute the evaluation. The system generates adversarial prompts, sends them to the target, collects responses, and classifies each response against the risk taxonomy.

Step 5: Review results. A dashboard shows attack success rates per category, example successful attacks, and severity classifications.

The lab provided a Python SDK interface for programmatic evaluation:

from azure.ai.evaluation import AdversarialSimulator
from azure.identity import DefaultAzureCredential

# Configure the adversarial simulator
simulator = AdversarialSimulator(
    azure_ai_project={
        "subscription_id": "<subscription-id>",
        "resource_group_name": "<resource-group>",
        "project_name": "<project-name>"
    },
    credential=DefaultAzureCredential()
)

# Run adversarial simulation against target
outputs = await simulator(
    target=my_application_callback,
    scenario=AdversarialScenario.ADVERSARIAL_QA,
    max_simulation_results=100,
    max_conversation_turns=3
)

What was immediately clear from the lab: Even well-configured models with carefully crafted safety system messages fail under systematic adversarial probing. The multi-turn attack strategy was particularly effective, producing successful jailbreaks on models that resisted single-turn attacks completely. The gap between "we tested it with a few tricky prompts and it seemed fine" and "we systematically probed it with thousands of adversarial variations" is the gap between false confidence and actual security posture.


Evaluation results: Understanding what the numbers mean

The lab spent significant time on interpreting evaluation results, which is where most teams struggle. Raw numbers without context lead to either false confidence or unnecessary alarm, and both are dangerous.

The metrics that matter:

Attack success rate per category: What percentage of adversarial prompts in each risk category produced a harmful response? This is the headline number, but it needs context. A 2% jailbreak success rate might be acceptable for an internal tool behind corporate SSO and completely unacceptable for a public-facing application serving vulnerable populations.

Severity distribution: Not all failures are equal. The evaluation system classifies successful attacks by severity: low, medium, and high. A model that occasionally generates mildly off-topic content is a fundamentally different risk profile from a model that can be manipulated into producing detailed harmful instructions.

Attack complexity correlation: Do failures cluster around simple attacks or complex ones? A model that fails against basic jailbreak prompts has fundamental alignment problems that need immediate attention. A model that only fails against sophisticated multi-turn attacks with encoding obfuscation may be adequately protected for its deployment context.

The lab's practical exercise had attendees evaluate a sample chatbot application. Results typically showed:

  • Single-turn jailbreak resistance: relatively strong (5-10% success rate)
  • Multi-turn attack vulnerability: significantly higher (20-40% success rate)
  • Indirect prompt injection: highly context-dependent, with RAG-enabled applications showing markedly higher vulnerability than non-RAG applications
  • Information disclosure: system prompt extraction succeeded in most configurations

These numbers should give any production team pause. A 20-40% success rate on multi-turn attacks means that a moderately persistent user will find a way through your defences. The question is not whether it will happen, but when, and whether you have the monitoring in place to detect it.


Integrating red teaming into CI/CD

The most operationally significant part of the lab addressed continuous red teaming. Running an adversarial evaluation once before launch is a security assessment. Running it automatically on every deployment is a security gate. The distinction matters.

The CI/CD integration pattern:

# Example GitHub Actions workflow integration
- name: Run AI Safety Evaluation
  uses: azure/ai-evaluation@v1
  with:
    target-endpoint: ${{ secrets.AI_APP_ENDPOINT }}
    evaluation-type: adversarial
    risk-categories: "jailbreak,indirect-injection,content-safety"
    max-simulations: 500
    severity-threshold: medium
    fail-on-threshold: true

The gating logic:

  • Hard gate: If any high-severity vulnerability is detected, the deployment fails. Non-negotiable for public-facing applications.
  • Soft gate: If vulnerability rates exceed defined thresholds, the deployment proceeds but generates alerts for security review. Appropriate for internal tools with limited blast radius.
  • Monitoring gate: Evaluation runs but does not block deployment. Results feed into dashboards for trend analysis. Useful during initial adoption when baseline metrics are being established.

The practical challenge the lab acknowledged: Automated red teaming evaluations take time. A comprehensive evaluation with 1,000 adversarial prompts against a deployed endpoint might take 15-30 minutes depending on the target application's response time. This is too slow for a fast-feedback CI/CD pipeline where developers expect results in minutes.

The recommended tiered approach:

  • PR checks: Run a lightweight evaluation (100 prompts, high-priority categories only) on every pull request. Fast enough for developer feedback loops.
  • Staging deployment: Run a comprehensive evaluation (1,000+ prompts, all categories) on staging deployments. This is the quality gate before production.
  • Production monitoring: Run continuous evaluations against production at regular intervals (daily or weekly) to detect regressions from model updates, data changes, or prompt modifications.
# Example: Configuring evaluation thresholds for CI/CD
evaluation_config = {
    "thresholds": {
        "jailbreak_success_rate": 0.05,      # Max 5% jailbreak success
        "content_safety_defect_rate": 0.02,   # Max 2% safety failures
        "information_disclosure_rate": 0.01,  # Max 1% info disclosure
    },
    "severity_gate": "medium",  # Block on medium+ severity
    "min_simulations": 500,     # Minimum prompts for statistical significance
}

This tiered model mirrors how mature organisations handle traditional security testing: fast scans in development, thorough assessments before release, continuous monitoring in production. The principle is not new; the application to AI systems is.


The attack techniques that worked

The lab demonstrated several attack categories with live results. The most instructive were the attacks that succeeded against models with safety system messages in place, because these represent the realistic threat surface for production applications.

Technique 1: Persona hijacking

Rather than asking the model to ignore its instructions directly, the attacker asks it to role-play as a different AI assistant with no safety constraints. Multi-turn variants of this attack, where the role-play scenario is established over several exchanges before the adversarial question, were remarkably effective. The model appears to "forget" its system prompt when deeply embedded in a fictional context.

Technique 2: Encoding obfuscation

Wrapping adversarial prompts in base64 encoding, leetspeak, or character substitution bypasses many pattern-matching safety filters. The model decodes the input, processes it normally, and generates a response to the decoded (adversarial) content. This technique exploits the gap between input filtering and semantic understanding. If your safety layer operates on the raw text but the model processes the decoded meaning, the safety layer is blind.

Technique 3: Task framing

Asking the model to produce harmful content framed as educational material, fiction writing, or security research. "Write a story where the character explains how to..." is a classic variant that models frequently comply with because the framing appears benign. The lab showed this working consistently across safety categories.

Technique 4: Indirect injection via RAG

For applications using retrieval-augmented generation, embedding adversarial instructions in documents that the RAG pipeline retrieves. The model treats the retrieved content as authoritative context and follows the embedded instructions. This attack is particularly dangerous because it originates from the application's own data store, and the user sending the query may be entirely innocent. The malicious actor is whoever poisoned the data.

The defensive pattern that emerged: No single defence is sufficient. Effective protection requires layered controls:

  1. Input filtering catches simple direct attacks
  2. System prompt reinforcement reduces persona hijacking effectiveness
  3. Output filtering catches harmful content regardless of how it was generated
  4. Runtime guardrails monitor tool calls and data flow for anomalous patterns
  5. Continuous red teaming ensures defences remain effective as attacks evolve

This is defence in depth applied to AI systems. The concept is familiar to anyone who has built security architectures for traditional applications. The implementation details are different, but the principle is identical.


What automated red teaming cannot do

The lab was honest about the limitations of automated approaches, and this honesty was refreshing compared to the marketing-heavy presentations elsewhere at Ignite.

It cannot assess business-logic abuse. Automated red teaming probes for generic safety and security failures. It does not understand your specific application's business context. If your AI assistant should never recommend competitor products, or should always include a regulatory disclaimer, or should refuse to process transactions above a certain value, those are business rules that require custom evaluation criteria. The standard risk taxonomy does not cover them.

It cannot replace human judgment on edge cases. Automated systems classify responses as safe or unsafe based on defined criteria. Edge cases that require contextual judgment, cultural sensitivity, or domain expertise still need human evaluation. Automated red teaming identifies the obvious failures; human red teamers identify the subtle ones.

It cannot guarantee completeness. Adversarial AI is an arms race. Today's automated attacks may miss vulnerabilities that tomorrow's attack techniques will find. Automated red teaming reduces risk; it does not eliminate it. Any organisation that treats a passing evaluation as proof of safety is misunderstanding the exercise.

It does not address multi-agent vulnerabilities. The lab evaluated single models in isolation. Production systems increasingly involve multiple models in orchestration, where vulnerabilities emerge from model interactions rather than individual model weaknesses. An agent that is safe in isolation may become unsafe when another agent feeds it adversarial content.

The cost question nobody asked

Running thousands of adversarial prompts against deployed models costs real money in API calls. The lab did not address the economics of continuous red teaming, which will be a significant consideration for organisations with large model portfolios. A comprehensive evaluation against a GPT-4-class model could cost hundreds of pounds per run. Multiply that by multiple applications and daily evaluations, and the cost becomes a line item that needs budget approval.


The verdict

LAB516 delivered the most practically useful content of any session I attended at Ignite 2025. The hands-on format forced engagement with the material in a way that presentation sessions cannot match, and the content was technically rigorous without being inaccessible.

The core message is correct and important: if you are not red teaming your AI applications before production deployment, you are accepting risks you have not measured. Azure AI Foundry's evaluation tooling makes automated red teaming accessible to development teams without dedicated security expertise, and the CI/CD integration pattern makes it sustainable as a continuous practice rather than a one-off assessment.

The gap is between the lab's controlled environment and production reality. Lab exercises use well-defined target applications with known vulnerabilities. Production AI systems are complex, multi-model, multi-tool architectures where the attack surface is significantly larger and less well-understood. The tooling demonstrated in the lab is necessary but not sufficient for comprehensive AI security.

For platform engineering teams, the immediate action is to integrate adversarial evaluation into the deployment pipeline for any AI application facing production traffic. Start with the standard risk taxonomy. Establish baseline metrics. Then refine thresholds and add custom risk categories as your understanding of your application's vulnerability profile matures.

The organisations that build red teaming into their AI development workflow now will ship safer systems. The organisations that treat AI safety as a post-deployment concern will learn the same lessons, but from production incidents rather than controlled evaluations. I know which group I would rather be in.


What to watch

Attack technique evolution: As models become more robust against current attacks, adversarial research will produce new techniques. Watch for tooling updates that incorporate emerging attack vectors, particularly in multi-agent and tool-use scenarios.

Regulatory requirements for AI testing: The EU AI Act and emerging US frameworks are likely to mandate pre-deployment safety testing for high-risk AI applications. Automated red teaming may shift from best practice to compliance requirement, which will change the cost-benefit calculation entirely.

Multi-agent red teaming: As applications move from single-model to multi-agent architectures, red teaming methodology needs to evolve. Testing individual agents in isolation misses vulnerabilities that emerge from agent interactions. This is the frontier where current tooling is weakest.

Cost optimisation for continuous evaluation: Running thousands of adversarial prompts against production endpoints is expensive. Watch for sampling strategies, cached evaluation approaches, and tiered testing frameworks that reduce cost without sacrificing coverage.


Related Coverage:


Previous
Human-Agent Collaboration
Built: Mar 13, 2026, 12:43 PM PDT
80d1fe5