SteveITpro - Learning AI & Cloud in Public

Microsoft's pitch for enterprise AI has shifted from building individual agents to managing fleets of them—potentially thousands—across organizational boundaries. The Foundry Control Plane, announced at Ignite 2025, consolidates observability, security, and governance into one platform. The timing matters: whilst 81% of business leaders plan to integrate AI agents within 18 months, every metric of public trust in AI is deteriorating.

Sarah Bird, presenting at Microsoft Ignite's BRK205 session alongside Ken Archer and Florent Ricci, acknowledged the disconnect directly: "If we want everybody to use this technology successfully, they're gonna need to trust it." The session's focus on runtime controls, cross-platform visibility, and what Microsoft calls "defense in depth" represents an attempt to bridge that gap with technical architecture rather than marketing assurances.

The trust crisis Microsoft isn't mentioning in marketing materials

Session: BRK205 - AI Operations to own the fleet, master the mission in Microsoft Foundry Speakers: Sarah Bird, Ken Archer, Florent Ricci When: November 19, 2025, 9:00 AM PST Where: Moscone West, Level 2, Room 2016

Bird opened with data from the Work Trend Index: 81% of leaders expect to integrate agents into their AI strategy over the next 12-18 months. Impressive adoption numbers, until you hear the research she cited next.

A longitudinal study by KPMG and the University of Melbourne, tracking public perception since ChatGPT's launch, shows all trust metrics moving in the wrong direction:

People are more worried about AI systems
They perceive AI as less trustworthy than before
They're less willing to rely on AI

Bird framed this as "a huge challenge" for change management. The session's subtext: Microsoft needs to solve operational trust problems before enterprises deploy agent fleets at scale, or the 81% adoption projection becomes wishful thinking.

The three risks customers actually care about

The KPMG study identified three concerns that resonate with enterprise deployments:

1. Agents going off-task

The agent doesn't do what you intended. In a chat interface, this is annoying. In an agent with API access to your infrastructure, this could mean unintended actions with operational consequences.

2. Prompt injection attacks

New attack vector, specific to AI systems. Malicious instructions embedded in external content cause agents to behave unexpectedly. Bird called this the "top concern" for agents that can take actions—which is precisely what makes them useful.

3. Sensitive data leakage

The paradox: agents derive value from accessing your data, but that access creates exposure risk. One prompt injection attack or agent mistake, and confidential information leaves the organization.

Bird's reassurance: "This can seem daunting, but there's actually a lot we can do about it to build this technology the right way."

The question is whether "the right way" is technically sufficient, or whether it simply makes the risks more palatable to procurement departments.

Defense in depth: Extending identity controls to agents

Microsoft's strategy operates on two dimensions:

Dimension 1: Treat agents as entities, like users and devices

Organizations already secure users and devices with identity, access control, and governance frameworks. Microsoft's approach: extend these existing tools to agents.

Why this matters:

Agents get Entra Agent ID (verified identity and clear ownership), eliminating the confusion problem when managing hundreds of agents. They're governed by the same security planes as humans.

Dimension 2: New AI-specific defenses for new AI-specific risks

Prompt injection and task misalignment aren't addressed by traditional security. Microsoft developed targeted tools:

Prompt Shields: Detect and block adversarial input
PII Detection: Scan outputs for sensitive data before it leaves
Task Adherence Detection: Identify when agents deviate from intended behavior

Layer these AI-specific defenses with traditional security controls, and you get what Microsoft calls "defense in depth"—comprehensive protection against both conventional and novel AI risks.

Whether this architecture holds against real-world attacks at scale remains an operational question, not a marketing one.

The four essentials: Controls, observability, security, fleet operations

Bird outlined four requirements for trustworthy AI systems at scale:

1. Controls: Traditional boundaries plus runtime intervention

Two control types:

Traditional controls: Standard guardrails and pre-defined policies

Real-time runtime controls: Active monitoring during execution, dynamic enforcement as agents operate

The session demonstrated why runtime controls aren't optional—traditional guardrails that only check prompts and completions miss attacks embedded in tool calls and tool responses.

2. Observability: Moving humans from inner loop to outer loop

The dilemma:

In chat systems, humans review every output. Effective for quality, but defeats the purpose of agents—which derive value from autonomous operation over time.

Bird's framing: "We don't want humans reviewing every single step for every single action. The benefit of an agent is that it can run for a long time, potentially complete a bigger task autonomously."

The solution:

Move human oversight from inner loop (reviewing every action) to outer loop (defining behavior, validation steps, and monitoring).

Observability tools enable this shift:

Specify how agents should behave
Get ongoing assurance they're behaving correctly
Maintain oversight without reviewing every step
Enable autonomous operation for longer periods

The outcome: Human oversight plus agent autonomy. Whether this balance holds in practice depends on whether observability catches misbehavior before damage occurs.

3. Security: Trust is a team sport (developers can't do this alone)

Bird's critical insight:

"As a developer, there's a lot that needs to be done. But developers can't do it alone. Everybody has a role to play."

What's required:

Integration of security and IT teams into developer tools, not bolted on separately. This isn't organizational advice—it's architectural requirement.

If security and IT capabilities aren't embedded in the development workflow, collaboration fails and agents ship with gaps.

4. Fleet-wide operations: You can't manually test thousands of agents

Bird's challenge to "frontier firms":

"We're not just building one agent for one chat system anymore. If you want to be a frontier firm, you're gonna have thousands or maybe more agents."

The operational reality:

You can't spend time testing each one manually. Fleet-wide operations provide:

Intelligent prioritization: where to spend your time
Issue identification: where problems exist
Optimization opportunities: where improvements matter

This is what Foundry Control Plane is designed for: managing agent fleets at scale, not individual agents one at a time.

Live demo: Prompt injection attack exposes three risks simultaneously

Ken Archer demonstrated a real vulnerability in a kiosk agent built for Contoso—designed to answer product questions, grounded in a product catalogue.

The attack:

Archer embedded malicious instructions in the grounding content (a backdoor). He approached the kiosk and asked an innocent product question.

Agent response:

Instead of product information, the agent shared:

First names
Last names
PII data

Bird's reaction: "This is just an innocent product question. The agent should definitely not give me this information."

Three study risks in one attack

1. Indirect prompt injection

Malicious instructions hidden in content. The agent responded to the embedded backdoor.

2. PII leakage

Personally identifiable information available in a tool. The agent shared sensitive data inappropriately.

3. Off-task behavior

Archer wrote the system prompt specifically. The agent completely violated its intended task.

Important detail: The evaluation system successfully detected the attack—showing "indirect attack fail" in quick evaluations. Detection worked. The question is whether it would have triggered intervention before data leaked in production.

Foundry Control Plane guardrails: Monitoring tool calls, not just prompts

Archer demonstrated the developer experience for creating guardrails in Foundry Control Plane.

Current state: Risks with and without controls

Risks with controls (on by default):

Content safety
Jailbreak detection
Four common risks across most AI agents

Risks without controls:

Task drift
Sensitive data detection
Indirect prompt injection

Creating a new guardrail: Three controls

Control 1: Indirect Prompt Injection

Configuration:

Risk type: Indirect injection attacks
Intervention point: Tool response (not just prompts/completions)
Monitoring: User input AND tool responses
Action: Block

The architectural insight:

Traditional guardrails only check prompts and completions. Foundry Control Plane intervention points include tool calls and tool responses—monitoring agent behavior across the execution flow, not just content filtering.

Control 2: PII Detection

Configuration:

Risk type: Personally identifiable information
PII types: All (comprehensive)
Intervention points: Tool call AND tool response AND completion
Action: Block

Control 3: Task Adherence

Configuration:

Risk type: Off-task behavior
Intervention points: Adapt to selected risk
Focus: Controlling agent behavior, not just content
Action: Block when agent deviates from task

Inner loop and outer loop defense

The guardrails operate in both development and production:

Inner loop (development):

Quick evaluations during development
Detect issues before deployment
Iterative testing and refinement

Outer loop (production runtime):

Real-time monitoring
Runtime threat detection and mitigation
Continuous compliance validation

This is first-line defense that operates at runtime, adapting to how agents actually behave during execution.

Evaluating agents: "What's the weather?" has three failure points

Bird used a deceptively simple example to illustrate why agent evaluation requires multiple dimensions.

Query: "What's the weather?"

Three stages, three failure modes

Step 1: Understand intent

Agent must parse the question, determine information needed, identify required context.

Failure mode: Doesn't understand user intent, misinterprets question, fails to identify necessary information.

Success: Agent recognizes it needs current time, user location, and weather data.

Step 2: Call the right tools

Agent must look up location and time, query weather data, execute tools in correct sequence.

Failure mode (tool call accuracy): Picks wrong tool, calls tools in wrong order, passes incorrect parameters.

Step 3: Complete the task

Agent must process tool responses, synthesize information, provide useful answer.

Failure mode: Gets all the right information, calls the right tools, but doesn't successfully complete the task—fails to synthesize response correctly.

The evaluation challenge

These are sub-metrics that require monitoring. There are many more dimensions beyond simple success/failure.

Simple queries hide complex operational requirements. Multiply this across thousands of agents, and manual evaluation becomes impossible—hence the need for fleet-wide observability.

Cross-platform visibility: The "no matter where they are built" promise

Foundry Control Plane integrates with Agent 365 to provide:

Complete organizational view:

Agents built on Microsoft platforms
Agents built on third-party platforms
Agents built on other clouds

Unified operational view:

Observability and performance: Real-time monitoring across entire fleet, performance metrics unified across platforms

Cost management: Consolidated cost view regardless of underlying platform

Safety and security: Unified security posture delivered in one view

Agent controls: Trace exactly how agents behave—prompts, tools invoked, outputs produced

Integration with Microsoft security stack

Microsoft Purview: Policy enforcement across entire fleet. Sensitive data controls apply regardless of where agent runs. Compliance built into operations, not added later.

Microsoft Defender: Real-time defense, active threat protection, runtime security monitoring integrated directly into control plane.

Entra Agent ID: Verified identity and clear ownership for every agent. Solves identity chaos when managing hundreds of agents.

The promise: "Every agent, any platform, Foundry Control Plane."

The question: does cross-platform visibility extend to meaningful governance controls, or just dashboards?

Real-world deployment: Stanley Black & Decker's agent fleet

Florent Ricci from Stanley Black & Decker provided operational validation of the concepts Bird and Archer demonstrated.

The deployment scale

10,000 daily active users using Stanley Black & Decker's AI solutions, asking hundreds of thousands of questions every day.

The problem: Out-of-the-box LLMs hallucinate on product compatibility

Example: User asks if a DeWalt battery is compatible with a specific saw.

Generic Copilot (out-of-the-box):

One query says battery is compatible
Slightly different query (5-inch blade instead of unspecified) says not compatible
Both answers can't be correct

The root cause:

LLMs pull information from the internet, assemble it statistically, and provide the most probable answer. No source of truth. They can hallucinate because:

Competing information online
No grounding in actual product specifications
Statistical plausibility doesn't equal correctness

The solution: Grounded agent with access to source of truth

"Rosche" - Stanley Black & Decker's internal agent:

Connected to authoritative product data
Accesses correct compatibility information
Provides answers grounded in actual specifications

The result: Correct answers to product questions, which drives trust, which drives adoption.

Ricci's observation: "If the AI agent does not provide good answers, people will not use it and the adoption rate will fall. This is probably one of the reasons why we hear a lot about projects that don't escape the POC."

The iceberg: What it takes to get right answers

Visible (above water): The AI agent answering questions

Invisible (below water): The infrastructure required for trustworthy answers

Ricci outlined four critical points:

1. Fresh data, not just good data

The operational challenge:

Stanley Black & Decker introduces 1,000 new products every year. Every single week, new products launch and old products discontinue.

What this means for agents:

Rosche must know about new products immediately
Rosche must know when products are discontinued
Data must be continuously indexed and refreshed
Monitoring required to ensure indexing pipeline stays current

Fresh data isn't optional—it's operational.

2. The secret sauce: Balancing accuracy, latency, and cost

The trade-off:

Initially, Rosche gave detailed answers but took 1-2 minutes. Customers on the phone can't wait that long—they need answers in 5-10 seconds.

Optimization required:

How much data to retrieve
Different indexing strategies
Different retrieval strategies
Filtering to retrieve only most relevant information
Keep prompts small → faster responses → lower cost

Ricci's framing: "You need to find the right balance between accuracy, latency, and cost. To find the secret sauce with all of that, it's not very easy. A lot of iteration and a lot of tests."

3. Observability: Before and after deployment

Pre-deployment evaluation:

Run evaluations on:

Groundedness
Relevance
Other quality metrics

Post-deployment continuous monitoring:

Keep measuring that hallucination rate is not going up.

Why continuous monitoring matters:

Data keeps moving
New skills released for agents
Can lead to bad reasoning
Can lead to regressions

Ricci's emphasis: "It's absolutely critical to look at the metrics, to run evaluations before you go out, but also after."

4. The human element: Training users to fact-check

Stanley Black & Decker spends significant time training end users to fact-check information:

Open the source
Double-check: Is it the right brand? The right SKU?
Gain confidence that the answer is correct

The shift in work:

Ricci on Sarah Bird's earlier comment about inner/outer loop:

"This is really what we are starting to think—the new way of working. Today we are augmenting our users, but tomorrow, probably they will be more monitoring our agents to look at the questions they answer well, questions they don't answer well. And maybe there is a pattern of questions where we need a new knowledge article somewhere. I think that is going to change the way our employees are going to work."

The operational takeaway

Stanley Black & Decker's deployment validates Microsoft's claims about the four essentials:

Controls: Grounding in authoritative data prevents hallucinations Observability: Continuous monitoring catches hallucination rate increases Security: Data freshness and indexing pipeline integrity Fleet operations: 10,000 daily users at scale requires the infrastructure Bird described

But Ricci also exposed what the controlled demos didn't show: the operational effort below the waterline—fresh data pipelines, accuracy/latency/cost trade-offs, continuous monitoring, and user training.

What wasn't demonstrated

Several operational questions remain:

1. Performance overhead of runtime controls

Real-time monitoring across tool calls, tool responses, and completions adds latency. The session didn't address performance impact at scale.

2. False positive rates

Aggressive PII detection and task adherence monitoring will trigger false positives. How often do legitimate agent actions get blocked? What's the operational cost of tuning guardrails?

3. Multi-agent orchestration failures

Demonstrations showed individual agent vulnerabilities. What happens when thousands of agents interact? Emergent behavior across agent fleets wasn't addressed.

4. Audit and compliance reporting

Observability was framed around developer and security teams. What about compliance teams needing audit trails for regulatory requirements? The session didn't cover reporting capabilities.

5. Incident response workflows

When guardrails detect an attack, what happens next? The demonstration showed detection. The operational playbook for response wasn't detailed.

The honest assessment

Foundry Control Plane addresses real operational problems:

What's genuinely useful:

Cross-platform visibility: If it delivers on the promise, managing agents across Microsoft, third-party, and cloud platforms from one control plane solves a real fragmentation problem.

Runtime intervention points: Monitoring tool calls and tool responses, not just prompts and completions, targets where attacks actually occur in agent workflows.

Inner/outer loop separation: Moving human oversight from reviewing every action to defining behavior and monitoring frees agents to operate autonomously whilst maintaining governance.

Integration with existing security stack: Extending Entra, Purview, and Defender to agents treats them as first-class entities, not afterthoughts.

What's concerning:

Trust metrics declining: The KPMG study Bird cited shows public trust deteriorating whilst Microsoft accelerates agent deployment. Technical controls don't address perception problems.

False positive operational cost: The session didn't address how often legitimate agent behavior gets blocked, or the effort required to tune guardrails for production.

Performance overhead unclear: Runtime monitoring adds latency. How much, and at what scale, wasn't specified.

"Frontier firm" positioning: Framing thousands of agents as inevitable creates pressure to deploy at scale before operational maturity. Not every organization needs to be a "frontier firm."

Detection isn't prevention: The kiosk demo showed successful attack detection. It didn't show the attack being stopped before PII leaked. Detection after exposure isn't sufficient for compliance.

The verdict

Foundry Control Plane represents Microsoft's attempt to make agent fleet management operationally viable. The architecture is sound: runtime controls at intervention points, cross-platform visibility, integration with existing security infrastructure, and observability that enables outer-loop human oversight.

Whether it's sufficient depends on operational questions the session didn't answer: performance overhead, false positive rates, incident response workflows, and audit capabilities.

The larger question is whether technical controls can restore trust that Bird's own research shows is declining. Organizations deploying thousands of agents will need more than guardrails—they'll need operational evidence that the controls work at scale, under adversarial conditions, before trust recovers.

Microsoft has built the infrastructure. The operational proof remains to be demonstrated outside controlled conference demos.

What to watch

Real-world deployment data: Independent validation of guardrail effectiveness beyond Microsoft's internal deployments and conference demonstrations.

Performance benchmarks: Actual latency impact of runtime controls at scale, particularly for high-throughput agent workloads.

False positive rates: Operational data on how often legitimate agent behavior gets blocked, and the effort required to tune controls for production use.

Cross-platform governance depth: Whether "every agent, any platform" extends to meaningful policy enforcement, or just observability dashboards.

Compliance and audit capabilities: How Foundry Control Plane addresses regulatory requirements for agent audit trails and compliance reporting.

Multi-agent orchestration: How controls operate when thousands of agents interact, and whether emergent behavior creates new attack surfaces.

Trust metric evolution: Whether technical controls actually restore public trust, or whether the KPMG trends continue deteriorating despite operational improvements.

Learn More

Official Resources:

Managing GenAI Lifecycles

Microsoft Learn Documentation:

Technical Blog Posts:

Related Coverage: