Windows ML and NPU Acceleration — Building Smarter Apps

LAB584-R1 was the most hardware-focused lab at Ignite 2025, and possibly the most revealing about Microsoft's long-term strategy for AI inference. Whilst every other session assumed cloud inference via Azure OpenAI endpoints, this lab had participants running neural network models locally on laptop NPUs — no cloud, no API calls, no per-token billing. The lab covered Windows ML (now generally available), NPU execution providers, model compilation for specific hardware targets, and local inference optimisation. The practical takeaway: NPU acceleration works, the developer experience is improving, and the economics of edge AI are starting to make sense for specific workload profiles.

Session: LAB584-R1 — Windows ML and NPU Acceleration: Building Smarter Apps
Date: Thursday, Nov 20, 2025
Location: Moscone South, Level 3, Room 308


What is an NPU and why should you care

A quick grounding for those who have heard the term but not investigated the hardware.

CPU: General-purpose processor. Can run any computation but is not optimised for any specific type. Neural network inference on a CPU works but is slow.

GPU: Designed for parallel computation on large matrices. Excellent for neural network training and inference. Power-hungry, expensive, and overkill for many inference workloads.

NPU (Neural Processing Unit): A processor designed specifically for neural network inference. Optimised for the matrix multiplications, convolutions, and activation functions that constitute neural network computation. Lower power consumption than a GPU, faster than a CPU for neural network workloads, and increasingly built into consumer laptops.

The strategic context: Microsoft's Copilot+ PC initiative requires devices to have NPUs capable of at least 40 TOPS (trillion operations per second). This means every new Windows PC sold under the Copilot+ brand has dedicated AI inference hardware. Windows ML is the software layer that lets developers target that hardware.

Why this matters for enterprise: If every employee laptop has an NPU, certain AI workloads can run locally without cloud API calls. This affects latency (local inference is faster than round-trips to Azure), cost (no per-inference billing), privacy (data stays on device), and availability (inference works offline).


Windows ML architecture: How the pieces fit

The lab started with a clear architecture walkthrough before any code.

The layers:

Application layer: Your code. C++, C#, or any language that can call the Windows ML API via WinRT interop.

Windows ML API: The abstraction layer. You submit a model and input data. Windows ML handles execution provider selection, model compilation, memory management, and inference execution.

Execution Providers (EPs): Hardware-specific backends that translate model operations into hardware instructions. Each EP targets a specific processor type.

Hardware: CPU, GPU, or NPU. The execution provider handles the translation from model operations to hardware-specific instructions.

┌──────────────────────────┐
│    Your Application      │
├──────────────────────────┤
│    Windows ML API        │
├──────────────────────────┤
│   Execution Providers    │
│  ┌────┐ ┌────┐ ┌─────┐   │
│  │CPU │ │GPU │ │ NPU │   │
│  │ EP │ │ EP │ │ EP  │   │
│  └────┘ └────┘ └─────┘   │
├──────────────────────────┤
│    Hardware              │
│  ┌────┐ ┌────┐ ┌─────┐   │
│  │CPU │ │GPU │ │ NPU │   │
│  └────┘ └────┘ └─────┘   │
└──────────────────────────┘

The key design decision: Execution providers are dynamically downloadable. Your application does not ship with hardware-specific inference code. At runtime, Windows ML detects the available hardware, downloads the appropriate execution provider, and uses it. When hardware vendors release updated EPs with performance improvements, your application benefits without redeployment.

// Dynamically selecting the best available execution provider
using Microsoft.Windows.AI.MachineLearning;

var model = await LearningModel.LoadFromFilePathAsync("model.onnx");

// Let Windows ML choose the best EP for available hardware
var device = new LearningModelDevice(LearningModelDeviceKind.Default);

// Or explicitly request NPU if available
var npuDevice = new LearningModelDevice(LearningModelDeviceKind.DirectXNpu);

var session = new LearningModelSession(model, npuDevice);

The lab exercises: Image classification on NPU

The lab's practical exercises built an image classification application that ran inference on the device's NPU.

Exercise 1: CPU baseline

Participants loaded a MobileNetV2 image classification model (ONNX format) and ran inference on the CPU. This established a performance baseline.

// CPU baseline inference
var model = await LearningModel.LoadFromFilePathAsync("mobilenetv2.onnx");
var cpuDevice = new LearningModelDevice(LearningModelDeviceKind.Cpu);
var session = new LearningModelSession(model, cpuDevice);

// Prepare input image
var inputImage = await StorageFile.GetFileFromPathAsync("test_image.jpg");
var videoFrame = await CreateVideoFrameFromFile(inputImage);

var binding = new LearningModelBinding(session);
binding.Bind("input", videoFrame);

// Run inference and measure time
var stopwatch = Stopwatch.StartNew();
var results = await session.EvaluateAsync(binding, "inference_run");
stopwatch.Stop();

var output = results.Outputs["output"] as TensorFloat;
var topPrediction = GetTopPrediction(output);

Console.WriteLine($"Prediction: {topPrediction.Label}");
Console.WriteLine($"Confidence: {topPrediction.Score:P1}");
Console.WriteLine($"Inference time (CPU): {stopwatch.ElapsedMilliseconds}ms");

CPU baseline results on lab machines (Snapdragon X Elite):

  • MobileNetV2 inference: ~45ms per image
  • ResNet50 inference: ~180ms per image
  • Batch of 10 images: ~1,700ms total

Exercise 2: NPU acceleration

The same model, same images, but running on the NPU execution provider.

// NPU-accelerated inference
var model = await LearningModel.LoadFromFilePathAsync("mobilenetv2.onnx");
var npuDevice = new LearningModelDevice(LearningModelDeviceKind.DirectXNpu);
var session = new LearningModelSession(model, npuDevice);

// Same binding and evaluation code — only the device changed
var binding = new LearningModelBinding(session);
binding.Bind("input", videoFrame);

var stopwatch = Stopwatch.StartNew();
var results = await session.EvaluateAsync(binding, "npu_inference_run");
stopwatch.Stop();

Console.WriteLine($"Inference time (NPU): {stopwatch.ElapsedMilliseconds}ms");

NPU results on lab machines:

  • MobileNetV2 inference: ~8ms per image (5.6x faster than CPU)
  • ResNet50 inference: ~35ms per image (5.1x faster)
  • Batch of 10 images: ~320ms total (5.3x faster)

The first-run penalty: The initial NPU inference took significantly longer (~500ms for MobileNetV2) due to model compilation for the NPU target. Subsequent inferences used the compiled model and achieved the ~8ms performance. This is important for application design — you need to warm up the model before latency-sensitive workloads.
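A simple mitigation is to run one throwaway inference at startup, before any latency-sensitive path. A minimal warm-up sketch using the same types as the lab code (the `WarmUpAsync` helper and the dummy-frame argument are illustrative, not from the lab materials):

```csharp
// Absorb the one-off NPU model compilation (~500ms) at application
// startup so it never lands on a user-facing code path.
static async Task WarmUpAsync(LearningModelSession session, VideoFrame dummyFrame)
{
    var binding = new LearningModelBinding(session);
    binding.Bind("input", dummyFrame);   // any correctly-shaped input will do

    // The first evaluation triggers compilation for the NPU target; the
    // result is discarded. Subsequent inferences hit the compiled path.
    await session.EvaluateAsync(binding, "warmup_run");
}
```

Calling this once during application initialisation means the first real inference already sees the ~8ms steady-state latency.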

Exercise 3: Model compilation for NPU targets

This was the most technically interesting exercise. Rather than letting Windows ML compile the model at runtime, participants pre-compiled models for the specific NPU hardware.

// Pre-compiling a model for a specific NPU target
using Microsoft.Windows.AI.MachineLearning;

var compiler = new ModelCompiler();

var compilationOptions = new CompilationOptions
{
    TargetDevice = CompilationTarget.Npu,
    OptimizationLevel = OptimizationLevel.Maximum,
    QuantizationMode = QuantizationMode.Int8  // Quantise for NPU efficiency
};

// Compile and save the optimised model
var compiledModel = await compiler.CompileAsync(
    "mobilenetv2.onnx",
    compilationOptions
);

await compiledModel.SaveAsync("mobilenetv2_npu_compiled.bin");

Pre-compiled model results:

  • MobileNetV2 inference: ~5ms per image (9x faster than CPU, 1.6x faster than runtime compilation)
  • No first-run penalty — model is already compiled for the target hardware
  • Model file size reduced by ~40% through INT8 quantisation

The trade-off: Pre-compiled models are hardware-specific. A model compiled for Snapdragon X Elite's NPU will not run on an Intel Meteor Lake NPU. You need to compile separate versions for each target hardware or rely on runtime compilation for portability.
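One way to keep the pre-compilation benefit without giving up portability is to ship the portable ONNX model alongside any pre-compiled variants and choose at load time. A sketch of that selection logic — the file-naming scheme and the `GetNpuArchitectureId` helper are assumptions, not part of the Windows ML API:

```csharp
// Prefer a model pre-compiled for this machine's NPU; otherwise fall
// back to the portable ONNX model and accept the runtime-compilation cost.
static async Task<LearningModel> LoadBestModelAsync(string baseName)
{
    // Hypothetical helper returning a hardware identifier (one per NPU
    // vendor/generation), obtained from device enumeration in practice.
    string npuArch = GetNpuArchitectureId();

    string compiledPath = $"{baseName}_{npuArch}_compiled.bin";
    if (File.Exists(compiledPath))
    {
        // Pre-compiled for this hardware: no first-run penalty.
        return await LearningModel.LoadFromFilePathAsync(compiledPath);
    }

    // Portable fallback: Windows ML compiles for the local NPU at runtime.
    return await LearningModel.LoadFromFilePathAsync($"{baseName}.onnx");
}
```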

The quantisation question: INT8 quantisation reduced the model size and improved inference speed, but accuracy dropped measurably. The lab's MobileNetV2 classification accuracy went from 91.2% (FP32) to 89.7% (INT8). For the lab's image classification task, this 1.5-percentage-point accuracy loss was acceptable. For safety-critical applications, it might not be.
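Before committing to a quantised model, it is worth measuring that drop on your own data rather than relying on published figures. A sketch of the comparison — the labelled test set and the reuse of the earlier `GetTopPrediction` helper are assumptions:

```csharp
// Measure top-1 accuracy for a given model. Run once with the FP32 model,
// once with the INT8-compiled model, and compare the two results.
static async Task<double> MeasureAccuracyAsync(
    string modelPath,
    IReadOnlyList<(VideoFrame Frame, string Label)> testSet)
{
    var model = await LearningModel.LoadFromFilePathAsync(modelPath);
    var session = new LearningModelSession(model,
        new LearningModelDevice(LearningModelDeviceKind.DirectXNpu));

    int correct = 0;
    foreach (var (frame, label) in testSet)
    {
        var binding = new LearningModelBinding(session);
        binding.Bind("input", frame);
        var results = await session.EvaluateAsync(binding, "accuracy_run");
        if (GetTopPrediction(results.Outputs["output"] as TensorFloat).Label == label)
            correct++;
    }
    return (double)correct / testSet.Count;  // e.g. 0.912 (FP32) vs 0.897 (INT8)
}
```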


Performance analysis: Where NPU acceleration matters

The lab results tell a clear story about where NPU acceleration changes the calculus and where it does not.

Where NPU wins convincingly

High-frequency inference on small models: Applications that run inference continuously — real-time video classification, ambient audio processing, keystroke pattern analysis — benefit enormously from NPU acceleration. The 5-8ms inference time on MobileNetV2 means ~125-200 inferences per second without impacting CPU performance.

Battery-constrained scenarios: NPU inference uses significantly less power than CPU inference for the same workload. The lab did not provide precise power measurements, but Qualcomm's published specifications for the Snapdragon X Elite NPU show approximately 4x better performance-per-watt compared to CPU inference.

Privacy-sensitive workloads: Any inference that processes sensitive data — document classification, facial recognition for authentication, health data analysis — benefits from running locally rather than sending data to a cloud endpoint. NPU acceleration makes local inference fast enough to be a viable alternative to cloud inference for many models.

Where NPU acceleration is less compelling

Large language models: The lab's NPU workloads were image classification models (1-25 million parameters). Current laptop NPUs cannot efficiently run large language models (7 billion+ parameters): their memory capacity and bandwidth are insufficient for LLM inference. This is why Copilot+ PCs still use cloud inference for Copilot chat.

Training: NPUs are optimised for inference, not training. Fine-tuning or training models locally still requires a GPU.

Infrequent inference: If your application runs inference once per user action (e.g., classifying a document when the user uploads it), the difference between 45ms (CPU) and 8ms (NPU) is imperceptible to the user. NPU acceleration matters for continuous inference, not occasional inference.

Models that do not fit the NPU architecture: Some model architectures map poorly to NPU hardware. The lab did not explore which architectures work well and which do not, but in general, convolutional neural networks and transformer attention blocks map well to NPUs, whilst architectures with irregular memory access patterns or dynamic computation graphs are less suited.


The developer experience: Honest assessment

What works well:

The abstraction layer is effective. Switching from CPU to NPU inference required changing a single line of code (the device kind). The Windows ML API successfully abstracts the hardware complexity. A developer does not need to understand NPU architecture to use NPU acceleration.

ONNX as the model format is the right choice. ONNX is an open format supported by all major ML frameworks (PyTorch, TensorFlow, JAX). Models trained in any framework can be exported to ONNX and run through Windows ML. This avoids the framework lock-in that plagues other edge inference platforms.

Dynamic EP download is genuinely useful. Applications do not need to ship hardware-specific code. Windows ML downloads the right execution provider at runtime. When Qualcomm or Intel release improved EPs, applications benefit automatically.

What needs improvement:

Error messages are unhelpful. When a model fails to compile for the NPU, the error message is typically a generic "compilation failed" that does not explain which operations are unsupported, which layers caused the failure, or how to modify the model for compatibility. Lab participants who hit compilation errors spent significant time debugging without clear guidance.

Debugging tools are primitive. There is no equivalent of GPU profiling tools (NVIDIA Nsight, AMD Radeon GPU Profiler) for NPU workloads. You cannot visualise which layers are running on the NPU, identify bottlenecks, or understand memory utilisation patterns. This makes optimisation a trial-and-error exercise.

Documentation is sparse. The Windows ML documentation covers the API surface adequately but provides minimal guidance on model optimisation for NPU targets, quantisation strategies, or performance tuning. The lab filled this gap with hands-on guidance that is not available in public documentation.

Hardware variation is a real problem. The lab ran on identical hardware (Snapdragon X Elite laptops). In production, your application will encounter Snapdragon X, Intel Meteor Lake, AMD Ryzen AI, and future NPU architectures. Each has different capabilities, different supported operations, and different performance characteristics. Windows ML's abstraction layer handles this in theory, but the lab did not test across hardware variants.


The Copilot+ PC strategy: Hardware as moat

LAB584-R1 made sense as a standalone lab, but its strategic significance only becomes clear in the context of Microsoft's Copilot+ PC initiative.

The strategy: Require NPU hardware in Copilot+ branded PCs. Provide Windows ML as the inference layer. Make it easy for developers to build applications that leverage NPU hardware. Result: Copilot+ PCs run AI-powered applications that non-Copilot PCs cannot, creating a hardware upgrade cycle driven by AI capability rather than traditional performance improvements.

The enterprise angle: If your workforce has Copilot+ PCs, you can deploy AI-powered applications that run inference locally — document classification, real-time translation, meeting transcription, data anomaly detection — without per-inference cloud costs and without sending data to external services. This is a genuine enterprise value proposition, not just a consumer marketing story.

The competitive positioning: Apple has Neural Engine in every Mac and iPhone. Google has TPUs in Pixel devices and Tensor chips. Microsoft does not make the hardware but controls the software layer (Windows ML) that makes the hardware useful. By standardising the NPU requirement through Copilot+ PC certification and providing the development framework, Microsoft creates the developer ecosystem that makes NPU hardware valuable.

The risk for Microsoft: If developers do not build NPU-accelerated applications, the NPU hardware sits idle and the Copilot+ PC value proposition weakens. LAB584-R1 exists because Microsoft needs developers to actually use this hardware, and the current developer tooling is not yet compelling enough to drive adoption without hands-on training.


Does NPU acceleration change the economics of edge AI?

This is the question the lab raised but did not directly answer. Let me work through the economics.

The cloud inference cost baseline

Using Azure OpenAI as the reference point for an image classification workload:

  • GPT-4o with vision: ~$0.005 per image (input tokens for image + output tokens for classification)
  • Processing 1,000 images per day: ~$5 per day, ~$150 per month, ~$1,800 per year
  • Processing 10,000 images per day: ~$1,500 per month, ~$18,000 per year

The NPU inference cost

  • Hardware cost: Already included in the laptop purchase (NPU is standard in Copilot+ PCs)
  • Inference cost: Zero marginal cost — electricity is negligible
  • Development cost: One-time development of the Windows ML application
  • Model cost: Open-source models (MobileNet, ResNet, EfficientNet) are free

The comparison

For high-volume inference workloads (thousands of inferences per day), NPU acceleration eliminates recurring cloud costs entirely. The break-even point depends on the specific workload, but for applications processing more than a few hundred inferences per day, local NPU inference is dramatically cheaper than cloud inference.
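The arithmetic behind that claim, using the ~$0.005-per-image cloud price from the baseline above (the volume tiers are illustrative):

```csharp
// Recurring cloud billing avoided by moving inference on-device.
// Local NPU inference has effectively zero marginal cost per image.
const double cloudCostPerImage = 0.005;  // ~$0.005/image (GPT-4o vision, above)

foreach (int imagesPerDay in new[] { 300, 1_000, 10_000 })
{
    double monthly = cloudCostPerImage * imagesPerDay * 30;
    double annual  = cloudCostPerImage * imagesPerDay * 365;
    // 1,000/day → ~$150/month, ~$1,825/year — matching the figures above.
    Console.WriteLine(
        $"{imagesPerDay,6} images/day → ~${monthly:N0}/month, ~${annual:N0}/year avoided");
}
```

Break-even against the one-off development cost then depends on volume: the higher the daily inference count, the faster local inference pays for itself.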

The caveat: This comparison is only valid for workloads that can run on NPU-compatible models. If you need GPT-4o-level reasoning for your classification task, the NPU cannot help you. NPU acceleration is for well-defined inference tasks with models that fit the NPU's capability profile — classification, detection, segmentation, embedding generation, and similar computer vision and signal processing tasks.

The other caveat: Cloud inference provides model updates without redeployment. When OpenAI improves GPT-4o, your application benefits immediately. With local inference, model updates require retraining, re-exporting to ONNX, recompiling for the target NPU, and redeploying. The operational overhead of model lifecycle management for local inference is non-trivial.


What I want to see next

Cross-hardware benchmarks: Performance data across Snapdragon X Elite, Intel Meteor Lake, and AMD Ryzen AI NPUs, for the same models and workloads. This is essential for enterprise developers who need to support heterogeneous device fleets.

LLM inference on NPU: Even if full LLM inference is not feasible on current NPUs, smaller language models (1-3 billion parameters) might be. SLM inference on NPU for local document summarisation, code completion, or classification would be a compelling capability.

Power consumption data: Real measurements of NPU inference power consumption versus CPU inference, under sustained workloads. Battery life impact is a key factor for mobile enterprise deployments.

Production deployment guidance: How do you manage model versioning, compilation targets, and deployment across a fleet of devices with different NPU hardware? The lab covered single-device development; enterprise deployment is a different challenge entirely.

Hybrid local and cloud inference: A pattern where the application tries NPU inference first and falls back to cloud inference for models or workloads that exceed the NPU's capabilities. This hybrid pattern would make NPU acceleration practical for a broader range of applications.
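That fallback pattern is straightforward to sketch with the same Windows ML types used in the lab code. `CallCloudInferenceAsync` is a stand-in for whichever Azure endpoint the application would use, and the exception-based failure detection is an assumption:

```csharp
// Try local NPU inference first; fall back to cloud inference when the
// model cannot be loaded or run on the local NPU.
static async Task<string> ClassifyAsync(VideoFrame frame)
{
    try
    {
        var model = await LearningModel.LoadFromFilePathAsync("mobilenetv2.onnx");
        var device = new LearningModelDevice(LearningModelDeviceKind.DirectXNpu);
        var session = new LearningModelSession(model, device);

        var binding = new LearningModelBinding(session);
        binding.Bind("input", frame);
        var results = await session.EvaluateAsync(binding, "hybrid_run");
        return GetTopPrediction(results.Outputs["output"] as TensorFloat).Label;
    }
    catch (Exception)  // e.g. NPU EP unavailable or model compilation failed
    {
        // Hypothetical helper wrapping the cloud endpoint call.
        return await CallCloudInferenceAsync(frame);
    }
}
```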


The verdict

LAB584-R1 demonstrated that NPU acceleration is real, practical, and ready for specific workload profiles. Image classification at 5-8ms per inference, with no cloud costs and no data leaving the device, is genuinely compelling for the right use cases.

The developer experience is functional but immature. The abstraction layer works, ONNX compatibility is excellent, and dynamic EP management is well-designed. But debugging tools, documentation, and cross-hardware testing support are all insufficient for production development today.

The economics are favourable for high-volume, well-defined inference workloads. If you are currently paying for cloud inference on classification, detection, or embedding tasks, NPU acceleration deserves serious evaluation.

The strategic picture is clear: Microsoft is building the developer ecosystem to make Copilot+ PC hardware valuable. Whether this succeeds depends on whether enough developers build NPU-accelerated applications to create a virtuous cycle between hardware capability and application value. LAB584-R1 is Microsoft's attempt to kick-start that cycle, and the technical foundation is solid enough to be worth the bet.

