NVIDIA AI Cluster Runtime

Tracked

NVIDIA AI Cluster Runtime (AICR) generates optimized, validated, and reproducible GPU-accelerated Kubernetes cluster configurations for AI training and inference.

Author NVIDIA Open Sourced 2026-01-30 Last Commit Unknown

Overview

NVIDIA AI Cluster Runtime (AICR) makes it easy to stand up GPU-accelerated Kubernetes clusters by capturing known-good combinations of drivers, operators, kernels, and system configurations as version-locked recipes. It generates reproducible deployment artifacts for Helm, Argo CD, Flux, and Helmfile, solving the hardest problem in AI infrastructure — environment consistency.

Key Features

  • Recipe engine generating version-locked GPU K8s configurations validated by NVIDIA.
  • Multi-deployer bundles for Helm, Argo CD, Flux, and Helmfile.
  • Multi-phase validation covering deployment, performance (training and inference), and conformance.
  • Drift detection comparing cluster snapshots to surface configuration changes.
  • Supply chain security with SLSA Level 3 provenance, signed SBOMs, and Cosign attestations.

Use Cases

  • Stand up validated GPU K8s clusters for AI training or inference in minutes.
  • Ensure reproducible GPU environments across teams and regions.
  • Detect and remediate configuration drift in production GPU clusters.
  • Integrate GPU infrastructure provisioning into CI/CD and GitOps pipelines.

Technical Details

  • Single CLI binary for full workflow: snapshot, recipe, bundle, validate, verify, diff.
  • Supports EKS, GKE, AKS, Kind, and more with H100, B200, GB200, A100 accelerators.
  • Composable overlay architecture: base defaults layered with cloud, accelerator, OS, and workload tuning.
  • Go SDK available for programmatic integration without subprocess or REST calls.