AI Training in Kubernetes
Bringing Together the AI/ML Kubernetes Community
At KubeCon North America 2024, Frederick Kautz co-led a Birds-of-a-Feather (BoF) session on AI training in Kubernetes with Ricardo Rocha from CERN and Alex Scammon from G-Research. The interactive session tackled the real-world challenges of running large-scale AI training workloads on Kubernetes.
This was not a typical conference talk: BoF sessions are interactive discussions in which practitioners share experiences, challenges, and solutions. The combination of perspectives from TestifySec (security and compliance), CERN (massive-scale scientific computing), and G-Research (financial modeling) created a rich dialogue about the state of AI infrastructure.
Why This Discussion Matters
As organizations rush to adopt AI, many discover that Kubernetes, while excellent for traditional microservices, was not originally designed for the demands of ML workloads. Training jobs that run for days or weeks, require expensive GPU resources, and process terabytes of data push Kubernetes to its limits. This session captured the community's collective experience in overcoming these challenges.
What emerged was a candid discussion of the gaps between what ML engineers need and what Kubernetes currently provides, along with the solutions being developed by organizations that run AI infrastructure at scale.
Key Takeaways
Kubernetes needs specialized schedulers and operators to handle the unique requirements of AI training workloads
GPU resource management remains one of the biggest challenges, requiring careful orchestration and sharing strategies
Storage performance and data locality are critical bottlenecks that can make or break AI training efficiency
Multi-tenancy for AI workloads requires new isolation strategies beyond traditional Kubernetes namespaces
The community is converging on common patterns for distributed training, but standardization is still evolving
Security considerations for AI training include data access controls, model IP protection, and compute isolation
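To make the GPU and namespace takeaways concrete, here is a minimal sketch of what a GPU training request looks like at the manifest level. It is not taken from the session. The function name gpu_training_job, the namespace ml-team-a, and the image name are hypothetical, and the nvidia.com/gpu resource name assumes the NVIDIA device plugin is installed on the cluster. The sketch builds a standard batch/v1 Job as a Python dictionary and prints it as JSON.

```python
import json


def gpu_training_job(name: str, image: str, gpus: int,
                     namespace: str = "ml-team-a") -> dict:
    """Build a batch/v1 Job manifest for one training pod that requests whole GPUs.

    All names here are illustrative; "nvidia.com/gpu" assumes the NVIDIA
    device plugin exposes GPUs as an extended resource on the cluster.
    """
    if gpus < 1:
        raise ValueError("a GPU training job needs at least one GPU")
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        # The namespace is the basic Kubernetes tenancy boundary.
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Do not retry automatically: a failed multi-day run is usually
            # restarted deliberately from a checkpoint.
            "backoffLimit": 0,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "trainer",
                        "image": image,
                        # Extended resources such as GPUs are requested as
                        # whole units under "limits".
                        "resources": {"limits": {"nvidia.com/gpu": gpus}},
                    }],
                }
            },
        },
    }


if __name__ == "__main__":
    job = gpu_training_job("demo-train", "example.com/trainer:latest", gpus=2)
    print(json.dumps(job, indent=2))
```

The printed JSON could be saved to a file and submitted with kubectl apply -f. A plain Job like this claims whole GPUs for a single pod, which is one reason the takeaways above point to specialized schedulers, sharing strategies, and stronger isolation for larger training workloads.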
Watch the Full Presentation
45 minutes of insights on AI
The video recording for this talk is not yet available. Conference recordings are typically posted within a few weeks after the event.
Check back soon or follow TestifySec on social media for updates when the video becomes available.
Session Leaders
Frederick Kautz
Director of R&D, TestifySec
Fred brings deep expertise in Kubernetes security and software supply chain protection to the AI/ML space, focusing on how to secure and verify AI training pipelines at scale.
Ricardo Rocha
Engineer, CERN
Ricardo works on CERN's Kubernetes infrastructure, which supports both traditional physics simulations and ML workloads for particle physics research. His team manages one of the world's largest scientific computing clusters.
Alex Scammon
Engineer, G-Research
Alex leads infrastructure development at G-Research, where Kubernetes powers quantitative research and trading systems. His work focuses on high-performance computing and ML infrastructure for financial modeling.