October 2024 • KubeCon North America BoF • 45 minutes

AI Training in Kubernetes

Frederick Kautz, TestifySec

Co-presented with Ricardo Rocha, CERN; Alex Scammon, G-Research

Bringing Together the AI/ML Kubernetes Community

At KubeCon North America 2024, Frederick Kautz co-led a Birds-of-a-Feather (BoF) session that brought together practitioners working on AI training in Kubernetes. He was joined by Ricardo Rocha from CERN and Alex Scammon from G-Research, and the interactive session tackled the real-world challenges of running large-scale AI training workloads on Kubernetes.

This wasn't your typical conference talk: BoF sessions are interactive discussions in which practitioners share experiences, challenges, and solutions. The combination of perspectives from TestifySec (security and compliance), CERN (massive-scale scientific computing), and G-Research (financial modeling) created a rich dialogue about the state of AI infrastructure.

Why This Discussion Matters

As organizations rush to adopt AI, many discover that Kubernetes, while excellent for traditional microservices, wasn't originally designed for the unique demands of ML workloads. Training jobs that run for days or weeks, require expensive GPU resources, and process terabytes of data push Kubernetes to its limits. This session captured the community's collective wisdom on overcoming these challenges.

What emerged was a candid discussion about the gaps between what ML engineers need and what Kubernetes currently provides, along with innovative solutions being developed by organizations at the cutting edge of AI infrastructure.

Key Takeaways

Kubernetes needs specialized schedulers and operators to handle the unique requirements of AI training workloads

GPU resource management remains one of the biggest challenges, requiring careful orchestration and sharing strategies

Storage performance and data locality are critical bottlenecks that can make or break AI training efficiency

Multi-tenancy for AI workloads requires new isolation strategies beyond traditional Kubernetes namespaces

The community is converging on common patterns for distributed training, but standardization is still evolving

Security considerations for AI training include data access controls, model IP protection, and compute isolation

Watch the Full Presentation

The video recording for this talk is not yet available. Conference recordings are typically posted within a few weeks after the event.

Check back soon or follow TestifySec on social media for updates when the video becomes available.

Session Leaders

Frederick Kautz

Director of R&D, TestifySec

Fred brings deep expertise in Kubernetes security and software supply chain protection to the AI/ML space, focusing on how to secure and verify AI training pipelines at scale.

Ricardo Rocha

Engineer, CERN

Ricardo works on CERN's massive Kubernetes infrastructure, supporting both traditional physics simulations and cutting-edge ML workloads for particle physics research. His team manages one of the world's largest scientific computing clusters.

Alex Scammon

Engineer, G-Research

Alex leads infrastructure development at G-Research, where Kubernetes powers quantitative research and trading systems. His work focuses on high-performance computing and ML infrastructure for financial modeling.
