October 2024 · KubeCon North America · BoF · 45 minutes

AI Training in Kubernetes

Frederick Kautz, TestifySec
Co-presented with Ricardo Rocha, CERN; Alex Scammon, G-Research

Bringing Together the AI/ML Kubernetes Community

At KubeCon North America 2024, Frederick Kautz co-led a Birds-of-a-Feather (BoF) session for practitioners working on AI training in Kubernetes. Together with Ricardo Rocha from CERN and Alex Scammon from G-Research, he used this interactive session to tackle the real-world challenges of running large-scale AI training workloads on Kubernetes.

This wasn't your typical conference talk: BoF sessions are interactive discussions where practitioners share experiences, challenges, and solutions. The combination of perspectives from TestifySec (security and compliance), CERN (massive-scale scientific computing), and G-Research (financial modeling) created a rich dialogue about the state of AI infrastructure.

Why This Discussion Matters

As organizations rush to adopt AI, many discover that Kubernetes, while excellent for traditional microservices, wasn't originally designed for the unique demands of ML workloads. Training jobs that run for days or weeks, require expensive GPU resources, and process terabytes of data push Kubernetes to its limits. This session captured the community's collective wisdom on overcoming these challenges.

What emerged was a candid discussion about the gaps between what ML engineers need and what Kubernetes currently provides, along with innovative solutions being developed by organizations at the cutting edge of AI infrastructure.

Key Takeaways

1. Kubernetes needs specialized schedulers and operators to handle the unique requirements of AI training workloads.
2. GPU resource management remains one of the biggest challenges, requiring careful orchestration and sharing strategies.
3. Storage performance and data locality are critical bottlenecks that can make or break AI training efficiency.
4. Multi-tenancy for AI workloads requires new isolation strategies beyond traditional Kubernetes namespaces.
5. The community is converging on common patterns for distributed training, but standardization is still evolving.
6. Security considerations for AI training include data access controls, model IP protection, and compute isolation.

Watch the Full Presentation

The video recording for this talk is not yet available. Conference recordings are typically posted within a few weeks after the event.

Check back soon or follow TestifySec on social media for updates when the video becomes available.

Session Leaders

Frederick Kautz

Director of R&D, TestifySec

Fred brings deep expertise in Kubernetes security and software supply chain protection to the AI/ML space, focusing on how to secure and verify AI training pipelines at scale.

Ricardo Rocha

Engineer, CERN

Ricardo works on CERN's massive Kubernetes infrastructure, supporting both traditional physics simulations and cutting-edge ML workloads for particle physics research. His team manages one of the world's largest scientific computing clusters.

Alex Scammon

Engineer, G-Research

Alex leads infrastructure development at G-Research, where Kubernetes powers quantitative research and trading systems. His work focuses on high-performance computing and ML infrastructure for financial modeling.
