AWS For AI

Podcast Description
Decoding The Future of Artificial Intelligence with AWS: Explore the frontiers of artificial intelligence with AWS For AI, your insider guide to the technologies reshaping our world. Each episode brings you face-to-face with the brilliant minds behind groundbreaking AI innovations, from pioneering researchers to executives transforming businesses with generative AI.
Podcast Insights
Content Themes
The podcast covers a range of themes in artificial intelligence, including generative AI, emotional AI, and enterprise workflow automation. Episodes delve into topics such as the rise of small language models, human-centric AI applications, and innovations like the Opus large work model, offering detailed insights into AI trends and tools and equipping listeners with actionable knowledge for their professional pursuits.

Join us for an enlightening conversation with Anton Alexander, AWS’s Senior Specialist for Worldwide Foundation Models, as we delve into the complexities of training and scaling large foundation models. Anton brings his unique expertise from working with the world’s top model builders, along with his fascinating journey from Trinidad and Tobago to becoming a leading AI infrastructure expert.
Discover practical insights on managing massive GPU clusters, optimizing distributed training, and handling the critical challenges of model development at scale. Learn about cutting-edge solutions in GPU failure detection, checkpointing strategies, and the evolution of inference workloads. Get an insider’s perspective on emerging trends like GRPO, visual LLMs, and the future of AI model development.
Don’t miss this technical deep dive where we explore real-world solutions for building and deploying foundation models, featuring discussions on everything from low-level infrastructure optimization to high-level AI development strategies.
Learn more: http://go.aws/47yubYq
Amazon SageMaker HyperPod: https://aws.amazon.com/fr/sagemaker/ai/hyperpod/
The Llama 3 Herd of Models paper: https://arxiv.org/abs/2407.21783
Chapters:
00:00:00 : Introduction and Guest Background
00:01:18 : Anton's Journey from the Caribbean to AI
00:05:52 : Mathematics in AI
00:07:20 : Large Model Training Challenges
00:09:54 : GPU Failures: Llama Herd of Models
00:13:40 : Grey Failures
00:15:05 : Model Training Trends
00:17:40 : Managing Mixture of Experts Models
00:21:50 : Estimating How Many GPUs You Need
00:25:12 : Monitoring the Loss Function
00:27:08 : Handling Training Crashes
00:28:10 : SageMaker HyperPod Story
00:32:15 : How We Automate Managing Grey Failures
00:37:28 : Which Metrics to Optimize For
00:40:23 : Checkpointing Strategies
00:44:48 : USE Method: Utilization, Saturation, Errors
00:50:11 : SageMaker HyperPod for Inference
00:54:58 : Resiliency in Training vs. Inference Workloads
00:56:44 : NVIDIA NeMo Ecosystem and Agents
00:59:49 : Future Trends in AI
01:03:17 : Closing Thoughts

Disclaimer
This podcast’s information is provided for general reference and was obtained from publicly accessible sources. The Podcast Collaborative neither produces nor verifies the content, accuracy, or suitability of this podcast. Views and opinions belong solely to the podcast creators and guests.
For a complete disclaimer, please see our Full Disclaimer on the archive page. The Podcast Collaborative bears no responsibility for the podcast’s themes, language, or overall content. Listener discretion is advised. Read our Terms of Use and Privacy Policy for more details.