Humans of Reliability
Podcast Description
Behind every reliable software system, there are people working hard to keep it online. Humans of Reliability is a series that spotlights the engineers, leaders, and innovators at the heart of incident management and system reliability. Through candid conversations, we explore the challenges, lessons, and personal journeys of those navigating complex technical landscapes to ensure the systems we rely on run smoothly. From unforgettable incident stories to favorite tools, workflows, and hobbies, Humans of Reliability uncovers the human side of technology, offering insights and inspiration for anyone passionate about building and maintaining resilient systems.
https://rootly.com/humans-of-reliability
Podcast Insights
Content Themes
The show focuses on incident management, reliability engineering, and personal journeys within the tech industry. Example episodes include Google’s Steve McGee on SRE practices and Hannah Hammonds on the impact of mentorship in tech leadership, each highlighting key challenges and tools used in the field.

Shipping systems powered by LLMs would be hard enough if the models stayed the same. But in reality, they don’t. Models get updated and deprecated at a pace traditional software never sees, all while teams are still expected to hit reliability targets that look a lot like traditional SLAs.
In this episode, Tomás Hernando Koffman, Co-founder of Not Diamond, breaks down what it really takes to reach 99%+ accuracy when the underlying model is a moving target. He explains why non-determinism, model churn, and compounding errors make LLM reliability fundamentally different from classical software, and why “good enough” accuracy quickly falls apart in production.
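To make the compounding-error point concrete, here is a minimal Python sketch of the arithmetic. It is an illustration only and is not taken from the episode: it assumes a workflow made of independent steps that all share the same per-step accuracy, and the function name `end_to_end_accuracy` is hypothetical.

```python
def end_to_end_accuracy(step_accuracy: float, steps: int) -> float:
    """Probability that every step in a chain succeeds.

    Assumption (not from the episode): steps fail independently and
    share the same per-step accuracy, so the probabilities multiply.
    """
    if not 0.0 <= step_accuracy <= 1.0:
        raise ValueError("step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return step_accuracy ** steps


if __name__ == "__main__":
    # Compare a few per-step accuracies over a 10-step workflow.
    for p in (0.95, 0.99, 0.999):
        total = end_to_end_accuracy(p, 10)
        print(f"{p:.3f} per step over 10 steps: {total:.3f} end to end")
```

Under these assumptions, a step that is right 95% of the time yields a 10-step workflow that is right only about 60% of the time, which is one way to read why "good enough" per-call accuracy may not survive in production.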
Drawing from an SRE mindset, Tomás walks through a practical framework for building reliable LLM applications: evaluations and golden datasets, semantic metrics, structured context, and workflow design that treats prompts and instructions as part of the system architecture. We also dive into prompt optimization, why manual tuning doesn’t scale, and how teams can systematically improve accuracy even as models keep changing underneath them.
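As a rough illustration of what an evaluation against a golden dataset can look like, the Python sketch below scores a model function against a fixed set of question and expected-answer pairs. Everything in it is an assumption made for illustration rather than the framework described in the episode: the names `evaluate`, `normalize`, `GOLDEN`, and `stub_model` are hypothetical, the model is a hard-coded stand-in, and normalized exact match substitutes for the semantic metrics the episode mentions.

```python
from typing import Callable, Iterable, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count as errors."""
    return " ".join(text.lower().split())


def evaluate(model_fn: Callable[[str], str],
             golden: Iterable[Tuple[str, str]]) -> float:
    """Return the fraction of golden examples the model answers correctly.

    Scoring is normalized exact match, a simple stand-in for a semantic metric.
    """
    examples = list(golden)
    if not examples:
        raise ValueError("golden dataset is empty")
    correct = sum(
        normalize(model_fn(question)) == normalize(expected)
        for question, expected in examples
    )
    return correct / len(examples)


# Hypothetical golden dataset: (input, expected output) pairs.
GOLDEN = [
    ("capital of France?", "Paris"),
    ("2 + 2?", "4"),
    ("largest planet?", "Jupiter"),
]


def stub_model(question: str) -> str:
    """Stand-in for a real LLM call; it gets one answer wrong on purpose."""
    answers = {
        "capital of France?": "  paris ",
        "2 + 2?": "4",
        "largest planet?": "Saturn",
    }
    return answers.get(question, "")


if __name__ == "__main__":
    print(f"accuracy: {evaluate(stub_model, GOLDEN):.2f}")
```

Rerunning the same golden set whenever the underlying model or a prompt changes gives a comparable number across versions, which is the property that makes this kind of check useful when the model is a moving target.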

Disclaimer
This podcast’s information is provided for general reference and was obtained from publicly accessible sources. The Podcast Collaborative neither produces nor verifies the content, accuracy, or suitability of this podcast. Views and opinions belong solely to the podcast creators and guests.
For a complete disclaimer, please see our Full Disclaimer on the archive page. The Podcast Collaborative bears no responsibility for the podcast’s themes, language, or overall content. Listener discretion is advised. Read our Terms of Use and Privacy Policy for more details.