Humans of Reliability
Podcast Description
Behind every reliable software system, there are people working hard to keep it online.

Humans of Reliability is a series that spotlights the engineers, leaders, and innovators at the heart of incident management and system reliability. Through candid conversations, we explore the challenges, lessons, and personal journeys of those navigating complex technical landscapes to ensure the systems we rely on run smoothly.

From unforgettable incident stories to favorite tools, workflows, and hobbies, Humans of Reliability uncovers the human side of technology—offering insights and inspiration for anyone passionate about building and maintaining resilient systems.

https://rootly.com/humans-of-reliability
Podcast Insights
Content Themes
The show focuses on themes such as incident management, reliability engineering, and personal journeys within the tech industry. Episode examples include insights on SRE practices from Google’s Steve McGee and the impact of mentorship in tech leadership from Hannah Hammonds, highlighting the key challenges and tools used in the field.

Only 50% of companies monitor their ML systems. Building observability for AI is not simple: it goes beyond 200 OK pings. In this episode, Sylvain Kalache sits down with Conor Brondsdon (Galileo) to unpack why observability, monitoring, and human feedback are the missing links to making large language models (LLMs) reliable in production.
Conor dives into the shift from traditional test-driven development to evaluation-driven development, where metrics like context adherence, completeness, and action advancement replace binary pass-fail checks. He also shares how teams can blend human-in-the-loop feedback, automated guardrails, and small language models to keep AI accurate, compliant, and cost-efficient at scale.
