Operational Excellence (OE) Culture at Amazon
How Amazon Uses Operational Excellence Meetings, Key Metrics, and Continuous Improvement to Maintain High Standards and Swiftly Resolve Issues
Introduction
Operational Excellence (OE) is a core value at Amazon, ensuring high standards in service delivery and operational efficiency. During my three years as a Software Developer at Amazon, I saw firsthand how the OE culture is embedded in the daily practices of software teams. This essay explores Amazon's OE culture, the structured processes followed, and the impact of these practices on overall performance, with a focus on my experience as part of a platform team.
Weekly OE Meetings and On-Call Responsibilities
At the heart of Amazon's OE culture are the weekly OE meetings, typically held on Monday mornings. These meetings are crucial for addressing issues from the previous week promptly. Each week, a designated software engineer takes on the role of the on-call person, responsible for OE activities. This on-call duty rotates among team members, with each person spending about 20% of their time on these tasks. Using PagerDuty, the on-call engineer receives alerts for high-severity issues, ensuring swift responses and maintaining operational standards.
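The rotation described above can be sketched in a few lines. This is only an illustration: the engineer names, the function name on_call_schedule, and the start date are hypothetical, and the team size of five is an assumption chosen because one on-call week in five matches the roughly 20% figure.

```python
from datetime import date, timedelta

def on_call_schedule(engineers, first_monday, weeks):
    """Assign one on-call engineer per week, round-robin, starting on a Monday."""
    if not engineers:
        raise ValueError("at least one engineer is required")
    if first_monday.weekday() != 0:  # Monday is 0 in the datetime module
        raise ValueError("the rotation is assumed to start on a Monday")
    return [
        (first_monday + timedelta(weeks=i), engineers[i % len(engineers)])
        for i in range(weeks)
    ]

# Hypothetical five-person team: each engineer is on call one week in five.
team = ["asha", "ben", "chen", "dana", "eli"]
for week_start, engineer in on_call_schedule(team, date(2024, 1, 1), 10):
    print(week_start.isoformat(), engineer)
```

With five people in the rotation, each engineer covers two of every ten weeks, which is where a share of about 20% would come from.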
Key Metrics: Availability, Percentiles, and Error Rates
For platform teams like ours, three key metrics were crucial: service availability, response times, and error rates. Service availability, expressed as a percentage, measured how much of the time our services were up and accessible to users. The 99th and 99.9th percentiles of service response times captured tail latency: rather than looking at the average, we checked that even the slowest 1% and 0.1% of requests stayed within acceptable limits. Additionally, we aimed for a zero error rate, tracking both system errors (originating from our own services) and dependency errors (caused by external services we relied on). These metrics were reviewed at every OE meeting. If any downtime or performance degradation had occurred, we analysed the root cause and reviewed the related tickets and alarms. Our goal was to automate monitoring and alarms as much as possible, so that our systems informed us of issues before customers were affected. This proactive approach allowed us to catch issues early and maintain high standards of service reliability and performance.
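As a rough illustration of how these three metrics relate to raw request data, here is a minimal sketch. The Request record, the summarize function, and the nearest-rank percentile method are all assumptions made for the example, and availability is approximated here as the share of requests that succeeded, which is only one of several possible definitions of uptime.

```python
import math
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Request:
    latency_ms: float
    # None for success, otherwise "system" or "dependency" (hypothetical labels)
    error_source: Optional[str] = None

def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]

def summarize(requests: List[Request]) -> dict:
    """Compute request-based availability, split error rates, and tail latency."""
    total = len(requests)
    if total == 0:
        raise ValueError("no requests")
    system = sum(1 for r in requests if r.error_source == "system")
    dependency = sum(1 for r in requests if r.error_source == "dependency")
    latencies = [r.latency_ms for r in requests]
    return {
        "availability_pct": 100 * (total - system - dependency) / total,
        "system_error_rate_pct": 100 * system / total,
        "dependency_error_rate_pct": 100 * dependency / total,
        "p99_ms": percentile(latencies, 99),
        "p99_9_ms": percentile(latencies, 99.9),
    }

# 100 synthetic requests with latencies 1..100 ms, one error of each kind.
sample = [Request(latency_ms=float(i)) for i in range(1, 101)]
sample[9].error_source = "system"
sample[19].error_source = "dependency"
print(summarize(sample))
```

Keeping system and dependency errors as separate counters is what makes it possible to tell, at a glance, whether a drop in availability started in our own service or further down the chain.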
Review of Customer Queries and Production Issues
We started each OE meeting by reviewing key performance metrics, such as system uptime, response times, and customer satisfaction scores. This helped us identify trends or anomalies that needed attention. After this review, we examined customer queries and production issues from the past week. Issues were prioritised based on impact and frequency, with the most critical ones selected for detailed discussion. The same review also informed leadership and management of any high-severity issues affecting the product or the company. Once leadership was up to speed, they could help mitigate those issues and identify the right action items swiftly.
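Prioritising by impact and frequency can be sketched as a simple scoring function. The Issue record, the severity weights, and the limit of three issues per meeting are hypothetical values chosen for illustration, not figures from the process described here.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical weights: one high-severity issue outranks several low-severity ones.
SEVERITY_WEIGHT = {"high": 100, "medium": 10, "low": 1}

@dataclass
class Issue:
    title: str
    severity: str      # "high", "medium", or "low"
    occurrences: int   # how many times it was seen in the past week

def priority_score(issue: Issue) -> int:
    """Combine impact (severity weight) and frequency (occurrence count)."""
    return SEVERITY_WEIGHT[issue.severity] * issue.occurrences

def select_for_discussion(issues: List[Issue], limit: int = 3) -> List[Issue]:
    """Return the highest-scoring issues, most critical first."""
    return sorted(issues, key=priority_score, reverse=True)[:limit]

week = [
    Issue("Slow search results", "low", 40),
    Issue("Checkout returns errors", "high", 2),
    Issue("Stale cache entries", "medium", 5),
    Issue("Typo in email footer", "low", 1),
]
for issue in select_for_discussion(week):
    print(priority_score(issue), issue.title)
```

A multiplicative score is only one option; a team could equally sort by severity first and use frequency as a tie-breaker.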
Ticket Template for Issue Tracking
For all tickets received, a specific template was followed to ensure clarity and completeness. This template included:
Description of the Problem: A detailed account of the issue.
Root Cause: Identification of the underlying cause.
Mitigation: Immediate actions taken to address the issue.
Customer Impact: Assessment of how customers were affected.
Action Items: Steps to prevent recurrence of the issue.
Having a clear ticket summary was crucial so that everyone could easily understand the problem and provide critical inputs to optimise the overall solution. In meetings, summaries of high-severity issues were read aloud, ensuring that all participants were fully aware of the details and could contribute effectively to the discussion. Additionally, we tracked the number of tickets received and the total number of action items for all tickets. Even for low-severity issues, we aimed to reduce the number of tickets through automation and automatic fixes.
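The five-part template and the two counts we tracked could be modelled roughly as follows. The class and field names are hypothetical; they mirror the template headings above rather than any real ticketing system's schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Ticket:
    description: str       # Description of the Problem
    root_cause: str        # Root Cause
    mitigation: str        # Mitigation
    customer_impact: str   # Customer Impact
    action_items: List[str] = field(default_factory=list)  # Action Items

    def missing_sections(self) -> List[str]:
        """List template sections that are still empty."""
        text_sections = ["description", "root_cause", "mitigation", "customer_impact"]
        missing = [name for name in text_sections if not getattr(self, name).strip()]
        if not self.action_items:
            missing.append("action_items")
        return missing

def weekly_counts(tickets: List[Ticket]) -> Dict[str, int]:
    """Number of tickets and total action items across them."""
    return {
        "tickets": len(tickets),
        "action_items": sum(len(t.action_items) for t in tickets),
    }

complete = Ticket(
    description="Requests to the orders API timed out for ten minutes.",
    root_cause="Connection pool exhausted after a configuration change.",
    mitigation="Rolled back the configuration change.",
    customer_impact="Some customers could not place orders during the incident.",
    action_items=["Add an alarm on pool saturation", "Review the rollout checklist"],
)
draft = Ticket(description="Dashboard widget shows no data.", root_cause="",
               mitigation="", customer_impact="")

print(draft.missing_sections())
print(weekly_counts([complete, draft]))
```

A check like missing_sections makes it easy to spot incomplete tickets before the meeting, so that the summaries read aloud are complete.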
Root Cause Analysis (RCA) and Long-Term Fixes
Conducting RCA is a crucial part of the OE meetings. For each selected issue, the team discusses its context, impact, and initial findings, then assigns responsibility for conducting RCA. We also review RCAs from the previous week to validate root causes and brainstorm potential solutions. Immediate actions to prevent recurrence are decided upon during the meetings, and long-term fixes are planned, with resources and timelines allocated.
Monitoring and Alarms
Monitoring and setting alarms are essential aspects of Amazon's OE culture. Engineers are encouraged to have a clear view of their metrics and to set up monitors and alarms to catch issues early. We maintained a customisable dashboard that displayed all key metrics in one place. Typically, we used Amazon CloudWatch to create these dashboards, but other solutions like Grafana or Datadog could also be used. These dashboards were vital for tracking service availability, response times, and error rates, allowing for quick identification and resolution of issues. Keeping the error rate at zero was paramount for our platform team. We tracked system errors and dependency errors separately to pinpoint the exact source of problems within the service chain. This proactive approach enabled faster resolution of problems and helped maintain high operational standards.
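To make the idea of an alarm concrete, here is a minimal sketch of a threshold alarm that fires when enough recent datapoints breach a limit. It is a standalone illustration, not the API of CloudWatch, Grafana, or Datadog; the class name ThresholdAlarm, its parameters, and the 500 ms threshold are all assumptions.

```python
from collections import deque

class ThresholdAlarm:
    """Fires when at least `datapoints_to_alarm` of the last
    `evaluation_periods` datapoints exceed `threshold`."""

    def __init__(self, threshold: float, datapoints_to_alarm: int, evaluation_periods: int):
        if not 1 <= datapoints_to_alarm <= evaluation_periods:
            raise ValueError("datapoints_to_alarm must be between 1 and evaluation_periods")
        self.threshold = threshold
        self.datapoints_to_alarm = datapoints_to_alarm
        # Sliding window of booleans: True means the datapoint breached the threshold.
        self.window = deque(maxlen=evaluation_periods)

    def record(self, value: float) -> str:
        """Add one datapoint and return the resulting state."""
        self.window.append(value > self.threshold)
        breaches = sum(self.window)
        return "ALARM" if breaches >= self.datapoints_to_alarm else "OK"

# Hypothetical p99 latency readings in ms, one per minute.
alarm = ThresholdAlarm(threshold=500, datapoints_to_alarm=2, evaluation_periods=3)
for p99 in [120, 650, 300, 700, 710, 200, 150]:
    print(p99, alarm.record(p99))
```

Requiring two breaches out of three datapoints, rather than one, is a common way to avoid paging the on-call engineer for a single transient spike while still reacting within a few minutes.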
Continuous Improvement and Feedback Loop
The OE process at Amazon is designed to foster continuous improvement. Feedback from all participants is encouraged to refine and enhance the OE process. The team tracks progress and updates processes or documentation based on learnings from each issue discussed. This continuous feedback loop ensures that the team is always improving and adapting to new challenges.
Conclusion
Operational Excellence is a fundamental part of Amazon's culture, driving the company to maintain high standards of service and operational efficiency. The structured approach of weekly OE meetings, on-call responsibilities, thorough RCA, and proactive monitoring ensures that issues are addressed promptly and effectively. By embedding these practices into the daily routine, Amazon fosters a culture of continuous improvement and operational excellence. For start-ups and other companies looking to improve their operations, adopting similar practices can lead to significant benefits in service reliability and customer satisfaction.