Root Cause Analysis: Case Study

Root cause analysis is the process of investigating a non-conformance by tracing the chain of events back to the cause that started it. The methodology is used in healthcare, telecommunications, manufacturing, software development, IT operations, and other fields. It makes systems and processes more reliable, which improves the quality of your products and services. Let us take a look at a root cause analysis example:

Case Study

PagerDuty is an American computer software company. They had a country launch in 25 hours when requests started failing with 5XX errors, and their public API became unreachable. A superficial investigation found that most of the logs were coming in at 1 Mbps, but there was no correlation ID to isolate exactly where the issue occurred. The system was on auto-resolve, so it was not long before the HTTP 500 errors stopped and things went back to normal.
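To illustrate what a correlation ID adds to logs, here is a minimal Python sketch using the standard logging module. It is not PagerDuty's implementation; the names CorrelationIdFilter, build_request_logger, and demo, as well as the ID "req-42" and the log message, are assumptions made for this example.

```python
import io
import logging


class CorrelationIdFilter(logging.Filter):
    """Stamp every log record with the correlation ID of one request."""

    def __init__(self, correlation_id: str) -> None:
        super().__init__()
        self.correlation_id = correlation_id

    def filter(self, record: logging.LogRecord) -> bool:
        # Add the ID as an attribute so the formatter can print it.
        record.correlation_id = self.correlation_id
        return True  # never drop the record, only annotate it


def build_request_logger(correlation_id: str, stream) -> logging.Logger:
    """Return a logger whose lines all start with the given correlation ID."""
    logger = logging.getLogger(f"request.{correlation_id}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the output on our handler only

    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(correlation_id)s %(levelname)s %(message)s")
    )

    # Reset handlers and filters so repeated calls do not duplicate output.
    logger.handlers.clear()
    logger.filters.clear()
    logger.addHandler(handler)
    logger.addFilter(CorrelationIdFilter(correlation_id))
    return logger


def demo() -> str:
    """Log one hypothetical failure and return the captured log text."""
    buffer = io.StringIO()
    logger = build_request_logger("req-42", buffer)
    logger.error("HTTP 500 while reading shard")
    return buffer.getvalue()
```

With an ID like this on every line, all log entries that belong to one failing request can be filtered out of the stream, which is what the teams in this case could not do.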

However, only 5 minutes later, the errors returned, and a repeating cycle of HTTP 500 errors and auto-resolution began. The disaster management and systems quality teams initially suspected that a new deployment was causing the issue, but the SRE confirmed that no new deployment had taken place. Interestingly, all the supporting tools were working fine, and the firewall did not show a drop in traffic.

The teams quickly turned to root cause analysis, using different techniques, and pinpointed the issue within 20 hours. The problem was caused by a missing mount command on a database shard. Without the mount, data was written in memory and was wiped as soon as the system rebooted.

The quality and management teams gathered as much information about the problem as they could, and the piece that helped them was that only a specific section of data was missing. That meant that only a specific type of request was failing. Through root cause analysis, they found that the machine they ran commands on was faulty. It was newly provisioned, and the service team had not discovered the issue with it. However, root cause analysis does not stop there, as there is still a "why" left to answer. Potential answers (root causes) included:

  • The individual who was responsible for executing the mount command did not do their job correctly.
  • The whole infrastructure team is at fault.

Both root causes were clearly human faults, and the obvious way to make sure they do not happen again would be to replace the individuals. However, that is not a sure-shot solution, as human mistakes are bound to happen no matter how skilled a person or how efficient a team is. With the problem identified, the next step was to devise a strategy and plan to solve it. Before settling on a remediation plan, the teams had to make sure that the same mistake had not been repeated elsewhere and that the strategy included a long-term solution that would keep the issue from recurring.

The teams came up with the following potential solutions:

  • Disallow SSH
  • Build a system configuration validator

The problem with the first solution was that it could hinder the production, DevOps, and QA teams, who needed to SSH into the systems to perform their basic job duties. That would not have helped with efficient production or quality management. The second solution was a much better fit, so the company built system configuration validators and made running them frequently over their systems a part of its routine.
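A system configuration validator of this kind can be quite small. The sketch below is only an illustration in Python, not the validator the company built: it checks that every expected mount point appears in text in the format of Linux's /proc/mounts. The function names and the example paths are assumptions.

```python
from typing import Iterable, List, Set


def parse_mount_points(mounts_text: str) -> Set[str]:
    """Return the mount points listed in /proc/mounts-style text.

    Each line has the form: device mount_point fs_type options dump pass.
    Mount points containing spaces appear escaped (as \\040) in that file;
    this sketch does not decode them.
    """
    points = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            points.add(fields[1])  # second field is the mount point
    return points


def find_missing_mounts(expected: Iterable[str], mounts_text: str) -> List[str]:
    """Return the expected mount points that are not currently mounted."""
    mounted = parse_mount_points(mounts_text)
    return [path for path in expected if path not in mounted]
```

On a Linux host, a scheduled job could read /proc/mounts, pass its contents to find_missing_mounts together with the list of mounts each database shard is supposed to have (for example a hypothetical /data/shard1), and raise an alert whenever the returned list is not empty.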

This shows us that with the help of root cause analysis, software, IT, manufacturing, and all other types of corporations can ensure efficient production and maintain quality management processes that keep failures and problems in the systems at bay.
