Post-mortems as a routine practice in software development

Think about a time when your error tracking software reported a crash in your application, along with a stack trace. The error message might have been enough to pin down the issue, or you might have spent some time studying the stack trace and the code, eventually come up with a fix, sent a patch, and moved on.

But how many times did you go back to the set of changes that introduced the problem and actively think about how you could have prevented it in the first place? There are excellent tools to help you find the first “bad” patch, such as git bisect, but developers promote them as tools to aid in writing a fix, and not as tools for reflecting on the development process.

Post-mortems are a well-known practice for reflecting on what went wrong and correcting course. They encourage a good amount of introspection, and are usually great for identifying and fixing process and cultural issues at their root. The problem? Post-mortems are rarely applied on a regular basis, and we tend to reserve them for the most catastrophic failures, such as major incidents and failed startups.

Error reports are great opportunities to reap the benefits of this practice at a lower level. Imagine if your error reporting software could parse stack traces, determine the first version of the code in which the error occurred, find the commits included in that release, analyse the diffs, and post comments on the relevant pull request explaining what happened, right on the lines that the stack trace mentions. Like this:

Inline error information as a pull-request comment

The authors and reviewers then receive a notification. The pull request is not only the best place to reflect on what went wrong, but also the place with the most context for debugging the problem, as it includes the description of the change, links to the issues it fixed, build and coverage results, and more.
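To make the matching step more concrete, here is a minimal sketch of how stack frames could be lined up with the lines a change added. It assumes Python-style tracebacks and unified diffs as produced by git, and all function names are hypothetical, invented for illustration. Finding the release and actually posting the comment are left out, since they depend on your error tracker and code hosting service.

```python
import re
from collections import defaultdict

# Matches frames such as:  File "/srv/app/models.py", line 40, in price
FRAME_RE = re.compile(r'^\s*File "(?P<path>[^"]+)", line (?P<line>\d+)', re.MULTILINE)
# Matches hunk headers such as:  @@ -38,3 +38,4 @@
HUNK_RE = re.compile(r'^@@ -\d+(?:,(\d+))? \+(\d+)(?:,(\d+))? @@')


def parse_frames(stack_trace):
    """Return (path, line_number) pairs found in a Python-style traceback."""
    return [(m.group("path"), int(m.group("line")))
            for m in FRAME_RE.finditer(stack_trace)]


def added_lines(diff_text):
    """Map each file in a unified diff to the new-file line numbers it added."""
    added = defaultdict(set)
    path = None
    new_line = 0
    old_left = new_left = 0  # lines still expected in the current hunk
    for raw in diff_text.splitlines():
        if old_left > 0 or new_left > 0:
            if raw.startswith("+"):
                if path is not None:
                    added[path].add(new_line)
                new_line += 1
                new_left -= 1
            elif raw.startswith("-"):
                old_left -= 1
            elif raw.startswith("\\"):
                pass  # "\ No newline at end of file" marker
            else:  # context line, present in both old and new file
                new_line += 1
                old_left -= 1
                new_left -= 1
            continue
        if raw.startswith("+++ "):
            target = raw[4:].split("\t")[0]
            # git prefixes the new path with "b/"
            path = target[2:] if target.startswith("b/") else target
            continue
        m = HUNK_RE.match(raw)
        if m:
            old_left = int(m.group(1) or 1)
            new_line = int(m.group(2))
            new_left = int(m.group(3) or 1)
    return added


def frames_touched_by_diff(stack_trace, diff_text):
    """Return (diff_path, line) for every frame that lands on an added line.

    Assumption: a traceback path refers to a diff path when it is equal to it
    or ends with "/" + the diff path (deployed paths are usually absolute).
    """
    added = added_lines(diff_text)
    hits = []
    for frame_path, line in parse_frames(stack_trace):
        for diff_path, lines in added.items():
            same_file = frame_path == diff_path or frame_path.endswith("/" + diff_path)
            if same_file and line in lines:
                hits.append((diff_path, line))
    return hits
```

Each pair returned by frames_touched_by_diff is a candidate location for an inline comment. Matching on exact line numbers only works when the diff is taken against the same version of the code that produced the stack trace, which is why determining the first failing release comes first.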

Now pair that up with a tight-knit team that allocates time every week to go together through all the new errors reported in recently merged PRs, brainstorming ways in which the problems could have been prevented. The entire team would gain a deeper understanding of the errors happening in production and what caused them, and develop a preventive rather than reactive culture where thinking about how to write good code is a first-class citizen of the development process.

Imagine what such a team would achieve.