Scientific reasoning is a rare skill. I'm regularly met with bafflement and blank stares when I suggest that we measure something to determine performance, then perform the same workload, on the same hardware, after making a change.
I've literally had people try to draw conclusions from comparisons of test runs with completely different parameters, different data sets, different resources, different versions of the code: absolutely everything varying.
My head just explodes... I want to scream that this isn't how this works; it's not how any of this works.
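The controlled comparison described above can be sketched with the standard library alone. This is a minimal, hypothetical example (the two functions and the input size are made up): the input data, machine, and run counts are held fixed so that only the code under test varies.

```python
import statistics
import timeit

# Fixed input shared by both implementations, so the data set is
# identical across runs and only the code under test differs.
DATA = list(range(10_000))

def baseline():
    return sum(x * x for x in DATA)

def candidate():
    return sum(map(lambda x: x * x, DATA))

def bench(fn, repeats=20, number=50):
    # Repeat the measurement and report the median, which is more
    # robust to scheduler noise than a single run or the mean.
    times = timeit.repeat(fn, repeat=repeats, number=number)
    return statistics.median(times)

print(f"baseline:  {bench(baseline):.6f}s")
print(f"candidate: {bench(candidate):.6f}s")
```

The point is not the specific functions, but that every knob except the change itself stays constant between the two measurements.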
Unfortunately, I often see developers being asked for complex diagnostics in urgent emergencies, when the reaction window is literally something like 30 minutes.
So from a business point of view, there first has to be a quick-and-dirty, imperfect analysis, because that's what emergency decisions depend on.
Sometimes you can just extend the cache lifetime as a quick fix, but sometimes there is no such option.
Programmers like a nice, calm working environment, but business is often a very different place. So please keep the business goals in mind.
And yes, I know there are a lot of bad managers in IT, and you're basically right that we should use precise data. Just one point: when it's available...
The first thing I'm likely to say to my team when an outage occurs is something along the lines of "stop the bleeding". That might mean bypassing the affected service if possible, rolling back a recent release, or reducing the amount of traffic (we're fortunate enough to have most of our traffic coming from sources we can throw a kill switch on).
However we go about it, the first priority is to give ourselves some space to properly analyse the issue and find the real solution without the rest of the business worrying loudly about things being broken.
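A traffic kill switch like the one mentioned above can be as simple as a flag that non-essential traffic sources check before sending work. This is a hypothetical sketch (the flag name, the sources, and the `handle_request` helper are all invented for illustration):

```python
import os

def kill_switch_active() -> bool:
    # Hypothetical flag an operator can flip during an incident,
    # e.g. via the environment or a config service.
    return os.environ.get("TRAFFIC_KILL_SWITCH", "0") == "1"

def handle_request(source: str, essential: bool) -> str:
    # Non-essential traffic is shed while the switch is on,
    # giving the team breathing room to analyse the real issue.
    if kill_switch_active() and not essential:
        return "shed"
    return "served"

os.environ["TRAFFIC_KILL_SWITCH"] = "1"
print(handle_request("batch-importer", essential=False))  # shed
print(handle_request("checkout", essential=True))         # served
```

In practice the flag would live in a config service rather than an environment variable, but the shape is the same: one cheap check on the hot path, one lever for the operator.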
Managing those experiences takes effort, otherwise you're sure to end up with the issues you describe.
Even when the data makes sense, people sometimes hand you an unparsable CSV of results, while the conclusions come from gut feeling. Sadly, not everybody understands standard tools like boxplots, and analyses are often hard to reproduce.
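For what it's worth, the statistics behind a boxplot need nothing beyond the standard library. A minimal sketch, with made-up latency numbers standing in for real measurements:

```python
import statistics

def five_number_summary(samples):
    # The five numbers a boxplot draws: min, Q1, median, Q3, max.
    q1, median, q3 = statistics.quantiles(samples, n=4)
    return min(samples), q1, median, q3, max(samples)

# Hypothetical request latencies in ms; the 55.3 outlier is exactly
# what a boxplot makes visible and a mean quietly hides.
latencies_ms = [12.1, 11.8, 12.4, 13.0, 11.9, 12.2, 55.3, 12.0]
lo, q1, med, q3, hi = five_number_summary(latencies_ms)
print(f"min={lo} q1={q1:.2f} median={med:.2f} q3={q3:.2f} max={hi}")
```

Even just printing these five numbers next to a claim makes the comparison reproducible in a way a gut feeling never is.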
I think you have a very valid point there, but I would argue it is also important not to react to such situations too strongly.
This is something April Wensel talks about here [1].