Defect management

Is perfection really so hard to achieve that you shouldn't even try?

There's a new fad in the world of software development, and every company that's worth its salt seems to be jumping on the bandwagon. The exact phrasing differs slightly from place to place, but it usually goes something like this:

Done is better than Perfect

The idea is quite simple and I fully understand the sentiment behind it, but I completely disagree with the message that it sends out.

Are "Done" and "Perfect" mutually exclusive; the inference is that we can only have one, and not the other? That the pursuit of perfection is so challenging that we just shouldn't waste time trying to achieve it? You can analyse it extensively.

Some will argue that I'm missing the point they are trying to make about the need to release, and maybe I am, but when you communicate with a message as strong as this, at least some people will over-analyse it and reach potentially alarming conclusions.

The IT industry does not have a problem with people sitting on software, fettling with every last detail in a never-ending pursuit for perfection. You don't have to look hard to find significant flaws in any website or application, regardless of platform or company size.

I think that the majority of users would rather our posters read something like: "Fix one more defect; it will make my experience so much better".

I think the problem the industry has misdiagnosed is actually an abundance of indecision, not an unnecessary pursuit of perfection.

Companies like Facebook are strong proponents of 'Done is better than perfect' - there are posters all over their office walls saying just that. While there is no denying Facebook's success as a company, correlation does not equal causation, and I ask you whether another company in another industry could get away with the decisions and mistakes that Facebook makes.

You can afford to get a lot of things wrong when you've got a monopoly, or at least a significant market share, but that's something that most companies don't have the luxury of; there's almost always a viable competitor waiting to capitalise on every imperfection in their product and process.

So what would my posters say?

Strive for perfection without becoming overly obsessed by it

It sends out a more positive message and comes closer to addressing the root cause, while simultaneously illustrating its own point - it isn't a perfect slogan, but it serves the intended purpose, and any more time spent on it would arguably be wasted.

But while it is closer, it still doesn't really address the root cause. Nobody wants to be the person who brings down the site, costing the company money; a degree of cautiousness, indecision and fear is somewhat expected. With great power comes great responsibility.

If a company is keen to achieve frequent releases with a degree of controlled risk, it shouldn't do so by saying 'imperfect releases are good', but rather by having a culture where 'mistakes are ok, as long as we learn from them'. It's not a problem that can be solved by posters; it has to be an ingrained culture that is fully supported and embraced from the bottom to the top of the organisation.

Sometimes the biggest critics aren't management, but rather a team's peers who openly criticise release decisions. Granted, this is sometimes warranted, but their energy would be better spent creating a more supportive and collaborative environment where mistakes don't happen in the first place - much as William Rogers did after the Challenger disaster.

One of the biggest threats of accepting imperfect software is that the line between what is acceptable and what is unacceptable becomes increasingly blurred. Imagine an arbitrary quality (or user-satisfaction) scale of 0-100%: once you start accepting 95%, you'll soon start accepting 93%, then 92%, and so on, until ultimately your users choose a competitor. When you strive for perfection it's still unlikely that you'll actually achieve it (in everybody's eyes at least), but you at least have a chance of getting close.

Imperfections are like broken windows - when you start accepting them they quickly appear everywhere

A/B testing, canary deployments (releasing to a subset of traffic as a [final] verification that the release is ok) and many other tools available to us today make perfection easier than ever to approach, and yet the industry seems content to use these tools as justification for releasing things that just aren't right. There are only so many times that you can frustrate and disappoint 0.1% of your active users before you've done irreparable damage. Imagine if a restaurant poisoned 0.1% of its customers - how long would it stay in business?
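To make the canary idea concrete, here's a minimal sketch - the 0.1% figure, the user-ID bucketing and the function name are illustrative assumptions, not any particular company's implementation - of deterministically routing a tiny fraction of users to a new release:

```python
import hashlib

CANARY_FRACTION = 0.001  # 0.1% of users - purely illustrative


def use_canary(user_id: str) -> bool:
    """Decide whether this user should be routed to the canary release.

    Hashing the user ID gives a stable, evenly distributed bucket in
    [0, 1), so the same user sees the same version for the whole test.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < CANARY_FRACTION


# e.g. route_to("canary" if use_canary(request.user_id) else "stable")
```

Hashing the user ID rather than picking randomly per request keeps each user on a single version, which is what makes the canary's feedback - and its frustration - attributable. Limiting the blast radius this way is not the same thing as the release being acceptable.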

As if my case weren't strong enough on its own, Facebook is down as I write this... http://techcrunch.com/2014/09/03/why-is-facebook-down/?ncid=rss

Metrics that matter in Scrum

One of the common problems with Agile, especially in larger companies, is the lack of reporting and of a natural source of metrics.

It's a natural side-effect of reduced documentation and reduced waste, but for companies that have recently moved from waterfall, or that simply rely on data for decision making, it can prove problematic.

A lack of awareness can also cause teams to miss slowly growing problems such as increasing defect backlogs.

If you're not interested in basic reports/metrics like burndown charts and velocity then you're probably not really doing Scrum, but there are other important metrics that you might not be looking at - defect backlog size, the time between stories being 'Development complete' and 'Test complete', and automated test execution time, to name but a few. (A sketch of how the second of those might be measured follows below.)
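As a rough sketch - assuming issues live in Jira (discussed below), that its REST API with expand=changelog is reachable, and that the workflow really uses statuses named 'Development complete' and 'Test complete' - the elapsed time can be read from an issue's changelog:

```python
from datetime import datetime
from typing import Optional

import requests

JIRA_URL = "https://jira.example.com"  # placeholder instance


def dev_to_test_days(issue_key: str, session: requests.Session) -> Optional[float]:
    """Days between an issue first entering 'Development complete'
    and first entering 'Test complete', taken from its status history."""
    resp = session.get(f"{JIRA_URL}/rest/api/2/issue/{issue_key}",
                       params={"expand": "changelog"})
    resp.raise_for_status()

    first_entered = {}  # status name -> datetime it was first reached
    for history in resp.json()["changelog"]["histories"]:
        # e.g. "2014-09-03T12:34:56.000+0100" - drop the offset for day-level maths
        when = datetime.strptime(history["created"][:19], "%Y-%m-%dT%H:%M:%S")
        for item in history["items"]:
            if item["field"] == "status":
                first_entered.setdefault(item["toString"], when)

    try:
        delta = first_entered["Test complete"] - first_entered["Development complete"]
    except KeyError:
        return None  # the issue hasn't been through both statuses yet
    return delta.total_seconds() / 86400
```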

Gathering metrics when you aren't using management tools such as Jira or TFS can be difficult, but if you're managing to keep on top of things without relying on tooling, there's a reasonable chance that things are going well.

It's arguably better for things to be working well without you really knowing why than for things to be failing for reasons you fully understand.

However, even if you are using tooling, gathering metrics probably isn't terribly easy, although the data is at least being stored. Across several companies and teams I have used a number of tools, most notably Jira, TFS and Trello. Each has its advantages and disadvantages, but a common factor is that generating important metrics is very difficult - to the point that most teams just do without.

Quite often the only way to extract the data you need is through the API, or by utilising a third-party plugin - but that's another discussion altogether. Still, the API route can amount to surprisingly little code, as the sketch below suggests.
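For illustration, a minimal sketch against Jira's standard REST search endpoint - the instance URL, credentials and JQL filter are all placeholders - is usually enough to log the defect backlog length each day:

```python
import requests

JIRA_URL = "https://jira.example.com"  # placeholder instance
JQL = "project = MYPROJ AND issuetype = Bug AND resolution = Unresolved"


def open_defect_count(session: requests.Session) -> int:
    """Count matching issues without fetching them.

    maxResults=0 asks the search API for the 'total' field only,
    which keeps the request cheap.
    """
    resp = session.get(f"{JIRA_URL}/rest/api/2/search",
                       params={"jql": JQL, "maxResults": 0})
    resp.raise_for_status()
    return resp.json()["total"]


if __name__ == "__main__":
    with requests.Session() as session:
        session.auth = ("metrics-bot", "api-token")  # placeholder credentials
        print(open_defect_count(session))
```

Run from a scheduled job and appended to a CSV or a small database table, that one number per day is all the raw material the trends below need.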

If we ignore the problem of retrieving the data, what data do we actually care about? Well, here are a few things that are of particular interest to me. Every team will of course have its own requirements, but these are probably relatively common. (A rough sketch of how such a trend might be tracked follows the list.)

  • Defect backlog length, trend over previous 60 days: It seems obvious, and yet it is so often overlooked. The team should be well aware of the length of their defect backlog and should constantly be looking to keep it as small as possible (see Approaching Defects in an Agile Project).

  • Automated test pass rate & automated build pass rate, trend over previous 20 days: The exact numbers are arguably not that important, but changing trends may be an early indicator of problems.

  • Ratio of unit/integration/system tests, trend over previous 60 days: Again, the exact numbers will vary from project to project, but a changing trend could be a sign that you are wasting time and energy testing too high in the stack.

  • Percentage of regression suite that is automated, trend over 60 days: It goes without saying that you really need your suite to be 100% automated, so if it isn't, you should be tracking this closely.

  • Production/escaped defects, trend over previous x releases: Self-explanatory, but possibly only a concern if you've got a noticeable number of production-found defects.

  • Code coverage, trend over previous 60 days: Code coverage and other static analysis are often overlooked in Agile, but the trend is still worth watching.
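As a loose illustration of what 'tracking the trend' could look like in practice - the CSV of dated daily values is a hypothetical stand-in for however you store the numbers above - a simple least-squares slope over the last 60 days is often enough to spot a drift:

```python
import csv
from datetime import date, timedelta


def slope_per_day(points: list[tuple[date, float]]) -> float:
    """Least-squares slope (change per day) of the supplied series.

    A rising defect backlog or a falling automated pass rate shows up
    here as a non-zero slope long before the absolute numbers look bad.
    """
    if len(points) < 2:
        return 0.0
    xs = [(d - points[0][0]).days for d, _ in points]
    ys = [v for _, v in points]
    n = len(points)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var = sum((x - mean_x) ** 2 for x in xs)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cov / var if var else 0.0


def last_60_days(path: str) -> list[tuple[date, float]]:
    """Read 'YYYY-MM-DD,value' rows and keep the most recent 60 days."""
    cutoff = date.today() - timedelta(days=60)
    with open(path) as f:
        rows = [(date.fromisoformat(d), float(v)) for d, v in csv.reader(f)]
    return [(d, v) for d, v in rows if d >= cutoff]


if __name__ == "__main__":
    print(slope_per_day(last_60_days("defect_backlog.csv")))  # hypothetical file
```

Plotting the raw series is usually more persuasive in a retrospective, but a single slope per metric is easy to alert on.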

Anything that I've missed? Leave a comment...