Of the many non-functional requirements delivered with a project, recoverability remains one of the critical requirements that need to be tested thoroughly.
If you lead the development of any software product or system that supports critical business functions, your team will be involved in extensive recovery testing. In this article, I explain exactly what that involves.
A quick primer about recovery testing
For the uninitiated, Recovery Testing (often called Disaster Recovery testing) is a non-functional test that checks how quickly a system recovers after a crash or failure.
Disaster Recovery is part of the larger Business Continuity Planning (BCP) exercise. While BCP involves keeping all essential aspects of a business functioning when there is a significant disruption to your business, the focus of recovery testing is to get IT or technology systems that support critical business functions back up and running in the shortest possible time.
With recovery testing, you force-fail your product or system to see how quickly it can recover to normalcy.
For example, let’s say you’re building a staff front-end system to support customer queries when they call in to your call centre. Recovery Testing will help you understand how quickly the system can recover after an outage, during which time call centre staff won’t have access to the system, and therefore won’t be able to assist customers effectively.
Organisations prepare Business Continuity Plans and regularly test Recoverability for disasters much worse than a server outage, such as natural disasters, cyberattacks, terrorist attacks and acts of war.
Good planning and frequent recovery testing make a big difference to the business, its customers, and human life. For instance, when business in parts of Hong Kong came to a standstill for nearly three months during the Umbrella movement in 2014, companies that had prepared a good BCP and regularly tested Recoverability found that they could fall back to alternative arrangements and run business as usual despite the widespread disruptions.
On the other hand, poor BCP and recovery testing contributed to days of unplanned mobile network outage in Chennai, India, in 2015, when unprecedented rain and the resulting floods damaged the city. The losses were significant: infrastructure, money and, most importantly, lives that might have been saved had the mobile network been available for people to make SOS calls.
We know and agree that any system that supports critical business functions needs to be reliable. Any outage will have a direct impact on the bottom line and, more importantly, on your organisation’s reputation. Recovery testing helps you prepare for, and minimise, such outages.
Before we get to the detail, let us examine the difference between recovery and reliability.
Yes – they are related.
And no – they are not the same.
Reliability testing revolves around checking how well your product handles unexpected scenarios, and how frequently (or infrequently) it fails when pushed too far. The less frequent the failures, the better the product’s reliability.
Recovery, reliability’s sibling, looks at how quickly your product can get back on its feet when it does fail.
So while reliability testing helps you deploy a stable product, recovery testing measures its ability to bounce back quickly from outages.
Now that we’ve established the basics, let us look at the what/who/when/why/how of recovery testing.
If you’re building a software product for an audience who will be the difference between success and failure for you/your organisation, you need to build enough reliability and recoverability into the product. There may also be regulatory/mandatory stipulations on recoverability, reliability and contingency.
How many times have you read about Facebook or Twitter being down? Not frequently, I agree.
Now, how many times have you seen a Facebook or Twitter outage becoming national and international breaking news? Every time, you’d agree.
The fact that these social networks don’t experience frequent outages points to the level of reliability built into their systems. And when they do experience one, they’re usually back up in a matter of minutes or, at worst, hours. That points to recoverability. Still, even a short outage can cause long-lasting damage to reputation.
When your product is more tangibly important than a social network, such as a bank’s internet banking service on a long holiday weekend or a mobile network service, recovery testing matters even more.
You need to review system-specific recovery testing needs every time you have a release or upgrade. And your organisation should conduct BCP drills at least twice a year, including recovery testing, to make sure everyone involved understands each other’s roles and responsibilities, and critical business functions are able to run normally when there is a failure or disaster.
There can be exceptions, say if your product is still in beta. The idea of a beta release is to give customers and users access to your product so that their direct feedback improves its usability, while also flushing out previously undiscovered bugs. Beta release and testing is a beast of its own, so we’ll leave it at that for now.
It depends – on the system being tested.
In principle, recovery testing should check that when a failure or crash occurs, the system is up and running again in the shortest possible time. For example, take the case of an OS upgrade, say from Windows 7 to Windows 10. What would happen if I rebooted my PC midway through installation? Recovery Testing covers such scenarios to assess whether, and how quickly, the PC can resume the upgrade, and how to keep data and time loss to a minimum.
Another example could be to overload an internet banking site with traffic so it crashes, and check how much time it takes for the servers to recover to normal operations.
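As a rough sketch, the measurement in this example could be automated by polling a health check after the forced failure and timing how long it takes to report healthy again. The function name `measure_recovery_time` and the `probe` callback are hypothetical, and the fake clock below only simulates a system that recovers after three seconds so that the example runs on its own. In a real test, `probe` would call whatever health check your system actually exposes.

```python
import time
from typing import Callable, Optional


def measure_recovery_time(
    probe: Callable[[], bool],
    timeout_s: float = 60.0,
    interval_s: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll `probe` until it reports healthy.

    Returns the elapsed seconds, or None if the system has not
    recovered within `timeout_s`.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)


class FakeClock:
    """Simulated clock so the sketch runs instantly (illustration only)."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


fake = FakeClock()


def fake_probe() -> bool:
    # Hypothetical system that becomes healthy 3 simulated seconds after the crash.
    return fake.time() >= 3.0


recovery = measure_recovery_time(
    fake_probe, timeout_s=10.0, interval_s=0.5, clock=fake.time, sleep=fake.sleep
)
print(f"Recovered in {recovery:.1f} simulated seconds")  # Recovered in 3.0 simulated seconds
```

The polling interval limits the precision of the result, so choose an interval well below the recovery time you expect to measure.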
The idea is to measure the length of time it takes for the system to resume normal operations, and the percentage of disruptive scenarios that it can recover from. The key questions to answer are: can the system recover all lost data, can it come back online quickly enough, and can users reconnect successfully?
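To make those measurements concrete, here is a minimal sketch of how the results of a recovery drill might be summarised. The `DrillResult` structure, the `summarise` function and the scenario figures are all made up for illustration and are not taken from a real drill.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class DrillResult:
    scenario: str
    recovery_seconds: Optional[float]  # None means the system never recovered
    data_restored: bool                # True if all lost data came back


def summarise(results: List[DrillResult]) -> Dict[str, Optional[float]]:
    """Summarise a drill: share of scenarios recovered, mean time, data recovery."""
    if not results:
        raise ValueError("no drill results to summarise")
    times = [r.recovery_seconds for r in results if r.recovery_seconds is not None]
    return {
        "recovery_rate_pct": 100.0 * len(times) / len(results),
        "mean_recovery_seconds": mean(times) if times else None,
        "full_data_recovery_pct": 100.0 * sum(r.data_restored for r in results) / len(results),
    }


# Illustrative figures only.
drill = [
    DrillResult("reboot midway through upgrade", 120.0, True),
    DrillResult("traffic overload", 180.0, True),
    DrillResult("primary database lost", None, False),
    DrillResult("network partition", 60.0, False),
]

summary = summarise(drill)
print(summary)
# {'recovery_rate_pct': 75.0, 'mean_recovery_seconds': 120.0, 'full_data_recovery_pct': 50.0}
```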
Some common terms you will hear when discussing reliability and recovery testing:
- Mean Time to Recover (MTTR): this metric is interpreted differently by different people, so it is important to agree what it means up front with your Service Management team. MTTR can refer to the average target time for full system recovery; alternatively, it could be interpreted as Mean Time to Respond, i.e., the amount of time it takes for the service management team to respond to a reported issue. There could be others such as Mean Time to Replace, Repair etc. but you get the point.
- Redundancy: this is one way to reduce MTTR, or to avoid a visible outage altogether. For instance, it is common (in fact, mandatory in some cases) practice to have backup servers so that when your primary servers experience an outage, the service can fail over to the redundant backup machines.
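The difference between two of the MTTR readings above shows up clearly once you compute them from the same incident log. The sketch below assumes each incident records when it was reported, acknowledged and restored. The field names and timestamps are hypothetical, so substitute whatever your Service Management tooling actually records.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident log: the timestamps are illustrative only.
incidents = [
    {
        "reported": datetime(2024, 1, 5, 9, 0),
        "acknowledged": datetime(2024, 1, 5, 9, 4),
        "restored": datetime(2024, 1, 5, 9, 30),
    },
    {
        "reported": datetime(2024, 2, 11, 14, 0),
        "acknowledged": datetime(2024, 2, 11, 14, 10),
        "restored": datetime(2024, 2, 11, 15, 0),
    },
    {
        "reported": datetime(2024, 3, 2, 22, 0),
        "acknowledged": datetime(2024, 3, 2, 22, 1),
        "restored": datetime(2024, 3, 2, 22, 15),
    },
]


def mean_minutes(log, start_key, end_key):
    """Average gap in minutes between two timestamps across all incidents."""
    return mean(
        [(item[end_key] - item[start_key]).total_seconds() / 60 for item in log]
    )


mean_time_to_respond = mean_minutes(incidents, "reported", "acknowledged")
mean_time_to_recover = mean_minutes(incidents, "reported", "restored")

print(f"Mean Time to Respond: {mean_time_to_respond:.1f} min")  # 5.0 min
print(f"Mean Time to Recover: {mean_time_to_recover:.1f} min")  # 35.0 min
```

The same log gives 5 minutes under one definition and 35 under the other, which is why the meaning needs to be agreed up front.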
It depends – on your business, your customers, your budget etc.
Let me explain.
The business criticality of the functions your product offers is usually the number one determinant of how much emphasis you should place on recoverability.
Are you offering a beta version of a collaboration app to get market feedback, while you improve its performance, usability, reliability and recoverability? Or are you BlackBerry, which enables on-the-go, reliable and encrypted communication for corporates, with emphasis on safety of information exchanged, reliability and recoverability?
When you’re in beta, it’s easy to signpost that your product isn’t going to be as stable, and that occasional loss of access or even data is part of the sign-up. But when you’re BlackBerry, you need your system to deliver every time. Recoverability, as a result, becomes paramount.
It’s not uncommon to see reliability requirements for corporate email or internet banking services stipulate that they need the system to fail less than 1% of the time, and when there is a failure, to be able to recover to business as usual in a matter of minutes (and rarely, hours).
So for an internet banking service, the expected MTTR could be as low as two or three minutes, whereas an internal employee e-learning system could accept an MTTR of a few days or even weeks.
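To see what such requirements mean in practice, here is a quick worked calculation. It assumes that "fail less than 1% of the time" is read as "unavailable for less than 1% of a 30-day month", which is only one possible interpretation. The function name `downtime_budget_minutes` is hypothetical.

```python
def downtime_budget_minutes(max_unavailable_pct: float, period_days: int = 30) -> float:
    """Minutes of downtime allowed in the period under the stated assumption."""
    minutes_in_period = period_days * 24 * 60
    return minutes_in_period * max_unavailable_pct / 100


budget = downtime_budget_minutes(1.0)  # 1% of a 30-day month
mttr_minutes = 3                       # the internet banking example above
outages_within_budget = budget / mttr_minutes

print(f"Downtime budget: {budget:.0f} minutes per 30 days")           # 432 minutes
print(f"Outages of {mttr_minutes} minutes that fit: {outages_within_budget:.0f}")  # 144
```

A 1% budget is far more generous than it sounds, so a requirement like this is usually only meaningful when it is paired with a limit on how long any single outage may last.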
So, it depends.
As mentioned earlier, Recovery Testing is a part of BCP, and as such involves a host of roles. Specific to recovery testing, this will involve everyone needed to help the IT and technology systems get back up. This includes (but is not limited to):
- IT operations teams that manage servers and other hardware infrastructure,
- Technical SMEs who understand the core systems supported by the hardware,
- Production Support and Service Management teams that are skilled and experienced in managing such outages, and
- Key business and IT stakeholders (like CIO) that are responsible for impacted critical business functions.
The wider BCP test will involve other parties – like the BCP Manager, IT Disaster Recovery Manager, Incident Management and Operations teams, among others.
Recovery Testing and, by extension, Business Continuity Planning are critical exercises that your team or organisation needs to perform regularly. How frequently your company tests Recoverability directly affects how prepared you will be in the event of a disaster.
The repercussions of failure are vast – financially as well as for the lives affected by an outage or failure.
When you lead any project that overhauls business processes or technology, be sure to review the need for Recovery Testing.