Do you really think a Business Continuity Process (BCP) is that important? Sure, it was something you needed, but it was always the lowest priority. If you had robust and resilient technology infrastructure, did you really need it? After all, if you had done your technology job right:
- Your Highly Available architecture would compensate for individual component failure;
- If that failed you could fail over individual components to the Disaster Recovery (DR) site;
- If that failed you could fail over the entire site to DR.
What could go wrong with that? How many failsafes do you need?
Have you ever sat in a meeting and, when asked about DR and BCP, said that if the three items outlined above all failed then you probably had bigger things to worry about (cue “alien invasion”, “earthquake”, or the like)?
How arrogant! (But we are all overconfident at times.)
You probably made a number of assumptions, such as:
- DR had actually been tested (and passed);
- Recovery Time Objectives (RTO) were reasonable, acceptable and achievable;
- Run Books and Plans were up to date;
- Staff had the appropriate training or knowledge (and were available) and knew the passwords;
- Vendor SLAs and contracts were in place (where required) and up to date;
- You had enough time to cut over (especially relevant for payments when system ‘cut-offs’ are near).
These practices require diligence if they are to be depended upon; otherwise all you are left with is the technology stack. And that’s where things go wrong.
I think there is a general tendency to over-rely on the technologies that are put in place – an over-reliance on the stack – and the stack doesn’t work all that well when the assumptions made above turn out to be incorrect. You discover that you can’t actually get the systems back up and running within the RTO, the firewall rules between Production and DR are different, the run books haven’t been updated since that new database procedure you implemented two months ago, and where the bloody hell did the sysadmin put the recovery passwords?
So what happens when all that good stuff fails and you’re running payment systems? What happens to the property settlements, creditor payments, payroll and so on?
That will never happen, I hear you say.
It does and it will.
In the last five years I’ve heard and read stories of DR databases becoming corrupt because production cluster replication screwed up the only good backup you had.
Stories of DR attempts failing because no one knew the passwords to the databases in DR.
Stories of microcode firmware updates for SAN controllers going bad mid-upgrade, leaving ‘High Availability’ controllers offline and no way to access production data (cue the “I hope DR works” line).
Stories of humans pulling both power supplies from a Fibre Channel SAN during a DR failover attempt, leaving the RAID configuration corrupt and no data at all. Especially serious when the virtual machines running your app/web/database tiers live on that same disk array.
Each of these ‘holy shit’ moments would cause a little fella at the back of your brain to pop out and say … “what’s your BCP, dude?”. Because behind each of them is an outage. And an outage in a customer payment system is measured in dollars not processed per minute. In some instances, when you are out the regulators know about it too (as well as your customers).
Sometimes recovery is okay – if it happens well before currency or regulatory cut-offs you can get stuff back again – but if you are uncertain how long it might take, or if you’re close to a cut-off, then you have to fall back … to what …?
That’s where BCP is actually very important. If you don’t have a robust BCP you will lose customers. Full stop. Guaranteed. Because the day will come when you’ll need it. Maybe not tomorrow … but one day. A good, solid, robust BCP is a great insurance policy and it’s just good business. It isn’t gonna stop the house from burning down, but it will minimise the damage caused.
So what does BCP for a customer payments system look like (typically)?
I could be cynical and suggest that it looks a lot like life before the heavy reliance on computing came along. It’s a process, or set of procedures, last reviewed ten years ago that relies on a lot of dudes with PCs running Excel spreadsheets, text editors, and a crapload of email to move payment instructions around. It has little workflow associated with it, relies largely on legacy systems to do the ‘grunt work’, and is very, very manual. Often humans are literally keying instructions into the legacy core processing systems by hand.
Because of that manual nature there is absolutely no way you can process 100% of your normal workload. Something has to give. So you prioritise, end the day with less than 30% of your normal volume processed, and hope the system will come back up tomorrow (cue the pizzas and the Coca-Cola).
IT DOESN’T HAVE TO BE THAT WAY
I now have experience designing and building two semi- to fully-automated BCP systems for customer payments processing. The idea is to get as close as possible to 100% of the normal volume through.
You can’t do this with just humans.
The key things you need (and I am assuming that your ‘core’ back-end processing systems are still running) are listed below, with a rough sketch of how they fit together after the list:
- An independent technology stack outside of your main input channel;
- A simple, BCP-focussed application that your customers can use to upload payment files or create payments;
- A feedback mechanism that informs your customers and staff;
- A simple workflow (linked to the BCP application) that allows operational staff to manage payments;
- A straight-through processing mechanism to get payments from the BCP application directly into your core processing systems (again linked to the BCP application).
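To make that concrete, here is a minimal sketch (in Python, with entirely hypothetical module, class, and function names) of the shape such a BCP application can take: a customer file upload parsed into payments, a trivial approval workflow for operational staff, and a straight-through handoff stub into core processing. It’s an illustration under those assumptions, not a real payments implementation.

```python
# A minimal, illustrative sketch only -- the field names, CSV layout, and the
# submit_to_core() stub are hypothetical, not a real payments API.
from dataclasses import dataclass
from enum import Enum
from typing import List
import csv
import io


class PaymentStatus(Enum):
    RECEIVED = "received"    # uploaded by the customer via the BCP app
    APPROVED = "approved"    # released by operational staff (workflow step)
    SUBMITTED = "submitted"  # handed to the core processing system
    REJECTED = "rejected"    # failed validation or declined by staff


@dataclass
class Payment:
    customer_id: str
    creditor_account: str
    amount_cents: int
    reference: str
    status: PaymentStatus = PaymentStatus.RECEIVED


def parse_payment_file(customer_id: str, csv_text: str) -> List[Payment]:
    """Parse a simple CSV upload (account,amount,reference) into payments."""
    payments = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        payments.append(Payment(
            customer_id=customer_id,
            creditor_account=row["account"],
            amount_cents=int(round(float(row["amount"]) * 100)),
            reference=row["reference"],
        ))
    return payments


def approve(payment: Payment) -> None:
    """Workflow step: operational staff release a payment for processing."""
    if payment.status is PaymentStatus.RECEIVED:
        payment.status = PaymentStatus.APPROVED


def submit_to_core(payment: Payment) -> None:
    """Straight-through processing stub: in a real BCP stack this would push
    the payment into whatever intake the core system already accepts
    (message queue, API, file drop), then record feedback for customers/staff."""
    if payment.status is PaymentStatus.APPROVED:
        payment.status = PaymentStatus.SUBMITTED


if __name__ == "__main__":
    upload = "account,amount,reference\n123-456,1500.00,INV-0042\n"
    for p in parse_payment_file("CUST001", upload):
        approve(p)
        submit_to_core(p)
        print(p.reference, p.status.value)
```

In a real build the feedback mechanism wraps around these steps (status notifications, a dashboard for customers and staff), and the independent stack means none of this shares infrastructure with your main input channel.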
If you have one of these you are well on your way. If you don’t, then you’d better have a pretty long fire hose and a lot of water.
Good Luck.
Leigh