Finding out whether backup and recovery systems work well is more complicated than just knowing how long backups and restores take. Agreeing on a core set of essential metrics is the key to judging whether your system succeeds or needs a redesign.
Here are five metrics every enterprise should gather to ensure that its systems meet the needs of the business.
Storage capacity and usage
Let’s start with a very basic metric: Does your backup system have enough storage capacity to meet your current and future backup and recovery needs? Whether you are talking about a tape library or a storage array, your storage system has a finite amount of capacity, and you need to monitor what that capacity is and what percentage of it you’re using over time.
Failing to monitor it can result in you being forced to make decisions that might go against your company’s policies. For example, the only way to create additional capacity without purchasing more is to delete older backups. It would be a shame if failure to monitor the capacity of your storage system resulted in the inability to meet the retention requirements your company has set.
Cloud-based object storage can help ease this worry because some services offer an essentially unlimited amount of capacity.
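To make the capacity metric concrete, here is a minimal Python sketch of the two numbers worth tracking: the percentage of capacity in use and a rough estimate of how long until the system fills up. The function names, the sample figures, and the assumption that usage grows in a straight line between the first and last samples are all illustrative and not taken from any particular backup product.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class CapacitySample:
    day: int        # days since monitoring began
    used_tb: float  # terabytes in use on that day


def percent_used(used_tb: float, total_tb: float) -> float:
    """Return the share of total capacity in use, as a percentage."""
    return 100.0 * used_tb / total_tb


def days_until_full(samples: Sequence[CapacitySample], total_tb: float) -> Optional[float]:
    """Estimate days until the store is full.

    Assumes linear growth between the first and last samples (a
    simplification). Returns None when usage is flat or shrinking.
    """
    first, last = samples[0], samples[-1]
    elapsed_days = last.day - first.day
    if elapsed_days <= 0:
        return None
    growth_per_day = (last.used_tb - first.used_tb) / elapsed_days
    if growth_per_day <= 0:
        return None
    return (total_tb - last.used_tb) / growth_per_day


# Made-up figures: a 100 TB system that grew from 60 TB to 66 TB in 30 days.
history = [CapacitySample(0, 60.0), CapacitySample(30, 66.0)]
print(f"{percent_used(history[-1].used_tb, 100.0):.1f}% used")
print(f"about {days_until_full(history, 100.0):.0f} days until full")
```

A real system would draw its samples from the backup product’s own reporting and would probably fit a trend across many samples rather than just two.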
Throughput capacity and usage
Every storage system has the ability to accept a certain volume of backups per day, usually measured in megabytes per second or terabytes per hour. You should be aware of this number and make sure you monitor your backup system’s usage of it. Failure to do so can result in backups taking longer and longer and stretching into the workday.
Monitoring the throughput capacity and usage of tape is particularly important, because the throughput of your backups needs to match the rate at which your tape drive can transfer data. Specifically, the throughput that you supply to your tape drive should be more than the tape drive’s minimum speed. Consult the documentation for the drive and the vendor’s support system to find out what the minimum acceptable speed is, and try to reach at least that rate. It is unlikely that you’ll approach the maximum speed of the tape drive, but you should also monitor for that.
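The tape check described above can be sketched in a few lines of Python. The minimum and maximum speeds used here are invented placeholders; the real values have to come from your drive’s documentation or the vendor. The function names are hypothetical as well.

```python
def throughput_mb_per_s(bytes_written: int, seconds: float) -> float:
    """Average throughput of a backup job in megabytes per second (1 MB = 10**6 bytes)."""
    return bytes_written / 1_000_000 / seconds


def classify_tape_feed(observed_mb_s: float, min_mb_s: float, max_mb_s: float) -> str:
    """Compare observed throughput with the drive's minimum and maximum speeds."""
    if observed_mb_s < min_mb_s:
        return "below minimum"  # the drive is being fed too slowly
    if observed_mb_s >= max_mb_s:
        return "at maximum"     # the drive itself is now the limit
    return "ok"


# Placeholder drive speeds, NOT taken from any real data sheet.
DRIVE_MIN_MB_S = 100.0
DRIVE_MAX_MB_S = 300.0

# Made-up job: 540 GB written in one hour.
observed = throughput_mb_per_s(540_000_000_000, 3600)
print(observed, classify_tape_feed(observed, DRIVE_MIN_MB_S, DRIVE_MAX_MB_S))
```

Running the same classification over every night’s jobs shows whether throughput is drifting toward either limit.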
Compute capacity and usage
The capability of your backup system is also driven by the ability of the compute system behind it. If the processing capability of the backup servers or the database behind the backup system is unable to keep up, it can also slow down your backups and result in them bleeding into the workday. You should also monitor the performance of your backup system to see the degree to which this is happening.
Backup window
The previous two metrics are very important because they affect what we call the backup window: the time period during which backups are allowed to run. If you’re using a traditional backup system where there is a significant impact on the performance of your primary systems during backup, you should agree in advance what the backup window is. If you are coming close to filling up the entire window, it’s time to either reevaluate the window or redesign the backup system.
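One way to watch for a filling window is to compute how much of it each night’s backups consume. The Python sketch below assumes you can get start and end times for both the agreed window and the backup run. The 80 percent warning threshold is an arbitrary example, not a recommendation from any standard.

```python
from datetime import datetime


def window_utilization(window_start: datetime, window_end: datetime,
                       backup_start: datetime, backup_end: datetime) -> float:
    """Fraction of the backup window consumed by the backup run (may exceed 1.0)."""
    window_seconds = (window_end - window_start).total_seconds()
    backup_seconds = (backup_end - backup_start).total_seconds()
    return backup_seconds / window_seconds


def overran_window(window_end: datetime, backup_end: datetime) -> bool:
    """True if the backup was still running when the window closed."""
    return backup_end > window_end


# Made-up example: an 8-hour window from 22:00 to 06:00.
w_start = datetime(2024, 1, 1, 22, 0)
w_end = datetime(2024, 1, 2, 6, 0)
b_start = datetime(2024, 1, 1, 22, 0)
b_end = datetime(2024, 1, 2, 5, 12)

WARN_AT = 0.8  # arbitrary illustrative threshold
used = window_utilization(w_start, w_end, b_start, b_end)
print(f"{used:.0%} of window used, overran: {overran_window(w_end, b_end)}")
if used >= WARN_AT:
    print("approaching the limit: reevaluate the window or the design")
```

Tracking this fraction over weeks or months makes the trend visible well before backups actually spill into the workday.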
Companies that use backup techniques that fall into the incremental-forever category (e.g. continuous data protection (CDP), near-CDP, block-level incremental backups, or source deduplication backups) don’t typically have to worry about a backup window. This is because backups run for very short periods of time and transfer a small amount of data, a process which typically has very low performance impact on primary systems. This is why customers using such systems typically perform backups throughout the day, as often as once an hour or even every five minutes. A true CDP system actually runs continuously, transferring each new byte as it’s written.
Recovery point and recovery time reality
No one really cares how long it takes you to back up; they care how long it takes to restore. The recovery time objective (RTO) is the amount of time agreed to by all parties that a restore should take after some kind of incident requiring one. The length of an acceptable RTO for any given company is typically driven by the amount of money it will lose when systems are down. For example, if a company will lose millions of dollars per hour during downtime, it typically wants a very tight RTO. Companies such as financial trading firms, for example, seek to have an RTO as close to zero as possible. Other companies that can tolerate longer periods of computer downtime might have an RTO measured in weeks. The important thing is that the RTO matches the business needs of the company.
There is no need to have a single RTO across the entire company. It is perfectly normal and reasonable to have a tighter RTO for more critical applications, and a more relaxed RTO for the rest of the data center.
Recovery point objective (RPO) is the amount of acceptable data loss after a large incident, measured in time. For example, if we agree that we can lose one hour’s worth of data, we have agreed to a one-hour RPO. Most companies, however, settle on values that are much higher, such as 24 hours or more. This is primarily because the smaller your RPO, the more frequently you must run your backup system. Many companies might want a tighter RPO, but they realize that it’s not possible with their current backup system. As with the RTO, it is perfectly normal to have multiple RPOs throughout the company depending on the criticality of different data sets.
The recovery point reality (RPR) and recovery time reality (RTR) metrics are measured only if a recovery occurs, whether real or via a test. The RTO and RPO are objectives; the RPR and RTR measure the degree to which you met those objectives after a restore. It is important to measure the RTR and RPR and compare them against the RTO and RPO to evaluate whether you need to consider a redesign of your backup-and-recovery system.
The reality is that most companies’ RTR and RPR are nowhere near the agreed-upon RTO and RPO for their company. What’s important is to bring this reality to light and acknowledge it. Either we adjust the RTO and RPO, or we redesign the backup system. There is no point in having a tight RTO or RPO if the RTR and RPR are completely different.
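Comparing reality against objectives after a restore test can be as simple as the Python sketch below. It assumes that the recovery time reality is the elapsed time from the incident to the end of the restore, and that the recovery point reality is the gap between the last good backup and the incident. Those working definitions, along with every name and figure in the example, are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RestoreTest:
    incident_at: datetime          # when the outage (real or simulated) began
    last_good_backup_at: datetime  # most recent backup that restored cleanly
    restore_finished_at: datetime  # when the system was usable again


def recovery_time_reality(test: RestoreTest) -> timedelta:
    """Assumed definition: time from the incident until the restore completed."""
    return test.restore_finished_at - test.incident_at


def recovery_point_reality(test: RestoreTest) -> timedelta:
    """Assumed definition: amount of data lost, measured in time."""
    return test.incident_at - test.last_good_backup_at


def evaluate(test: RestoreTest, rto: timedelta, rpo: timedelta) -> dict:
    """Compare measured reality with the agreed-upon objectives."""
    rtr = recovery_time_reality(test)
    rpr = recovery_point_reality(test)
    return {"rtr": rtr, "rpr": rpr, "met_rto": rtr <= rto, "met_rpo": rpr <= rpo}


# Made-up restore test: last backup at 03:00, incident at 09:00, restored by 17:00.
test = RestoreTest(
    incident_at=datetime(2024, 1, 2, 9, 0),
    last_good_backup_at=datetime(2024, 1, 2, 3, 0),
    restore_finished_at=datetime(2024, 1, 2, 17, 0),
)
print(evaluate(test, rto=timedelta(hours=4), rpo=timedelta(hours=24)))
```

In this invented case the RPO is met but the RTO is missed by four hours, which is exactly the kind of gap that should prompt either an adjusted objective or a redesign.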
What to do with metrics
One of the ways that you can increase confidence in your backup system is to document and publish all the metrics mentioned here. Let your management know the degree to which your backup system is performing as designed. Let them know, based on the current growth rate, how long it will be before they need to buy additional capacity. And above all, make sure that they are aware of your backup and recovery system’s ability to meet your agreed-upon RTO and RPO. Hiding this fact will do no one any good if there is an outage.
This story, “5 metrics you need to know about your backup and recovery system” was originally published by