Why Should I Care about Recovery Point Objective (RPO) Assurance?

Recovery Point Objectives (RPO) are incredibly important in your Disaster Recovery planning. Below I present ways you can assure that this critical Service Level Agreement (SLA) is being met today, tomorrow and on that fateful day when some unexpected event comes calling. My favorite definition of RPO comes from the IT Infrastructure Library (ITIL):

The maximum amount of data that may be lost when service is restored after an interruption. The RPO is expressed as a length of time before the failure. For example, an RPO of one day may be supported by daily backups, and up to 24 hours of data may be lost.

The RPO SLA is most important in relation to backup and replication solutions. The actual recovery point achievable in the event of an incident varies over time. For example, replication streams may not keep up with periods of heavy workload activity or may be held up by network congestion. Backup jobs might also fail due to media errors. In some cases these issues can push your actual recovery point beyond the SLA agreed upon with your business stakeholders, and if disaster strikes you will not be able to recover as much data as promised.
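To make the ITIL example concrete, here is a minimal sketch of how a failed backup job stretches the achievable recovery point past a one-day RPO. The function names and the timestamps are hypothetical and chosen only for illustration; they are not part of any Neverfail or VMware API.

```python
from datetime import datetime, timedelta


def recovery_point_exposure(last_good_copy: datetime, now: datetime) -> timedelta:
    """Age of the newest usable copy: data written after it would be lost."""
    return now - last_good_copy


def meets_rpo(last_good_copy: datetime, now: datetime, rpo: timedelta) -> bool:
    """True if a failure at `now` would lose no more data than the RPO allows."""
    return recovery_point_exposure(last_good_copy, now) <= rpo


# Assumed scenario: nightly backups at 01:00 and an RPO of one day.
rpo = timedelta(hours=24)
now = datetime(2014, 3, 3, 9, 0)

# Sunday's backup succeeded, but Monday's 01:00 job failed on a media error.
last_good = datetime(2014, 3, 2, 1, 0)

exposure = recovery_point_exposure(last_good, now)
print(exposure, meets_rpo(last_good, now, rpo))
```

With a single missed job, the newest usable copy is 32 hours old at 09:00 on Monday, so the one-day RPO is already broken even though the backup schedule itself has not changed.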

So how can you tell whether you are exposed to this risk, how often it is happening and which business areas are affected?  Here at Neverfail, we have created a solution that can help. IT Continuity Architect can monitor the achievable recovery points across your infrastructure to detect any drift towards SLA breach and alert you in advance of the risk becoming an issue.  Additionally, it will relate these risks back to your individual business services. Phew! Problem solved. Let’s dig into the details.

Architect can monitor the achievable recovery point, or Recovery Point Estimate (RPE), for vSphere replication and for our own Failover Engine replication. VMware made commodity replication generally available for all virtual machines beginning with vSphere 5.1 and improved it again as part of the 5.5 release. With support for Volume Shadow Copy Service (VSS), this replication mechanism can be used to create application-consistent replicas for production workloads such as Exchange, SQL Server and SharePoint. If you throw in the orchestration capabilities of Site Recovery Manager (SRM), you have a pretty powerful Disaster Recovery solution. Unlike our Failover Engine, which provides continuous replication with near-zero RPO capabilities, vSphere replication can only support a minimum RPO of 15 minutes.

Additionally, our Failover Engine provides replication for both physical and virtual machines, which can also be orchestrated from SRM (but that is another story). If you do choose to use vSphere replication to protect production workloads, and you do want to mitigate the risks highlighted above, you really ought to be monitoring actual recovery points using Architect’s powerful RPO monitoring and SLA management capabilities. Let’s see how that works.

Architect automatically discovers all of your infrastructure, applications and their dependencies (both upstream and downstream), and then helps you arrange these into discrete aggregations which support individual business services. Not all business services are equal – with some being more critical than others – so you will naturally want to protect these with a spectrum of SLAs.  In Architect you can assign a range of “protection tiers” to business services which encode, amongst other things, the RPO SLA that you agreed to with the business stakeholders. Now Architect can automatically see the presence of replication activity and continuously check the Recovery Point Estimate (RPE) against the SLA and advise you of any salient events.  Let’s review a few examples to illustrate.

In the graph below Architect is plotting the movement in RPE on a virtual machine (VM) which has vSphere replication enabled. In this scenario vSphere replication has been configured to support an RPO of 15 minutes (or 900 seconds) which is as low as it can go. You can see how the RPE oscillates over the course of 48 hours as the hypervisor tries to deal with fluctuations in workload and network capacity. Unfortunately, in some cases it can’t cope and the RPE exceeds 15 minutes, which violates the SLA and exposes your business to risk of data loss.

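[Figure: RPE of a virtual machine with vSphere replication enabled, plotted over 48 hours against the 15-minute (900-second) RPO]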
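The breach detection described above can be sketched as a scan over RPE samples. This is an illustrative sketch only: the sample values and the `breach_windows` helper are hypothetical, and nothing here reflects how Architect itself is implemented.

```python
from typing import List, Optional, Tuple

RPO_SECONDS = 900  # the 15-minute RPO from the scenario above


def breach_windows(
    samples: List[Tuple[int, float]], rpo: float
) -> List[Tuple[int, int]]:
    """Return (first, last) sample timestamps of each contiguous run with RPE > rpo."""
    windows: List[Tuple[int, int]] = []
    start: Optional[int] = None
    last: Optional[int] = None
    for ts, rpe in samples:
        if rpe > rpo:
            if start is None:
                start = ts  # a new breach begins at this sample
            last = ts
        elif start is not None:
            windows.append((start, last))  # the breach ended before this sample
            start = None
    if start is not None:
        windows.append((start, last))  # breach still open at the end of the data
    return windows


# Made-up samples: (seconds since start of inspection, RPE in seconds)
samples = [(0, 300), (300, 650), (600, 950), (900, 1100),
           (1200, 400), (1500, 920), (1800, 200)]
print(breach_windows(samples, RPO_SECONDS))  # [(600, 900), (1500, 1500)]
```

Grouping consecutive breaching samples into windows gives a count of distinct incidents and their duration, which is more useful for an SLA report than a raw count of bad samples.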

Fortunately, the VM has been placed in a tier in Architect which also has an RPO SLA of 15 minutes, and the RPE movement is continuously compared to this SLA. Architect will raise an alert if the RPE comes within a configurable tolerance of the SLA. In the portlet below you can see that over the period of inspection the virtual machine’s RPE came within 80% of the SLA on three occasions and within 50% on another occasion. These warning alerts are designed to allow administrators to react, for example by checking network health or other potential root causes behind a struggling replication stream. Because this allows proactive mitigation before the SLA is breached, you can act before your business is exposed to the risk of data loss. For the purposes of illustration I did not intervene in this scenario and allowed the replication stream’s recovery point estimate to degrade beyond the RPO. As you can see below, Architect reacts with a critical alert to advise you of this dangerous situation. At this point, if the primary system is compromised for any reason, data loss beyond the agreed RPO is inevitable and a disappointing conversation with your business stakeholders will be necessary.

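[Figure: Architect alert portlet showing the warning and critical RPO alerts raised for the virtual machine]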
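One way to picture the tolerance-based alerting is the sketch below. It assumes the 50% and 80% figures mean the RPE has reached that fraction of the SLA; that reading, the `classify_rpe` name, and the alert labels are all assumptions made for illustration rather than Architect's actual behavior.

```python
from typing import Sequence


def classify_rpe(
    rpe_seconds: float,
    rpo_seconds: float,
    warn_fractions: Sequence[float] = (0.5, 0.8),
) -> str:
    """Map an RPE reading to 'ok', 'warning-<pct>' or 'critical'.

    warn_fractions are hypothetical, configurable tolerances expressed as
    fractions of the RPO SLA that the RPE has reached.
    """
    if rpe_seconds > rpo_seconds:
        return "critical"  # the SLA is breached
    reached = [f for f in warn_fractions if rpe_seconds >= f * rpo_seconds]
    if reached:
        return f"warning-{round(max(reached) * 100)}"
    return "ok"


rpo = 900  # 15-minute SLA, in seconds
for rpe in (400, 500, 750, 950):
    print(rpe, classify_rpe(rpe, rpo))
```

Here 400 seconds is healthy, 500 and 750 trigger the 50% and 80% warnings respectively, and 950 is a critical breach. Keeping the thresholds as a parameter mirrors the configurable tolerance described in the text.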

Architect makes all of its functionality available within VMware’s vSphere web client. RPO monitoring pulls together a number of views as shown below to include:

  • A snapshot of current infrastructure which is within SLA, at risk of SLA breach or actually in breach of SLA.
  • An historical view of how the infrastructure fared over a period of time in terms of SLA health.
  • A timeline of movements in RPE for individual infrastructure elements.
  • A summary of recent or most important alerts relating to RPO SLAs.
  • The ability to change the window of inspection or focus in on specific business services or infrastructure elements

[Figure: Architect’s RPO monitoring views in the vSphere web client]

In summary, monitoring and management of your RPO SLAs is a hugely important aspect of DR planning. It is particularly relevant to replication technologies where the achievable recovery point fluctuates due to various other events on your IT estate. vSphere replication offers a means to protect your virtual production workloads but needs to be monitored and managed to avoid unseen exposure to risk of data loss. IT Continuity Architect, as a vSphere web client plug-in, offers a powerful means of assurance that your DR plans based on vSphere replication are successful. You can see for yourself with a trial download of IT Continuity Architect which you can get right here.

More Stories By Josh Mazgelis

Josh Mazgelis is senior product marketing manager at Neverfail. He has been working in the storage and disaster recovery industries for close to two decades and brings a wide array of knowledge and insight to any technology conversation.

Prior to joining Neverfail, Josh worked as a product manager and senior support engineer at Computer Associates. Before working at CA, he was a senior systems engineer at technology companies such as XOsoft, Netflix, and Quantum Corporation. Josh graduated from Plymouth State University with a bachelor’s degree in applied computer science and enjoys working with virtualization and disaster recovery.