What makes up an RTO?
Recovery Time Objective (RTO) and Recovery Point Objective (RPO) are the most important requirements when designing an HA or DR solution. High Availability (HA) has a very low RTO and RPO, is automatic, and is usually achieved by running multiple systems concurrently. Disaster Recovery (DR) is normally a longer, manual process that is often triggered when a data center has had a disaster. In this article I will go through the timeline that needs to be considered for an RTO. I will not go into RPO details here, but I will probably write a follow-up article if it is requested.
My rough steps for an RTO timeline.
Scenario: The event assumes that the engineer does not need support from the vendor to fix the problem. The issue involves the engineer logging on to a system and taking a corrective action, e.g. restarting a VM or clearing a cache. The corrective change requires management approval as it may interrupt another service.
no | Phase | Action | Description | Max Time between steps |
---|---|---|---|---|
1 | Mean time to detect MTTD | Problem Occurs | An outage event has occurred. | n/a |
2 | Mean time to detect MTTD | Problem Detected | The monitoring solution has detected the problem, or it was discovered by a user | 30 seconds |
3 | Mean time to detect MTTD | Alert Sent | Alert is sent to the Operations Team (L2) | 1 second |
4 | Mean time to identify MTTI | Alert Read | Alert is read by the Operations Team (L2) | 10 minutes |
5 | Mean time to identify MTTI | Decision: Callout | The Operations Team (L2) decides if they need to call an engineer (L3). | 10 minutes |
6 | Mean time to know MTTK | Engineer Allocated | Engineer contact details are retrieved or pager duty is used to contact the engineer | 10 minutes |
7 | Mean time to know MTTK | Engineer Alert Sent | Alert is sent to the Engineer (L3) | 30 seconds |
8 | Mean time to know MTTK | Engineer Alert Read | Alert is read by the Engineer (L3) | 30 minutes |
9 | Mean time to know MTTK | Engineer Logs on to the network | This may be 2am, so includes the engineer waking up. | 30 minutes |
10 | Mean time to know MTTK | Engineer Starts investigating | Engineer reviews the ticket | 5 minutes |
11 | Mean time to know MTTK | Engineer Determines Issue | Engineer understands the problem | Unable to predict |
12 | Mean time to repair MTTR | Engineer Calls out Management for approval | In order to fix the existing issue, a VM must be restarted, which would impact another critical system for a controlled period of time. | 5 minutes |
13 | Mean time to repair MTTR | Management Alert Sent | Alert is sent to the Manager | 30 seconds |
14 | Mean time to repair MTTR | Management Alert Read | Alert is delivered to the Manager | 5 minutes |
15 | Mean time to repair MTTR | Management Decision | Manager provides a go/no-go answer to the Engineer | 45 minutes |
16 | Mean time to repair MTTR | Engineer Fixes | Engineer Fixes the problem | Unable to predict |
17 | Mean time to verify MTTV | Engineer Validates | Engineer validates the fix | Unable to predict |
18 | Mean time to verify MTTV | Problem Fixed | Outage is over | n/a |
The "mean time to x" (MTTx) terminology comes from Richard Wilkins.
When I talk to clients, many want a near-zero RTO for DR, but they often do not consider the actions that need to be taken beyond fixing the issue. If they measure RTO as just the engineer fixing the issue (which may be failing over to a second site), then we can usually achieve an RTO of under an hour.
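To put rough numbers on this, the sketch below adds up the maximum times from the table above, grouped by phase. It is only an illustration: the step values are copied from the table, the names `STEPS` and `summarize` are made up for this example, "n/a" is treated as zero, and the three "Unable to predict" steps (11, 16 and 17) are left out of the total rather than guessed.

```python
from collections import defaultdict

# (step number, phase, max seconds since the previous step)
# 0    = "n/a" in the table
# None = "Unable to predict" in the table
STEPS = [
    (1, "MTTD", 0),
    (2, "MTTD", 30),
    (3, "MTTD", 1),
    (4, "MTTI", 10 * 60),
    (5, "MTTI", 10 * 60),
    (6, "MTTK", 10 * 60),
    (7, "MTTK", 30),
    (8, "MTTK", 30 * 60),
    (9, "MTTK", 30 * 60),
    (10, "MTTK", 5 * 60),
    (11, "MTTK", None),
    (12, "MTTR", 5 * 60),
    (13, "MTTR", 30),
    (14, "MTTR", 5 * 60),
    (15, "MTTR", 45 * 60),
    (16, "MTTR", None),
    (17, "MTTV", None),
    (18, "MTTV", 0),
]


def summarize(steps):
    """Return (seconds per phase, list of steps with no predictable time)."""
    by_phase = defaultdict(int)
    unknown = []
    for number, phase, seconds in steps:
        if seconds is None:
            unknown.append(number)
        else:
            by_phase[phase] += seconds
    return dict(by_phase), unknown


by_phase, unknown = summarize(STEPS)
total = sum(by_phase.values())

for phase, seconds in by_phase.items():
    print(f"{phase}: {seconds / 60:.1f} min")
print(f"Predictable overhead: {total // 3600}h {total % 3600 // 60}m {total % 60}s")
print(f"Not included (unable to predict): steps {unknown}")
```

With the table's figures this comes to 2h 31m 31s of worst-case overhead before any time is counted for diagnosing, fixing, or validating the problem, which is why an RTO measured only on the fix itself looks so much better than the end-to-end number.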
The decision to invoke DR should not be taken lightly and is usually a management decision, sometimes going up to the CIO, and this takes time. It is easy to say that a decision should only take five minutes, but if the issue occurs at 2am the manager needs to wake up enough to make a well-thought-out choice, and potentially wake their manager as well.