What goes into a Runbook or SOP?
I spend a lot of time advising clients on what they need to keep their system operational. The topic of a RunBook / Standard Operating Procedure (SOP) is always interesting. In my experience only 50-60% of clients have one, 30% of clients know what should go into one but have not yet written it and 10% of clients don’t think they need it.
A run book has one simple objective.
The manual that includes all relevant technical and contact information on how to debug and arrive to a solution in one place.
This is what the new SREs follow to fix problems, this is used at 2am to ensure experienced SREs don’t make common mistakes. This should also be seen as hand over documentation and instructions for training new team members.
Though I describe this as a single document it is frequently multiple documents that sit together. These are living documents and should be regularly updated as changes or new information occurs.
The table below describes the sections that I would expect to see in either a single document or across multiple documents.
Title | Subtitle | Description |
---|---|---|
About This Document | Paragraph explaining the purpose of the document | |
Glossary | Key terms described. Any terms a new team member may not know must be described here | |
Table of Contents | Table of Contents | |
Software Version | Detailed explanation of the software levels installed in each cluster | |
Capacity Planning | What is the current known max capacity of each cluster vs what is required. What is the plan to grow the environment | |
System Overview | ||
Infrastructure Design | Link to the detail design of each cluster | |
Ownership Map | Description of who owns each part of the solution and the areas that it connects to, such as Load Balancers, LDAPs, VMWare, OCP | |
Security and Access Control | ||
API Connect Roles and Responsibilities | Descriptions of the roles used inside the product | |
On boarding | The process to add users to each cluster | |
On Boarding -> Tick List | Check box for what needs to be done to on board a user, i.e. create user in ldap, add to groups xyz, | |
Jump servers | Details of jump servers if required for each environment | |
VMWare / OCP | Details of VMWARE or OCP | |
How to Access each Component | Step by step instructions on how to access each component in the cluster | |
System Configuration | ||
Project Installation Guide | Instructions on how each cluster was installed. This should be used to reinstall each cluster if it is needed | |
System Backup and Restore | ||
Back Up Strategy | Description on how often backups should be taken, to where. | |
Back Up Host | Details of the backup target | |
Backup Requirements | How often should back ups be taken. What format | |
Backup Procedures | Procedure for doing a back up manually. This is essential encase the automatic backup fails. | |
Restore Procedures | Procedure for doing a restore | |
Monitoring and Alerting | ||
Monitoring Project 1: Splunk | How to access Logging/Monitoring services and common queries | |
Alerting | What is the alerting frame work, what constitutes an alert. | |
Operational Tasks | Standard Tasks required of the system. Please see the full sample at the end of this document. I expect this to be screenshot step by step instructions on how to use the system | |
Troubleshooting | Instructions on how to debug the system when something goes wrong. This will be ever growing as new problems occur. | |
Maintenance Tasks | Common Tasks that are required to keep the system healthy, such as Patching, Certificate Management | |
Failure Scenarios and Recovery Procedures | Steps for completing failover and recovery in the event of a major incident. | |
Contact and Escalation Details | I the reader needs additional support or to escalate to someone how do they do it |
Sample Table of Contents
2. Table of Contents
1. About This Document 2
2. Table of Contents 3
3. Glossary 6
4. Software Version 7
4.1. Project 1 7
4.1.1. Production 7
4.1.2. Non Production 7
4.1.3. Development (Out of scope) 7
4.2. Project 2 7
4.2.1. Production 7
4.2.2. Non Production 7
4.2.3. Infrastructure Test 7
5. Capacity Planning 9
5.1. Current Capacity 9
5.1.1. Project 1 9
5.1.2. Project 2 9
5.2. Capacity Review Strategies 10
5.2.1. Housekeeping 10
6. System Overview 12
6.1. Infrastructure Design 12
6.1.1. Project 1 12
6.1.2. Project 2 12
6.2. Ownership Map 12
6.2.1. Project 1 12
6.2.2. Project 2 12
7. Security and Access Control 13
7.1. API Connect Roles and Responsibilities 13
7.1.1. Project 1 13
7.1.2. Project 2 13
7.2. On boarding 13
7.2.1. Tick List 13
7.3. Jump servers 13
7.4. VMWare 13
7.5. How to Access each Component 14
7.5.1. API Manager 14
7.5.2. API Portal 16
7.5.3. DataPower 17
8. System Configuration 19
8.1. Project 1 Installation Guide 19
8.2. Project 2 19
9. System Backup and Restore 20
9.1. Back Up Strategy 20
9.2. Back Up Host 20
9.3. Backup Requirements 20
9.4. Backup Procedures 20
9.4.1. Project 1 20
9.4.2. Project 2 20
9.5. Restore Procedures 20
9.5.1. Project 1 20
9.5.2. Project 2 20
10. Monitoring and Alerting 21
10.1. Monitoring Project 1: Splunk 21
10.1.1. Servers 21
10.1.2. Queries 21
10.2. Monitoring Project 2: LOGGING2.0 21
10.3. Monitoring Project 2: QRADAR 21
10.4. Tivoli Alerting 21
10.4.1. Project 1 21
10.4.2. Project 2 21
11. Operational Tasks 22
11.1. Deploying Gateway Extension 22
11.1.1. Manually 22
11.1.2. Automated Deployment 23
11.2. Deploying Routing Domain 23
11.3. Deploying Products and APIs 23
11.3.1. Deploying with the Command Line toolkit 23
11.4. Deploying Developer Portal Themes 24
11.4.1. Prerequisites 24
11.4.2. Procedure 24
11.5. Uninstall themes from Portal 25
11.5.1. Prerequisites 25
11.5.2. Procedure 25
11.6. Install additional modules to Portal 26
11.6.1. Prerequisites 26
11.6.2. Procedure 26
11.7. Disable a module in Portal 26
11.7.1. Prerequisites 26
11.7.2. Procedure 27
11.8. Uninstall a disabled module in Portal 27
11.8.1. Prerequisites 27
11.8.2. Procedure 27
11.9. Discovering the API Management Domain in DataPower 28
11.10. Validate the MPGs in the API Connect Domain are up 29
11.11. Impact analyses 30
11.11.1. Discover API dependencies per downstream Service 30
11.11.2. Discover downstream Service per API. 31
11.11.3. Discover Applications per API 32
11.11.4. Discover APIs per Application 33
11.12. Procedure to move DataPower server between APIC Gateway Services 33
11.13. Procedure to restart an API Manager Node 35
11.14. Procedure to restart a Developer Portal Node 36
12. Troubleshooting 37
12.1. DataPower 37
12.1.1. Error Codes 37
12.1.2. Log Locations 37
12.2. Developer Portal 37
12.2.1. Log Locations 37
12.2.2. Drupal Site Locations 38
12.3. API Manager 38
12.3.1. Log Files 38
12.3.2. Streaming a log file 39
12.3.3. Downloading log files from the CMC 39
12.3.4. Download log files from the SSH 39
13. Maintenance Tasks 40
13.1. Certificate Management 40
13.1.1. Procedure to update certificates in Back Side TLS Profiles 40
13.1.2. Procedure to update certificates in Front Side TLS Profiles 41
13.2. Patching 41
13.2.1. Normal Cycle 41
13.2.2. Zero-Day Vulnerabilities 42
13.2.3. Project 1 42
13.2.4. Project 2 42
13.3. V4 to V5 Migration - Project 2 Only 42
13.4. Additional Node 42
13.4.1. DataPower 42
13.4.2. API Manager 43
13.4.3. Developer Portal 43
13.5. Removing a Node 43
13.5.1. DataPower 43
13.5.2. API Manager 43
13.5.3. Developer Portal 43
14. Failure Scenarios and Recovery Procedures 45
14.1. API Manager Host Up but runtime is down 45
14.2. Developer Portal Host Up runtime down 45
14.3. Data Centre 1 Outage 46
14.4. Data Centre 2 Outage 51
14.5. API Manager Node Outage Site 1 or 2 55
14.6. Both API Manager Nodes Outage in Site 1 57
14.7. Both API Manager Nodes Outage in Site 2 62
14.8. One or more Developer Portal Node Outage Site 2 or a single Outage in Site 1. 64
14.9. Developer Portal Twp or more Node Outage Site on site 1. 67
14.10. DataPower Node Outage Node or Site 69
14.11. Data Centre 2 Outage 75
14.12. API Manager Node Outage Site 1 or 2 79
14.13. Both API Manager Nodes Outage in Data Centre 1 81
14.14. Both API Manager Nodes Outage in Site 2 85
14.15. One or more Developer Portal Node Outage Site 2 or a single Outage in Site 1. 88
14.16. Developer Portal - two or more Node Outage Site on site 1. 91
14.17. DataPower Node Outage Node or Data Centre 93
14.18. MicroService Layer or DownStream Outage 94
14.19. Up Stream Outage 94
15. Contact Details 96
Thanks to Ricky Moorhouse, Dalli Bagdi and Aiden Gallagher for commenting on this article.