Create DevOps Runbook
Develop operational runbooks for common tasks, troubleshooting procedures, and standard operating procedures for DevOps teams.
v3
Last updated: November 6, 2025
management
Engineering Manager
runbook
devops
# Create DevOps Runbook
Act as an Engineering Manager creating a DevOps runbook for operational procedures.
## Runbook Context
- **Service/System**: [Name]
- **Environment**: [Production/Staging/Development]
- **Owner**: [Team/person]
- **Last Updated**: [Date]
## Runbook Structure
### 1. Service Overview
**Purpose**: [What this service does]
**Key Components**:
- [ ] [Component 1]: [Description]
- [ ] [Component 2]: [Description]
- [ ] [Component 3]: [Description]
**Dependencies**:
- [ ] [Dependency 1]: [How it's used]
- [ ] [Dependency 2]: [How it's used]
**Architecture**:
- [Diagram or description]
- [Data flow]
- [Key integrations]
---
### 2. Health Checks
**Service Health Endpoint**:
```
GET /health
Expected Response: {"status": "healthy"}
```
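As one way to automate this check, the sketch below polls the endpoint and accepts only an HTTP 200 whose JSON body is `{"status": "healthy"}`. The function names, the `base_url` parameter, and the timeout are illustrative assumptions, not part of any particular service.

```python
import json
import urllib.request


def is_healthy(status_code: int, body: str) -> bool:
    """Return True only for HTTP 200 with a JSON object whose status is "healthy"."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "healthy"


def check_health(base_url: str, timeout: float = 5.0) -> bool:
    """Call GET <base_url>/health (hypothetical URL layout) and evaluate the reply.

    Network errors, HTTP error responses, and undecodable bodies count as unhealthy.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return is_healthy(resp.status, resp.read().decode("utf-8"))
    except (OSError, ValueError):
        return False
```

Keeping the response evaluation in `is_healthy` separate from the network call makes the rule easy to test without a running service.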
**Key Health Indicators**:
- [ ] API response time: [Target < X ms]
- [ ] Error rate: [Target < X%]
- [ ] Database connectivity: [Check]
- [ ] External dependencies: [Check]
**Monitoring Dashboards**:
- [ ] [Dashboard URL] - [What it shows]
- [ ] [Dashboard URL] - [What it shows]
**Alert Thresholds**:
- [ ] Error rate > [X]%: [Alert]
- [ ] Response time > [X]ms: [Alert]
- [ ] CPU > [X]%: [Alert]
- [ ] Memory > [X]%: [Alert]
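The thresholds above can be expressed as data and checked mechanically. The following is a minimal sketch; the metric names and the numeric limits are placeholders standing in for the `[X]` values, not recommendations.

```python
from typing import Dict, List


def breached_alerts(metrics: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return the sorted names of metrics whose current value exceeds its threshold.

    Metrics with no reported value are skipped rather than treated as breaches.
    """
    return sorted(
        name
        for name, limit in thresholds.items()
        if name in metrics and metrics[name] > limit
    )


# Placeholder limits for illustration only; replace with your own [X] values.
EXAMPLE_THRESHOLDS = {
    "error_rate_pct": 1.0,
    "response_time_ms": 500.0,
    "cpu_pct": 80.0,
    "memory_pct": 85.0,
}

if __name__ == "__main__":
    current = {"error_rate_pct": 2.5, "response_time_ms": 120.0, "cpu_pct": 91.0}
    print(breached_alerts(current, EXAMPLE_THRESHOLDS))
```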
---
### 3. Common Operations
**Deployment Procedure**:
1. [ ] Back up the current version
2. [ ] Run pre-deployment checks
3. [ ] Deploy to staging: [Command]
4. [ ] Verify staging deployment
5. [ ] Deploy to production: [Command]
6. [ ] Monitor deployment metrics
7. [ ] Verify production deployment
**Rollback Procedure**:
1. [ ] Identify the version to roll back to
2. [ ] Execute rollback: [Command]
3. [ ] Verify rollback success
4. [ ] Monitor system health
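The deployment and rollback procedures above can be tied together so that a failed step triggers the rollback automatically. This is a sketch under the assumption that each step is wrapped as a Python callable returning `True` on success; the step names and the `run_with_rollback` helper are hypothetical, and the `[Command]` placeholders would sit inside those callables.

```python
from typing import Callable, List, Tuple

# A step is a (name, action) pair; the action returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_with_rollback(steps: List[Step], rollback: Callable[[], bool]) -> Tuple[bool, List[str]]:
    """Run steps in order; on the first failure, call rollback and stop.

    Returns (succeeded, log), where log records the outcome of each step run.
    """
    log: List[str] = []
    for name, action in steps:
        try:
            ok = action()
        except Exception as exc:  # a raised error counts as a failed step
            ok = False
            log.append(f"{name}: raised {exc!r}")
        else:
            log.append(f"{name}: {'ok' if ok else 'failed'}")
        if not ok:
            rolled_back = rollback()
            log.append(f"rollback: {'ok' if rolled_back else 'failed'}")
            return False, log
    return True, log
```

Returning the log alongside the result gives the on-call engineer a record of which step failed, which is useful for the post-incident update of the runbook.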
**Scaling Procedures**:
- **Scale Up**: [Commands/Steps]
- **Scale Down**: [Commands/Steps]
- **Auto-scaling**: [Configuration]
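When documenting the auto-scaling configuration, it can help to state the scaling rule explicitly. The sketch below assumes a proportional rule (the same shape as the one the Kubernetes Horizontal Pod Autoscaler documents): scale the replica count by the ratio of the observed metric to its target, round up, and clamp to fixed bounds. The function name, parameters, and default bounds are illustrative assumptions.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """Proportional scaling: ceil(current * observed / target), clamped to bounds."""
    if current_replicas < 1 or target_metric <= 0:
        raise ValueError("current_replicas must be >= 1 and target_metric > 0")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # 4 replicas at 90% CPU with a 60% target
    print(desired_replicas(4, 90.0, 60.0))
```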
**Backup Procedures**:
- **Manual Backup**: [Commands]
- **Backup Verification**: [Commands]
- **Backup Restoration**: [Commands]
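One common form of backup verification is comparing the backup file against a checksum recorded when the backup was taken. The sketch below assumes a SHA-256 checksum is stored alongside each backup; the helper names are hypothetical. A matching checksum shows the file is intact, not that it restores cleanly, so a periodic test restore is still needed.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(path: Path, expected_sha256: str) -> bool:
    """True if the file exists, is non-empty, and matches the recorded checksum."""
    if not path.is_file() or path.stat().st_size == 0:
        return False
    return sha256_of(path) == expected_sha256.lower()
```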
---
### 4. Troubleshooting Guide
**Issue: High Error Rate**
- [ ] Check error logs: [Command]
- [ ] Review recent deployments
- [ ] Check dependency health
- [ ] Review system metrics
- [ ] Solution: [Common fixes]
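To make "review recent deployments" concrete, the error rate can be compared before and after a deployment. This is a sketch that assumes HTTP status codes have already been extracted from the logs, counts only 5xx responses as errors, and uses a placeholder tolerance; all names are illustrative.

```python
from typing import Sequence


def error_rate_pct(status_codes: Sequence[int]) -> float:
    """Percentage of responses with a 5xx status; 0.0 for an empty sample."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if code >= 500)
    return 100.0 * errors / len(status_codes)


def regressed_after_deploy(
    before: Sequence[int], after: Sequence[int], tolerance_pct: float = 1.0
) -> bool:
    """True if the error rate rose by more than tolerance_pct percentage points."""
    return error_rate_pct(after) - error_rate_pct(before) > tolerance_pct
```

A `True` result points toward the rollback procedure; a `False` result suggests checking dependency health and system metrics next.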
**Issue: Slow Response Times**
- [ ] Check CPU/Memory usage
- [ ] Review database query performance
- [ ] Check network latency
- [ ] Review recent changes
- [ ] Solution: [Common fixes]
**Issue: Service Unavailable**
- [ ] Check service status: [Command]
- [ ] Review infrastructure status
- [ ] Check logs for errors
- [ ] Verify dependencies
- [ ] Solution: [Common fixes]
**Issue: Database Connection Errors**
- [ ] Check database status
- [ ] Verify connection strings
- [ ] Check network connectivity
- [ ] Review connection pool settings
- [ ] Solution: [Common fixes]
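The network connectivity step can be checked with a plain TCP connection attempt, which separates "the database port is unreachable" from "the credentials or connection string are wrong". A minimal sketch, with the function name and timeout as assumptions:

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and DNS failures
        return False
```

A successful connection only shows that the port is reachable; authentication and connection pool limits still need to be checked at the database level.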
---
### 5. Emergency Procedures
**Service Down - Emergency Response**:
1. [ ] Acknowledge incident
2. [ ] Notify team: [Method]
3. [ ] Check service status: [Command]
4. [ ] Review recent deployments
5. [ ] Execute rollback if needed: [Command]
6. [ ] Monitor recovery
**Data Corruption - Emergency Response**:
1. [ ] Stop data writes
2. [ ] Assess corruption scope
3. [ ] Restore from backup: [Command]
4. [ ] Verify data integrity
5. [ ] Resume operations
**Security Incident - Emergency Response**:
1. [ ] Isolate affected systems
2. [ ] Notify security team
3. [ ] Preserve evidence
4. [ ] Assess impact
5. [ ] Deploy patches if needed
---
### 6. Maintenance Tasks
**Regular Maintenance**:
- **Daily**: [Tasks]
- **Weekly**: [Tasks]
- **Monthly**: [Tasks]
- **Quarterly**: [Tasks]
**Log Rotation**:
- [ ] Configuration: [Location]
- [ ] Retention: [Duration]
- [ ] Rotation: [Frequency]
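A retention setting can be audited by listing log files older than the stated duration. The sketch below assumes retention is measured in days against each file's modification time and that logs sit directly in one directory; it only reports files and deletes nothing. The function name and parameters are hypothetical.

```python
import time
from pathlib import Path
from typing import List, Optional


def files_past_retention(
    log_dir: Path, retention_days: int, now: Optional[float] = None
) -> List[Path]:
    """Return files in log_dir last modified more than retention_days ago, sorted.

    `now` is a Unix timestamp; it defaults to the current time.
    """
    reference = time.time() if now is None else now
    cutoff = reference - retention_days * 86400
    return sorted(
        p for p in log_dir.iterdir() if p.is_file() and p.stat().st_mtime < cutoff
    )
```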
**Certificate Renewal**:
- [ ] Certificates: [List]
- [ ] Renewal process: [Steps]
- [ ] Monitoring: [How to monitor]
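One way to monitor certificates is to compute the days remaining before expiry and alert below a chosen limit. The sketch below uses Python's standard `ssl` module; the function names and the default port are assumptions, and `fetch_not_after` needs network access to the host being checked.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days from `now` until a certificate's notAfter timestamp.

    `not_after` uses the format returned by getpeercert(), for example
    "Jan  1 00:00:00 2030 GMT". A negative result means already expired.
    `now` must be timezone-aware.
    """
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).days


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Open a TLS connection and return the server certificate's notAfter field."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Because the default context verifies the certificate chain, an already expired certificate makes the handshake in `fetch_not_after` fail with an `ssl.SSLError`; treat that as an alert in its own right.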
**Database Maintenance**:
- [ ] Backup verification: [Schedule]
- [ ] Index optimization: [Schedule]
- [ ] Vacuum/cleanup: [Schedule]
---
### 7. Access & Permissions
**Required Access**:
- [ ] [Service/System]: [Access level]
- [ ] [Service/System]: [Access level]
**Access Request Process**:
- [ ] Request via [Method]
- [ ] Approval required from [Role]
- [ ] Access granted via [Method]
**SSH/Remote Access**:
- [ ] Jump host: [Hostname]
- [ ] SSH command: [Command]
- [ ] Key management: [Process]
---
### 8. Documentation & Resources
**Documentation Links**:
- [ ] Architecture docs: [URL]
- [ ] API docs: [URL]
- [ ] Deployment guide: [URL]
**Related Runbooks**:
- [ ] [Related runbook]: [Link]
- [ ] [Related runbook]: [Link]
**Contact Information**:
- [ ] On-call engineer: [Contact]
- [ ] Team lead: [Contact]
- [ ] Escalation: [Contact]
---
## Runbook Best Practices
**Keep Updated**:
- Review quarterly
- Update after incidents
- Update after major changes
**Test Procedures**:
- Test runbook procedures regularly
- Verify commands work
- Update outdated steps
**Clear & Concise**:
- Use step-by-step format
- Include exact commands
- Provide context for decisions