# Create On-Call Procedures and Playbook
Act as an Engineering Manager and create on-call procedures and a playbook for the team described below.
## On-Call Context
- **Team Size**: [Number of engineers]
- **Rotation Schedule**: [Weekly/Monthly/etc.]
- **Coverage**: [24/7, business hours, etc.]
- **Services Covered**: [List critical services]
## On-Call Framework
### 1. On-Call Rotation
**Rotation Schedule**:
- **Primary On-Call**: [Name] - [Dates]
- **Secondary On-Call**: [Name] - [Dates]
- **Shadow/Backup**: [Name] - [Dates]
**Rotation Rules**:
- [ ] Rotation changes every [day/week]
- [ ] Handoff meeting: [Day/time]
- [ ] Minimum [X] engineers per rotation
- [ ] Backup coverage for vacations
**On-Call Tools**:
- [ ] [Tool] - [Purpose]
- [ ] [Tool] - [Purpose]
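
One way to fill in the rotation schedule is to generate it rather than maintain it by hand. The sketch below assumes a weekly rotation in which the secondary is simply the next engineer in line; the function name `build_rotation`, the engineer names, and the start date are all hypothetical and should be replaced with the team's own values.

```python
from datetime import date, timedelta


def build_rotation(engineers, start, weeks):
    """Return weekly (start, end, primary, secondary) tuples, round-robin."""
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary and secondary")
    schedule = []
    for i in range(weeks):
        shift_start = start + timedelta(weeks=i)
        shift_end = shift_start + timedelta(days=6)
        # Assumption: the secondary is the engineer who is primary next week.
        primary = engineers[i % len(engineers)]
        secondary = engineers[(i + 1) % len(engineers)]
        schedule.append((shift_start, shift_end, primary, secondary))
    return schedule


# Hypothetical team and start date (a Monday).
rotation = build_rotation(["alice", "bob", "carol"], date(2025, 1, 6), 4)
for shift_start, shift_end, primary, secondary in rotation:
    print(f"{shift_start} to {shift_end}: primary={primary}, secondary={secondary}")
```

Making this week's secondary next week's primary means each engineer enters a primary shift already familiar with the current issues. Vacation cover would still need manual swaps on top of the generated schedule.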
---
### 2. On-Call Responsibilities
**Primary Responsibilities**:
- [ ] Respond to alerts within [X] minutes
- [ ] Acknowledge incidents within [X] minutes
- [ ] Investigate and triage incidents
- [ ] Escalate when appropriate
- [ ] Document actions taken
- [ ] Communicate status updates
**Response Time Expectations**:
- **SEV-1**: [X] minutes
- **SEV-2**: [X] minutes
- **SEV-3**: [X] minutes
- **SEV-4**: [X] minutes
**Escalation Criteria**:
- [ ] No progress after [X] minutes
- [ ] Severity upgrade needed
- [ ] External dependencies blocked
- [ ] Business impact exceeds threshold
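
The escalation criteria above can be written as an explicit check so that the decision to escalate does not depend on judgment at 3 a.m. This is a minimal sketch: the `IncidentState` fields, the 30-minute stall limit, and the impact threshold are placeholder assumptions standing in for the `[X]` values in the template.

```python
from dataclasses import dataclass


@dataclass
class IncidentState:
    minutes_without_progress: int
    severity_upgrade_needed: bool
    external_dependency_blocked: bool
    business_impact: float  # unit is up to the team, e.g. affected users


def escalation_reasons(state, max_stall_minutes=30, impact_threshold=10_000.0):
    """Return every escalation criterion the incident currently meets."""
    reasons = []
    if state.minutes_without_progress >= max_stall_minutes:
        reasons.append("no progress")
    if state.severity_upgrade_needed:
        reasons.append("severity upgrade needed")
    if state.external_dependency_blocked:
        reasons.append("external dependency blocked")
    if state.business_impact > impact_threshold:
        reasons.append("business impact exceeds threshold")
    return reasons


# Hypothetical incident: stalled for 45 minutes and waiting on a vendor.
state = IncidentState(45, False, True, 2_500.0)
reasons = escalation_reasons(state)
print("escalate" if reasons else "continue", reasons)
```

Returning the list of matched criteria, rather than a bare boolean, gives the escalation message a ready-made justification.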
---
### 3. Alert Management
**Alert Prioritization**:
- **P0 (Critical)**: [Immediate response]
- **P1 (High)**: [Response within X minutes]
- **P2 (Medium)**: [Response within X minutes]
- **P3 (Low)**: [Response within X minutes]
**Alert Routing**:
- [ ] [Alert type] → [On-call engineer]
- [ ] [Alert type] → [Escalation path]
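
The prioritization and routing rules can be captured as two small lookup tables. In the sketch below, the response windows, alert types, and route names are illustrative assumptions, not recommended values; `route_alert` is a hypothetical helper, not part of any paging tool's API.

```python
from datetime import datetime, timedelta

# Assumed response windows in minutes per priority; replace with the team's targets.
RESPONSE_MINUTES = {"P0": 0, "P1": 15, "P2": 60, "P3": 240}

# Assumed mapping of alert type to destination; unknown types go to the default.
ROUTES = {"database": "primary-oncall", "payments": "payments-escalation"}
DEFAULT_ROUTE = "primary-oncall"


def route_alert(alert_type, priority, fired_at):
    """Return (destination, respond-by deadline) for an alert."""
    if priority not in RESPONSE_MINUTES:
        raise ValueError(f"unknown priority: {priority}")
    target = ROUTES.get(alert_type, DEFAULT_ROUTE)
    respond_by = fired_at + timedelta(minutes=RESPONSE_MINUTES[priority])
    return target, respond_by


target, respond_by = route_alert("payments", "P1", datetime(2025, 1, 6, 9, 0))
print(f"route to {target}, respond by {respond_by:%H:%M}")
```

Keeping the tables as plain data makes them easy to review alongside the playbook whenever routing changes.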
**Alert Fatigue Prevention**:
- [ ] Review alert noise monthly
- [ ] Tune alert thresholds
- [ ] Consolidate similar alerts
- [ ] Remove non-actionable alerts
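
For the monthly noise review, a simple heuristic is to flag alerts that fire often but rarely lead to action. The sketch below assumes each alert event is recorded as a `(name, was_actioned)` pair; the alert names, the minimum count of 10, and the 20% actionable ratio are hypothetical starting points to tune.

```python
from collections import Counter


def noisy_alerts(events, min_count=10, max_actionable_ratio=0.2):
    """Return names of alerts that fired often but were rarely acted on."""
    fired = Counter()
    actioned = Counter()
    for name, was_actioned in events:
        fired[name] += 1
        if was_actioned:
            actioned[name] += 1
    return sorted(
        name
        for name, count in fired.items()
        if count >= min_count and actioned[name] / count <= max_actionable_ratio
    )


# Hypothetical month of alert events.
events = (
    [("disk_usage_warn", False)] * 12
    + [("api_5xx", True)] * 11
    + [("cert_expiry", False)] * 3
)
print(noisy_alerts(events))
```

Alerts on the resulting list are candidates for threshold tuning, consolidation, or removal; the final call stays with the team.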
---
### 4. Handoff Procedures
**End-of-Shift Handoff**:
- [ ] Review active incidents
- [ ] Document ongoing issues
- [ ] Share context with next on-call
- [ ] Update handoff document
**Handoff Template**:
```
## On-Call Handoff - [Date]
### Active Incidents
- [Incident details]
### Ongoing Issues
- [Issue details]
### Recent Changes
- [Deployments, config changes]
### Things to Watch
- [Monitoring, potential issues]
### Notes
- [Additional context]
```
**Handoff Meeting**:
- [ ] Scheduled: [Day/time]
- [ ] Duration: [X] minutes
- [ ] Attendees: [Incoming/outgoing on-call]
- [ ] Format: [Sync/async]
---
### 5. Communication Protocols
**Internal Communication**:
- **Incident Channel**: [Slack channel]
- **Update Frequency**: [Every X minutes]
- **Status Format**: [Template]
**Stakeholder Communication**:
- **Who to Notify**: [List]
- **When to Notify**: [Criteria]
- **How to Notify**: [Method]
**Status Page Updates**:
- [ ] Post an initial update within [X] minutes of incident detection
- [ ] Update every [X] minutes during incident
- [ ] Post-resolution update required
---
### 6. On-Call Tools & Access
**Required Access**:
- [ ] [System/Tool]: [Access level]
- [ ] [System/Tool]: [Access level]
**Monitoring Tools**:
- [ ] [Tool]: [URL/Purpose]
- [ ] [Tool]: [URL/Purpose]
**Incident Management Tools**:
- [ ] [Tool]: [URL/Purpose]
- [ ] [Tool]: [URL/Purpose]
**Documentation**:
- [ ] Runbooks: [Location]
- [ ] Architecture docs: [Location]
- [ ] Troubleshooting guides: [Location]
---
### 7. On-Call Best Practices
**During On-Call**:
- [ ] Stay available and responsive
- [ ] Acknowledge alerts within the response-time targets
- [ ] Document actions taken
- [ ] Communicate status clearly
- [ ] Escalate when stuck
**Time Management**:
- [ ] Respond to critical alerts immediately
- [ ] Batch low-priority alerts
- [ ] Use downtime for documentation
- [ ] Hand off complex issues
**Learning Opportunities**:
- [ ] Review incidents after shift
- [ ] Update runbooks as needed
- [ ] Share learnings with team
- [ ] Improve procedures
---
### 8. On-Call Compensation & Support
**Compensation**:
- [ ] Extra compensation: [Amount/Time off]
- [ ] Time-in-lieu: [Hours]
- [ ] On-call allowance: [Amount]
**Support & Resources**:
- [ ] Engineering manager available: [Hours]
- [ ] Escalation contacts: [List]
- [ ] Technical support: [Availability]
**Wellness**:
- [ ] Maximum consecutive shifts: [Number]
- [ ] Time off after heavy incidents
- [ ] Support for on-call stress
- [ ] Rotation balance
---
### 9. On-Call Training
**New Engineer Onboarding**:
- [ ] Shadow primary on-call: [Duration]
- [ ] Review runbooks
- [ ] Practice incidents
- [ ] Pass on-call test
**Continuous Training**:
- [ ] Monthly on-call review
- [ ] Incident drills
- [ ] Tool training
- [ ] Procedure updates
---
### 10. Metrics & Improvement
**On-Call Metrics**:
- Mean Time to Acknowledge (MTTA)
- Mean Time to Resolve (MTTR)
- On-call workload distribution
- Alert response rate (share of alerts acknowledged within SLA)
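
MTTA and MTTR are straightforward to compute once each incident records when it opened, was acknowledged, and was resolved. The sketch below assumes those three timestamps are available under the hypothetical keys `opened`, `acknowledged`, and `resolved`, and measures MTTR from open to resolution; some teams measure from acknowledgement instead.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records; in practice these come from the incident tool.
incidents = [
    {
        "opened": datetime(2025, 1, 6, 10, 0),
        "acknowledged": datetime(2025, 1, 6, 10, 4),
        "resolved": datetime(2025, 1, 6, 11, 0),
    },
    {
        "opened": datetime(2025, 1, 7, 2, 30),
        "acknowledged": datetime(2025, 1, 7, 2, 40),
        "resolved": datetime(2025, 1, 7, 3, 0),
    },
]


def mean_minutes(records, start_key, end_key):
    """Average elapsed minutes between two timestamps across incidents."""
    return mean(
        (r[end_key] - r[start_key]).total_seconds() / 60 for r in records
    )


mtta = mean_minutes(incidents, "opened", "acknowledged")
mttr = mean_minutes(incidents, "opened", "resolved")
print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")
```

A mean hides outliers, so a single long incident can dominate the figure; reporting the median next to it is a common complement.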
**Regular Reviews**:
- [ ] Monthly on-call retrospective
- [ ] Review alert noise
- [ ] Update procedures
- [ ] Share learnings
## Success Criteria
**On-Call Effectiveness**:
- [ ] All alerts acknowledged within SLA
- [ ] Incidents resolved within target time
- [ ] Team satisfaction with on-call at or above [target]
- [ ] Continuous improvement demonstrated