# Create Incident Management Plan
Act as an Engineering Manager creating an incident management plan.
## Incident Management Context
- **Organization Size**: [Team size]
- **System Complexity**: [High/Medium/Low]
- **Service Level Objectives**: [SLOs for your services]
- **Business Impact Tolerance**: [Acceptable downtime/customer impact]
## Incident Management Framework
### 1. Incident Classification
**Severity Levels**:
- **SEV-1 (Critical)**: [Definition] - [Example]
  - Impact: [Service down, data loss, security breach]
  - Response Time: [X minutes]
  - Escalation: [Immediate to CTO/VP]
- **SEV-2 (High)**: [Definition] - [Example]
  - Impact: [Major feature degraded, significant user impact]
  - Response Time: [X minutes]
  - Escalation: [To Engineering Director]
- **SEV-3 (Medium)**: [Definition] - [Example]
  - Impact: [Minor feature degradation, limited user impact]
  - Response Time: [X minutes]
  - Escalation: [To Team Lead]
- **SEV-4 (Low)**: [Definition] - [Example]
  - Impact: [Cosmetic issues, minor bugs]
  - Response Time: [X minutes]
  - Escalation: [Standard support]
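The severity table can also be captured as data so that tooling and people apply the same rules. The following is a minimal sketch, not a prescribed implementation: the `Severity`, `SeverityPolicy`, and `classify` names are hypothetical, and the response times are illustrative stand-ins for the `[X minutes]` placeholders, which you should replace with your own targets.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # Critical
    SEV2 = 2  # High
    SEV3 = 3  # Medium
    SEV4 = 4  # Low


@dataclass(frozen=True)
class SeverityPolicy:
    label: str
    response_minutes: int  # assumed value; the template leaves this as [X minutes]
    escalate_to: str


# Escalation targets follow the list above; response times are assumptions.
POLICIES = {
    Severity.SEV1: SeverityPolicy("Critical", 5, "CTO/VP"),
    Severity.SEV2: SeverityPolicy("High", 15, "Engineering Director"),
    Severity.SEV3: SeverityPolicy("Medium", 60, "Team Lead"),
    Severity.SEV4: SeverityPolicy("Low", 480, "Standard support"),
}


def classify(*, service_down: bool = False, data_loss: bool = False,
             security_breach: bool = False, major_feature_degraded: bool = False,
             minor_feature_degraded: bool = False) -> Severity:
    """Map observed impact to a severity, checking the worst impact first."""
    if service_down or data_loss or security_breach:
        return Severity.SEV1
    if major_feature_degraded:
        return Severity.SEV2
    if minor_feature_degraded:
        return Severity.SEV3
    return Severity.SEV4
```

For example, `POLICIES[classify(major_feature_degraded=True)].escalate_to` yields the escalation target for a SEV-2 incident.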
### 2. Incident Response Roles
**Incident Commander**:
- Responsibilities: [Coordinate response, make decisions, communicate status]
- Who: [Role/person]
- Escalation Path: [When to escalate]
**On-Call Engineer**:
- Responsibilities: [Initial triage, investigation, mitigation]
- Who: [Rotation schedule]
- Handoff Procedure: [How to hand off]
**Subject Matter Experts (SMEs)**:
- Responsibilities: [Provide expertise, assist with resolution]
- Who: [List SMEs by domain]
- Contact Method: [How to reach]
**Communication Lead**:
- Responsibilities: [Update stakeholders, manage comms]
- Who: [Role/person]
- Communication Channels: [Slack, email, status page]
### 3. Incident Response Workflow
**Phase 1: Detection & Triage (0-15 minutes)**
- [ ] Incident detected via [monitoring/alerts/tickets]
- [ ] On-call engineer notified
- [ ] Initial severity assessment
- [ ] Incident created in [tool]
- [ ] War room/incident channel created
**Phase 2: Investigation (15-60 minutes)**
- [ ] Gather logs and metrics
- [ ] Identify root cause
- [ ] Assess impact scope
- [ ] Document findings
- [ ] Determine mitigation strategy
**Phase 3: Mitigation (Immediate)**
- [ ] Deploy hotfix/workaround
- [ ] Monitor resolution
- [ ] Verify system recovery
- [ ] Validate customer impact resolved
**Phase 4: Post-Incident (After resolution)**
- [ ] Incident marked resolved in [tool]
- [ ] Post-incident review scheduled
- [ ] Status page updated
- [ ] Stakeholders notified
- [ ] Documentation updated
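The four phases above behave like an ordered checklist: an incident sits in the first phase that still has open items. A minimal sketch of that idea follows, assuming a hypothetical `IncidentChecklist` class and abbreviated item names; it is illustrative and not tied to any particular incident tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

# Phase names and items mirror the workflow above (item text abbreviated).
PHASES: Dict[str, List[str]] = {
    "Detection & Triage": ["detected", "on-call notified", "severity assessed",
                           "incident created", "channel created"],
    "Investigation": ["logs gathered", "root cause identified", "impact assessed",
                      "findings documented", "mitigation chosen"],
    "Mitigation": ["fix deployed", "resolution monitored", "recovery verified",
                   "customer impact validated"],
    "Post-Incident": ["resolved", "review scheduled", "status page updated",
                      "stakeholders notified", "documentation updated"],
}


@dataclass
class IncidentChecklist:
    done: Set[Tuple[str, str]] = field(default_factory=set)

    def complete(self, phase: str, item: str) -> None:
        """Mark one checklist item as done, rejecting unknown items."""
        if item not in PHASES.get(phase, []):
            raise ValueError(f"unknown item {item!r} for phase {phase!r}")
        self.done.add((phase, item))

    def current_phase(self) -> str:
        """Return the first phase with an open item, or 'Closed' if none remain."""
        for phase, items in PHASES.items():
            if any((phase, item) not in self.done for item in items):
                return phase
        return "Closed"
```

A strict ordering like this is a simplification; in practice mitigation often starts before investigation is complete, so treat the phase as a reporting aid rather than a gate.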
### 4. Escalation Procedures
**Escalation Triggers**:
- [ ] Severity upgrade (e.g., SEV-3 → SEV-2)
- [ ] No progress after [X] minutes
- [ ] External dependencies blocked
- [ ] Security concern identified
- [ ] Business impact exceeds threshold
**Escalation Path**:
1. On-Call Engineer → Team Lead
2. Team Lead → Engineering Manager
3. Engineering Manager → Engineering Director
4. Engineering Director → VP/CTO
**Escalation Communication**:
- [ ] Notify escalation contact via [method]
- [ ] Provide incident summary
- [ ] Include current status and blockers
- [ ] Request specific help needed
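The triggers, path, and message contents in this section can be sketched in code as follows. This is an illustration under stated assumptions: the function names are hypothetical, and `stall_limit_minutes` stands in for the `[X]` minutes placeholder in the triggers list.

```python
from typing import List, Optional

# Roles in the order given by the escalation path above.
ESCALATION_PATH = [
    "On-Call Engineer",
    "Team Lead",
    "Engineering Manager",
    "Engineering Director",
    "VP/CTO",
]


def next_escalation_contact(current: str) -> Optional[str]:
    """Return the next role in the path, or None when already at the top."""
    idx = ESCALATION_PATH.index(current)  # raises ValueError for unknown roles
    return ESCALATION_PATH[idx + 1] if idx + 1 < len(ESCALATION_PATH) else None


def should_escalate(*, minutes_without_progress: int, stall_limit_minutes: int,
                    severity_upgraded: bool = False, dependency_blocked: bool = False,
                    security_concern: bool = False,
                    impact_over_threshold: bool = False) -> bool:
    """Any single trigger from the list above is enough to escalate."""
    return (
        severity_upgraded
        or minutes_without_progress >= stall_limit_minutes
        or dependency_blocked
        or security_concern
        or impact_over_threshold
    )


def escalation_message(summary: str, status: str, blockers: List[str],
                       help_needed: str) -> str:
    """Build the summary, status, blockers, and request for the escalation contact."""
    return "\n".join([
        f"Summary: {summary}",
        f"Status: {status}",
        f"Blockers: {', '.join(blockers) or 'none'}",
        f"Help needed: {help_needed}",
    ])
```

Keeping the path as an ordered list means a reorganization only requires editing `ESCALATION_PATH`, not the logic around it.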
### 5. Communication Protocols
**Internal Communication**:
- **Incident Channel**: [Slack channel for incident updates]
- **Update Frequency**: [Every X minutes during active incidents]
- **Status Format**: [Template for updates]
**External Communication**:
- **Status Page**: [URL for public status updates]
- **Customer Notifications**: [When/how to notify customers]
- **Executive Updates**: [Format for executive briefings]
**Communication Templates**:
- Initial Incident Alert: [Template]
- Status Update: [Template]
- Resolution Notification: [Template]
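One way to keep status updates consistent is to render them from a fixed template. The sketch below uses Python's standard `string.Template`; the field names and wording are assumptions for illustration and should be replaced with whatever your own template specifies.

```python
from datetime import datetime, timezone
from string import Template
from typing import Optional

# Hypothetical status update format; adjust the fields to match your template.
STATUS_UPDATE = Template(
    "[$severity] Incident $incident_id status update ($timestamp UTC)\n"
    "Status: $status\n"
    "Impact: $impact\n"
    "Next update in $next_update_minutes minutes"
)


def format_status_update(incident_id: str, severity: str, status: str, impact: str,
                         next_update_minutes: int,
                         now: Optional[datetime] = None) -> str:
    """Render a status update; `now` can be injected to make output reproducible."""
    now = now or datetime.now(timezone.utc)
    return STATUS_UPDATE.substitute(
        severity=severity,
        incident_id=incident_id,
        timestamp=now.strftime("%Y-%m-%d %H:%M"),
        status=status,
        impact=impact,
        next_update_minutes=next_update_minutes,
    )
```

`Template.substitute` raises `KeyError` if a field is missing, which helps catch incomplete updates before they are posted.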
### 6. Monitoring & Alerting
**Key Metrics to Monitor**:
- [ ] System uptime/availability
- [ ] Error rates
- [ ] Response times
- [ ] Resource utilization
- [ ] Business metrics (revenue, user activity)
**Alert Configuration**:
- [ ] Alert thresholds defined
- [ ] Alert routing configured
- [ ] On-call schedule configured
- [ ] Escalation policies set
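Alert thresholds and routing can be expressed as a small rule table evaluated against current metric samples. The sketch below is illustrative only: the metric names, threshold values, and routing targets are assumptions, and real monitoring tools provide their own rule formats.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class AlertRule:
    metric: str
    threshold: float
    route_to: str
    fire_when_below: bool = False  # True for metrics where low is bad (availability)


# Hypothetical thresholds covering the metric categories listed above.
RULES = [
    AlertRule("availability_pct", 99.9, "on-call", fire_when_below=True),
    AlertRule("error_rate_pct", 1.0, "on-call"),
    AlertRule("p95_latency_ms", 500.0, "on-call"),
    AlertRule("cpu_utilization_pct", 85.0, "team-channel"),
]


def firing_alerts(rules: List[AlertRule], samples: Dict[str, float]) -> List[str]:
    """Return one 'metric=value -> route' string per breached rule."""
    fired = []
    for rule in rules:
        value = samples.get(rule.metric)
        if value is None:
            continue  # no sample for this metric; nothing to evaluate
        if rule.fire_when_below:
            breached = value < rule.threshold
        else:
            breached = value > rule.threshold
        if breached:
            fired.append(f"{rule.metric}={value} -> {rule.route_to}")
    return fired
```

Silently skipping metrics with no sample keeps the sketch short; a production setup would usually treat missing data as its own alert condition.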
### 7. Tools & Systems
**Incident Management Tools**:
- [ ] [Tool name] - [Purpose]
- [ ] [Tool name] - [Purpose]
**Monitoring Tools**:
- [ ] [Tool name] - [Purpose]
**Communication Tools**:
- [ ] [Tool name] - [Purpose]
### 8. Training & Documentation
**Training Requirements**:
- [ ] New engineers: [Training program]
- [ ] On-call engineers: [On-call training]
- [ ] Incident commanders: [Incident response training]
**Documentation**:
- [ ] Runbooks for common incidents
- [ ] System architecture diagrams
- [ ] Troubleshooting guides
- [ ] Contact lists
## Success Metrics
**Incident Response Metrics**:
- Mean Time to Acknowledge (MTTA): [Target]
- Mean Time to Resolve (MTTR): [Target]
- Incident Frequency: [Target]
- Post-Incident Review Completion: [Target %]
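These metrics can be computed directly from incident timestamps. The sketch below assumes a hypothetical `IncidentRecord` with detection, acknowledgement, and resolution times, and measures MTTR from detection to resolution; some teams measure from acknowledgement instead, so confirm the definition before comparing against targets.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import List


@dataclass(frozen=True)
class IncidentRecord:
    detected_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime
    review_completed: bool


def mtta_minutes(incidents: List[IncidentRecord]) -> float:
    """Mean time from detection to acknowledgement, in minutes."""
    return mean(
        (i.acknowledged_at - i.detected_at).total_seconds() for i in incidents
    ) / 60


def mttr_minutes(incidents: List[IncidentRecord]) -> float:
    """Mean time from detection to resolution, in minutes (assumed definition)."""
    return mean(
        (i.resolved_at - i.detected_at).total_seconds() for i in incidents
    ) / 60


def review_completion_pct(incidents: List[IncidentRecord]) -> float:
    """Share of incidents with a completed post-incident review, as a percentage."""
    return 100 * sum(i.review_completed for i in incidents) / len(incidents)
```

All three functions expect a non-empty list; `statistics.mean` raises `StatisticsError` on empty input, and the completion percentage would divide by zero.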