Create Incident Management Plan

Develop a comprehensive incident management plan including roles, escalation procedures, communication protocols, and response workflows.

Last updated: November 6, 2025

management

Engineering Manager

incident-management

devops

# Create Incident Management Plan

Act as an Engineering Manager creating an incident management plan.

## Incident Management Context

- **Organization Size**: [Team size]
- **System Complexity**: [High/Medium/Low]
- **Service Level Objectives**: [SLOs for your services]
- **Business Impact Tolerance**: [Acceptable downtime/customer impact]

## Incident Management Framework

### 1. Incident Classification

**Severity Levels**:

- **SEV-1 (Critical)**: [Definition] - [Example]
  - Impact: [Service down, data loss, security breach]
  - Response Time: [X minutes]
  - Escalation: [Immediate to CTO/VP]
- **SEV-2 (High)**: [Definition] - [Example]
  - Impact: [Major feature degraded, significant user impact]
  - Response Time: [X minutes]
  - Escalation: [To Engineering Director]
- **SEV-3 (Medium)**: [Definition] - [Example]
  - Impact: [Minor feature degradation, limited user impact]
  - Response Time: [X minutes]
  - Escalation: [To Team Lead]
- **SEV-4 (Low)**: [Definition] - [Example]
  - Impact: [Cosmetic issues, minor bugs]
  - Response Time: [X minutes]
  - Escalation: [Standard support]

### 2. Incident Response Roles

**Incident Commander**:

- Responsibilities: [Coordinate response, make decisions, communicate status]
- Who: [Role/person]
- Escalation Path: [When to escalate]

**On-Call Engineer**:

- Responsibilities: [Initial triage, investigation, mitigation]
- Who: [Rotation schedule]
- Handoff Procedure: [How to hand off]

**Subject Matter Experts (SMEs)**:

- Responsibilities: [Provide expertise, assist with resolution]
- Who: [List SMEs by domain]
- Contact Method: [How to reach]

**Communication Lead**:

- Responsibilities: [Update stakeholders, manage comms]
- Who: [Role/person]
- Communication Channels: [Slack, email, status page]

### 3. Incident Response Workflow

**Phase 1: Detection & Triage (0-15 minutes)**

- [ ] Incident detected via [monitoring/alerts/tickets]
- [ ] On-call engineer notified
- [ ] Initial severity assessment
- [ ] Incident created in [tool]
- [ ] War room/incident channel created

**Phase 2: Investigation (15-60 minutes)**

- [ ] Gather logs and metrics
- [ ] Identify root cause
- [ ] Assess impact scope
- [ ] Document findings
- [ ] Determine mitigation strategy

**Phase 3: Mitigation (Immediate)**

- [ ] Deploy hotfix/workaround
- [ ] Monitor resolution
- [ ] Verify system recovery
- [ ] Validate customer impact resolved

**Phase 4: Post-Incident (After resolution)**

- [ ] Incident resolved
- [ ] Post-incident review scheduled
- [ ] Status page updated
- [ ] Stakeholders notified
- [ ] Documentation updated

### 4. Escalation Procedures

**Escalation Triggers**:

- [ ] Severity upgrade (e.g., SEV-3 → SEV-2)
- [ ] No progress after [X] minutes
- [ ] External dependencies blocked
- [ ] Security concern identified
- [ ] Business impact exceeds threshold

**Escalation Path**:

1. On-Call Engineer → Team Lead
2. Team Lead → Engineering Manager
3. Engineering Manager → Engineering Director
4. Engineering Director → VP/CTO

**Escalation Communication**:

- [ ] Notify escalation contact via [method]
- [ ] Provide incident summary
- [ ] Include current status and blockers
- [ ] Request specific help needed

### 5. Communication Protocols

**Internal Communication**:

- **Incident Channel**: [Slack channel for incident updates]
- **Update Frequency**: [Every X minutes during active incidents]
- **Status Format**: [Template for updates]

**External Communication**:

- **Status Page**: [URL for public status updates]
- **Customer Notifications**: [When/how to notify customers]
- **Executive Updates**: [Format for executive briefings]

**Communication Templates**:

- Initial Incident Alert: [Template]
- Status Update: [Template]
- Resolution Notification: [Template]

### 6. Monitoring & Alerting

**Key Metrics to Monitor**:

- [ ] System uptime/availability
- [ ] Error rates
- [ ] Response times
- [ ] Resource utilization
- [ ] Business metrics (revenue, user activity)

**Alert Configuration**:

- [ ] Alert thresholds defined
- [ ] Alert routing configured
- [ ] On-call schedule configured
- [ ] Escalation policies set

### 7. Tools & Systems

**Incident Management Tools**:

- [ ] [Tool name] - [Purpose]
- [ ] [Tool name] - [Purpose]

**Monitoring Tools**:

- [ ] [Tool name] - [Purpose]

**Communication Tools**:

- [ ] [Tool name] - [Purpose]

### 8. Training & Documentation

**Training Requirements**:

- [ ] New engineers: [Training program]
- [ ] On-call engineers: [On-call training]
- [ ] Incident commanders: [Incident response training]

**Documentation**:

- [ ] Runbooks for common incidents
- [ ] System architecture diagrams
- [ ] Troubleshooting guides
- [ ] Contact lists

## Success Metrics

**Incident Response Metrics**:

- Mean Time to Acknowledge (MTTA): [Target]
- Mean Time to Resolve (MTTR): [Target]
- Incident Frequency: [Target]
- Post-Incident Review Completion: [Target %]
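
To make the MTTA and MTTR targets measurable, it can help to agree on how they are computed. The following is a minimal sketch, not a definitive implementation. It assumes a hypothetical `Incident` record with `detected_at`, `acknowledged_at`, and `resolved_at` timestamps (these field names are illustrative and do not come from any specific tool), and it assumes both metrics are measured from the time of detection. The sample incidents are made-up values used only to show the arithmetic.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record shape; field names are assumptions for illustration.
    severity: str
    detected_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime


def mean_minutes(durations):
    """Average a list of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in durations) / 60


def mtta_minutes(incidents):
    """Mean Time to Acknowledge: detection to acknowledgement (assumed definition)."""
    return mean_minutes([i.acknowledged_at - i.detected_at for i in incidents])


def mttr_minutes(incidents):
    """Mean Time to Resolve: detection to resolution (assumed definition)."""
    return mean_minutes([i.resolved_at - i.detected_at for i in incidents])


# Made-up sample data, only to demonstrate the calculation.
t0 = datetime(2025, 1, 1, 12, 0)
incidents = [
    Incident("SEV-1", t0, t0 + timedelta(minutes=4), t0 + timedelta(minutes=50)),
    Incident("SEV-3", t0, t0 + timedelta(minutes=10), t0 + timedelta(minutes=130)),
]

print(mtta_minutes(incidents))  # 7.0
print(mttr_minutes(incidents))  # 90.0
```

Some teams measure MTTR from acknowledgement or from the start of customer impact instead. Whichever definition is chosen, stating it next to the target keeps the numbers comparable over time.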

