On-Call Support Best Practices
Comprehensive guide for on-call engineers covering preparation, incident handling, communication, self-care, and continuous improvement.
v3
Last updated: November 6, 2025
management
Engineering Manager
on-call
best-practices
incident-response
Prompt Template
# On-Call Support Best Practices

Act as an on-call engineer following best practices for effective incident response.

## Pre-On-Call Preparation

### Before Your Shift Begins

**Knowledge Preparation**:
- [ ] Review recent incidents and resolutions
- [ ] Review system architecture diagrams
- [ ] Review runbooks for common issues
- [ ] Review recent deployments/changes
- [ ] Review on-call handoff notes

**Tool Preparation**:
- [ ] Verify access to all systems
- [ ] Test monitoring tools
- [ ] Verify alert routing works
- [ ] Test incident management tools
- [ ] Verify communication channels

**Personal Preparation**:
- [ ] Ensure good sleep before shift
- [ ] Have phone/laptop charged
- [ ] Have internet connection ready
- [ ] Set up comfortable workspace
- [ ] Plan for meals/breaks

---

## During On-Call Shift

### Incident Response Workflow

**Step 1: Alert Received**
- [ ] Acknowledge alert immediately
- [ ] Assess severity
- [ ] Create incident ticket/thread
- [ ] Notify team if needed

**Step 2: Initial Triage**
- [ ] Gather basic information
- [ ] Check monitoring dashboards
- [ ] Review recent changes
- [ ] Assess impact scope
- [ ] Determine severity

**Step 3: Investigation**
- [ ] Follow systematic debugging approach
- [ ] Document findings
- [ ] Form hypotheses
- [ ] Test hypotheses
- [ ] Escalate if stuck

**Step 4: Mitigation**
- [ ] Implement fix/workaround
- [ ] Verify resolution
- [ ] Monitor metrics
- [ ] Communicate status

**Step 5: Post-Incident**
- [ ] Document incident
- [ ] Update status page
- [ ] Notify stakeholders
- [ ] Schedule postmortem

---

## Communication Best Practices

### Internal Communication

**Incident Channel Updates**:
- [ ] Create incident thread/channel
- [ ] Provide regular updates ([X] minute intervals)
- [ ] Share findings and progress
- [ ] Ask for help when needed
- [ ] Update status clearly

**Update Template**:
```
[Incident ID] Update [Time]
Status: [Investigating/Mitigating/Resolved]

What I've found:
- [Finding 1]
- [Finding 2]

What I'm doing now:
- [Current action]

Next update: [Time]
```

**Escalation Communication**:
- [ ] Be clear about what you need
- [ ] Provide context and evidence
- [ ] Explain what you've tried
- [ ] Express urgency appropriately

### External Communication

**Customer Communication**:
- [ ] Acknowledge issues promptly
- [ ] Provide regular updates
- [ ] Use customer-friendly language
- [ ] Show empathy
- [ ] Set realistic expectations

**Status Page Updates**:
- [ ] Update within [X] minutes
- [ ] Use clear, non-technical language
- [ ] Provide actionable information
- [ ] Update resolution promptly

---

## Problem-Solving Best Practices

### Systematic Approach

**Follow the Process**:
- [ ] Don't skip steps
- [ ] Document findings
- [ ] Test hypotheses
- [ ] Verify assumptions
- [ ] Don't rush to conclusions

**Stay Organized**:
- [ ] Track investigation steps
- [ ] Document hypotheses tested
- [ ] Keep timeline of events
- [ ] Note what worked/didn't work

**Think Broadly**:
- [ ] Consider all possibilities
- [ ] Check dependencies
- [ ] Look for patterns
- [ ] Consider recent changes

---

## Time Management

### Prioritization

**High Priority**:
- [ ] Critical incidents (SEV-1)
- [ ] Customer-facing issues
- [ ] Security incidents
- [ ] Data loss/corruption

**Medium Priority**:
- [ ] Important but not critical issues
- [ ] Affecting subset of users
- [ ] Has workaround available

**Low Priority**:
- [ ] Minor issues
- [ ] Non-critical bugs
- [ ] Can wait until next shift

### Setting Boundaries

**Reasonable Response Times**:
- [ ] SEV-1: [X] minutes
- [ ] SEV-2: [X] minutes
- [ ] SEV-3: [X] minutes
- [ ] SEV-4: [X] minutes

**When to Escalate**:
- [ ] No progress after [X] minutes
- [ ] Issue exceeds your expertise
- [ ] Need resources you don't have
- [ ] Customer impact severe

---

## Self-Care During On-Call

### Managing Stress

**Stay Calm**:
- [ ] Take deep breaths
- [ ] Don't panic
- [ ] Think systematically
- [ ] Ask for help when needed

**Take Breaks**:
- [ ] Step away if frustrated
- [ ] Take breaks between incidents
- [ ] Maintain regular sleep
- [ ] Eat regular meals

**Set Boundaries**:
- [ ] Don't work 24/7 on-call
- [ ] Hand off appropriately
- [ ] Take time off after heavy shifts
- [ ] Communicate workload concerns

### Maintaining Health

**Physical Health**:
- [ ] Get adequate sleep
- [ ] Eat healthy meals
- [ ] Stay hydrated
- [ ] Exercise regularly

**Mental Health**:
- [ ] Talk to teammates
- [ ] Share experiences
- [ ] Ask for support
- [ ] Take breaks
- [ ] Don't blame yourself

---

## Learning and Improvement

### After Each Incident

**Document Learnings**:
- [ ] What went well?
- [ ] What could be improved?
- [ ] What would you do differently?
- [ ] What tools/processes helped?

**Share Knowledge**:
- [ ] Update runbooks
- [ ] Share solutions with team
- [ ] Contribute to knowledge base
- [ ] Help train others

### Continuous Improvement

**Process Improvement**:
- [ ] Suggest improvements to on-call process
- [ ] Share feedback on tools
- [ ] Recommend runbook updates
- [ ] Suggest alert improvements

**Skill Development**:
- [ ] Learn from incidents
- [ ] Practice debugging skills
- [ ] Study system architecture
- [ ] Attend training sessions

---

## Common Mistakes to Avoid

**Don't**:
- [ ] Panic or rush
- [ ] Skip systematic investigation
- [ ] Make changes without understanding
- [ ] Ignore documentation
- [ ] Work in isolation when stuck
- [ ] Forget to communicate
- [ ] Burn yourself out

**Do**:
- [ ] Follow systematic approach
- [ ] Document everything
- [ ] Ask for help when needed
- [ ] Communicate regularly
- [ ] Take care of yourself
- [ ] Learn from each incident
- [ ] Improve processes

---

## Checklist for On-Call Shift

**Before Shift**:
- [ ] Reviewed handoff notes
- [ ] Verified access to systems
- [ ] Tested monitoring tools
- [ ] Understood current state
- [ ] Prepared personal space

**During Shift**:
- [ ] Respond to alerts promptly
- [ ] Follow investigation process
- [ ] Communicate regularly
- [ ] Document findings
- [ ] Escalate appropriately
- [ ] Take breaks
- [ ] Stay calm

**After Shift**:
- [ ] Document any unresolved issues
- [ ] Update handoff notes
- [ ] Share learnings with team
- [ ] Update runbooks if needed
- [ ] Rest and recover

---

## Success Metrics

**Personal Success**:
- [ ] Responded to all alerts within SLA
- [ ] Resolved incidents effectively
- [ ] Communicated clearly
- [ ] Learned from incidents
- [ ] Maintained work-life balance

**Team Success**:
- [ ] Improved incident response time
- [ ] Reduced repeat incidents
- [ ] Better documentation
- [ ] Improved processes
- [ ] Stronger team collaboration
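
The update template from the Internal Communication section can also be filled in programmatically, which may help keep updates consistent across responders. The following is a minimal Python sketch, not part of the original template: the `IncidentUpdate` type, the `render_update` function, and every value in the usage example are hypothetical names chosen for illustration. Only the field layout and the three status values come from the template itself.

```python
from dataclasses import dataclass, field
from typing import List

# Status values listed in the update template of this guide.
VALID_STATUSES = ("Investigating", "Mitigating", "Resolved")


@dataclass
class IncidentUpdate:
    """One incident-channel update (hypothetical helper type)."""
    incident_id: str
    time: str
    status: str
    findings: List[str] = field(default_factory=list)
    current_actions: List[str] = field(default_factory=list)
    next_update: str = ""


def render_update(update: IncidentUpdate) -> str:
    """Render an update using the field layout of the update template."""
    if update.status not in VALID_STATUSES:
        raise ValueError(
            f"status must be one of {VALID_STATUSES}, got {update.status!r}"
        )
    lines = [
        f"{update.incident_id} Update {update.time}",
        f"Status: {update.status}",
        "",
        "What I've found:",
    ]
    lines += [f"- {finding}" for finding in update.findings]
    lines += ["", "What I'm doing now:"]
    lines += [f"- {action}" for action in update.current_actions]
    lines += ["", f"Next update: {update.next_update}"]
    return "\n".join(lines)


if __name__ == "__main__":
    # All values below are made up for illustration.
    example = IncidentUpdate(
        incident_id="INC-0000",
        time="14:05 UTC",
        status="Investigating",
        findings=["Error rate elevated on the checkout service"],
        current_actions=["Reviewing recent deployments"],
        next_update="14:20 UTC",
    )
    print(render_update(example))
```

Rejecting unknown status values is a design choice in this sketch: it keeps updates limited to the three states the template names, so readers of the channel never have to interpret an ad hoc status.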