Incident Management Pack

Pro Workflow

Comprehensive incident management system integrating response protocols, on-call rotations, and data-driven optimization. Covers severity classification, executable runbooks, PagerDuty automation, and structured post-incident analysis.

We built the Incident Management Pack because we're tired of watching engineers guess whether a database outage is a SEV1 or a SEV3. We've seen too many teams wake up to a Slack war zone, shouting over each other while the actual recovery work sits idle because no one knows who's the Incident Commander or where the runbook lives.

Install this skill

npx quanta-skills install incident-management-pack

Requires a Pro subscription. See pricing.

The problem isn't that your engineers don't care. It's that your incident management system is a collection of tribal knowledge, outdated Confluence pages, and severity definitions that everyone interprets differently. You have a PagerDuty rotation, sure. But when the alert fires, the first five minutes are spent arguing about classification instead of executing recovery steps. We created this skill to force structure into that chaos. We wanted a system where severity is objective, runbooks are executable YAML, and post-incident analysis is a structured data process, not a blame game.

If you're still managing incidents with a mix of email chains and a PDF runbook from 2023, you're already losing. The tools exist to automate the heavy lifting, but you need a framework that actually integrates with your SRE workflow. We designed this pack to sit alongside your on-call-pack and runbook-pack to create a cohesive response ecosystem, but the core issue remains: without a canonical severity classification system, your response is always reactive, never proactive.

What a SEV1 Costs in Downtime and Trust

Let's talk about the cost of ambiguity. When severity is subjective, escalation is delayed. A SEV1 that should trigger an immediate war room and executive notification sits in a junior engineer's inbox for twenty minutes while they debate the definition. That delay compounds. Every minute of a SEV1 outage costs you more than just revenue; it costs you customer trust and engineer sanity.

Google's SRE incident management guide [3] emphasizes that effective response requires a clear, end-to-end overview of how teams manage incidents. Without that structure, you get the scenario described in a recent Reddit thread [5] where a MongoDB cluster crisis woke up the on-call engineer three times in one night. That's not just "bad luck." That's a failure of alerting logic, severity classification, and runbook automation. The engineer wasn't recovering the system; they were playing whack-a-mole with poorly defined incidents.

The financial impact is real. A SEV1 outage can burn through thousands of dollars in cloud costs, engineering time, and SLA credits. But the hidden cost is the "toil" that accumulates. When every incident is treated as a unique snowflake, no one learns. You repeat the same mistakes. You fire the same alerts. You burn out your best SREs. According to industry analysis, a runbook guides the management of common tasks, and automating its steps enhances efficiency by ensuring tasks and checks are executed without manual intervention [7]. If you're not automating the response, you're paying for it in human hours.

A Database Outage That Could Have Been Automated

Imagine a fintech company with 200 API endpoints. It's 2:00 AM on a Saturday. A database cluster goes down. The monitoring system fires a PagerDuty alert. The on-call engineer, let's call her Sarah, gets the page.

In the old way, Sarah opens the runbook. It's a Confluence link. She clicks it. The page loads, but the steps are vague: "Check DB health." She checks. It's down. She pages the DBA team. The DBA is asleep. She waits. Twenty minutes pass. The error rate spikes. Customers start churning. Sarah is stressed, guessing, and flying blind. This is the exact scenario that incident-postmortem-pack helps prevent by ensuring that every incident ends with a clear, actionable post-incident report, but the damage is already done.

Now, imagine Sarah has the Incident Management Pack installed. The alert fires. The AI skill triggers immediately. It loads the severity classification reference. It checks the alert context: database cluster down, error rate > 5%, customer impact confirmed. It classifies this as a SEV1 automatically. It updates the incident status in PagerDuty. It triggers the SEV1 runbook template.

The runbook isn't a vague doc. It's a structured YAML file that defines the exact steps: "Kill traffic to DB1," "Failover to DB2," "Verify health." The AI executes these steps, or guides Sarah through them with precise, validated instructions. PagerDuty automation actions update the incident assignment based on the severity. The war room is populated. The DBA is notified with the full context. Sarah isn't guessing; she's executing. The MTTR drops from 45 minutes to 12 minutes. This is the power of runbook automation tools that convert incident response procedures into executable workflows [2].
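Here's a minimal sketch of what such a runbook can look like. The field names and thresholds are illustrative assumptions, not the pack's actual schema; only the recovery steps and the SEV1 classification come from the scenario above.

```yaml
# Illustrative runbook sketch -- field names are hypothetical,
# not the pack's actual template schema.
runbook:
  id: db-cluster-failover
  severity: SEV1
  trigger:
    alert: database_cluster_down
    error_rate_threshold: 0.05   # matches the >5% error-rate condition above
  escalation:
    - role: incident_commander
      notify_after_minutes: 0
    - role: dba_on_call
      notify_after_minutes: 5
  steps:
    - name: kill-traffic-to-db1
      action: "disable traffic routing to DB1"
      verify: "no new connections reach DB1"
    - name: failover-to-db2
      action: "promote DB2 to primary"
      verify: "writes succeed against DB2"
    - name: verify-health
      action: "run health checks against the cluster"
      verify: "error rate back below 1%"
```

The point of this structure is that every step carries its own verification, so the AI (or Sarah) always knows whether to proceed to the next step or escalate.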

This hypothetical illustrates what happens when you stop treating incident management as an art and start treating it as an engineering discipline. It's not about having more people; it's about having better protocols. If you're also looking to integrate this with your security team, check out our incident-response-pack for a framework that covers detection, triage, and containment.

What Changes When the AI Enforces the Protocol

Once you install the Incident Management Pack, the chaos of the war room disappears. You replace subjective judgment with objective validation. Here's what the after-state looks like:

  • Severity is Canonical. No more "Is this SEV1 or SEV2?" The skill loads the severity classification reference and enforces SEV1-SEV4 definitions based on impact, scope, and urgency. It maps to the incident.io SeverityV2 API structure, so your data is portable and standardized. You can even integrate with tools like Harness AI SRE to standardize incident classification further [4]. A sketch of what an objective severity ladder looks like follows this list.
  • Runbooks are Executable. Your runbooks are no longer PDFs. They are structured YAML files that the AI can parse, validate, and execute. The validate-runbook.sh script ensures every runbook has the required fields: severity, escalation path, and step-by-step actions. If a runbook is missing a critical step, the validator catches it before it's deployed. This is the difference between a document and a tool.
  • Post-Incident Reports are Structured. After the fire is out, you don't have a vague meeting where everyone shares their feelings. You have a structured JSON report. The skill generates a blameless post-incident report aligned with Google SRE best practices [3]. It captures the timeline, root cause, and action items. The report-schema.json validator ensures that every report has the required fields: timeline arrays, root_cause objects, and action_items with assignee and due_at dates. No more lost action items. No more "we'll circle back." The data is structured for tracking and optimization.
  • Metrics are Automated. You don't have to manually calculate MTTR or MRR. The analyze-incident.py script fetches incident metadata from PagerDuty, classifies severity, and outputs your metrics. You get data-driven insights into your response times and recurring issues. This is how you move from reactive firefighting to proactive optimization.
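To make the first point concrete, here's a minimal sketch of a severity ladder keyed on impact, scope, and urgency. The definitions and thresholds are hypothetical examples for illustration, not the pack's canonical reference or incident.io's SeverityV2 schema.

```yaml
# Illustrative severity ladder -- definitions are hypothetical examples,
# not the pack's canonical reference.
severities:
  SEV1:
    impact: "customer-facing outage or data loss"
    scope: "majority of users or a critical service"
    urgency: "immediate war room and executive notification"
  SEV2:
    impact: "major feature degraded, workaround exists"
    scope: "significant subset of users"
    urgency: "page on-call, respond within 15 minutes"
  SEV3:
    impact: "minor degradation or elevated error rate"
    scope: "small subset of users"
    urgency: "handle during business hours"
  SEV4:
    impact: "cosmetic issue or internal-only annoyance"
    scope: "no direct customer impact"
    urgency: "ticket it and prioritize in planning"
```

Once the ladder is written down like this, classification becomes a lookup, not a debate.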

If you're interested in how to apply these structured workflows to other areas, check out the automated-crisis-management-pack for building crisis detection criteria, or the ci-cd-complete-pack for integrating incident response into your deployment pipeline.

What's in the Incident Management Pack

This is a multi-file deliverable. Every file is designed to work together to create a complete incident management system. Here's exactly what you get:

  • skill.md — Orchestrator skill definition. Directs the AI to load severity references, apply runbook templates, execute validation scripts, and generate post-incident reports. References all other files by relative path.
  • templates/runbook-template.yaml — Production-grade structured runbook template. Defines severity routing, escalation policies, step-by-step recovery actions, and PagerDuty context variable placeholders.
  • templates/pagerduty-automation.yaml — PagerDuty Automation Actions configuration. Uses real context variables (${incident.id}, ${user.id}) to trigger scripts, update incident status, and manage assignments (a rough sketch follows this list).
  • templates/post-incident-report.json — JSON template for blameless post-incident reports. Structured for timeline reconstruction, root cause analysis, and action item tracking aligned with incident.io.
  • references/severity-classification.md — Canonical severity classification system. Documents SEV1-SEV4 definitions, ranking logic, escalation triggers, and incident.io SeverityV2 API object structure.
  • references/postmortem-practices.md — Google SRE postmortem best practices. Covers blameless culture, timeline reconstruction, root cause probing, and action item lifecycle to prevent recurrence.
  • scripts/validate-runbook.sh — Validator script. Parses runbook YAML, enforces required fields (severity, escalation, steps), and exits non-zero on structural or semantic failures.
  • scripts/analyze-incident.py — Executable data-optimization script. Fetches PagerDuty incident metadata via API, classifies severity based on alert counts/status, and outputs MTTR/MRR metrics.
  • examples/worked-example.yaml — Complete worked example. Demonstrates a SEV1 database outage runbook, corresponding PagerDuty automation config, and post-incident report structure.
  • validators/report-schema.json — JSON Schema validator. Enforces post-incident report structure, requires timeline arrays, root_cause objects, and action_items with assignee/due_at fields.
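As a taste of the automation format, here's a rough sketch in the spirit of templates/pagerduty-automation.yaml. The structure and action names here are assumptions for illustration, not PagerDuty's actual API schema; the ${incident.id} and ${user.id} context variables are the ones the template actually uses.

```yaml
# Illustrative automation-action sketch -- structure and action names
# are hypothetical; only the context variables come from the template.
automation_actions:
  - name: classify-and-assign
    trigger: incident.triggered
    runner: run_script
    script: 'scripts/analyze-incident.py --incident "${incident.id}"'
    on_success:
      - update_incident_status: acknowledged
      - assign_to: "${user.id}"   # current on-call responder
  - name: open-war-room
    trigger: incident.severity == SEV1
    actions:
      - create_slack_channel: "inc-${incident.id}"
      - notify: executive-escalation-policy
```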

This isn't a collection of snippets. It's a system. The skill.md orchestrates the flow, loading the references, applying the templates, and running the validators. The scripts automate the heavy lifting. The examples show you exactly how to use it. We built this so you don't have to figure it out.

If you want to dive deeper into the operational side, check out the prompt-engineering-pack for advanced AI workflows, or the etl-pipeline-pack for handling data pipelines with error handling and monitoring.

Stop Guessing. Start Responding.

You don't need another tool to manage your incidents. You need a system that enforces the protocols you already know you should follow. The Incident Management Pack gives you that system. It replaces ambiguity with validation, guesswork with automation, and blame with data.

Stop spending your nights arguing about severity. Start executing runbooks. Start generating structured post-incident reports. Start optimizing your MTTR.

Upgrade to Pro to install the Incident Management Pack and ship with confidence.

References

  1. SRE Incident Management Practices Using Rootly Automation — rootly.com
  2. Runbook Automation Tools 2026: The Complete Guide to ... — incident.io
  3. Learn SRE Incident Management and Response — sre.google
  4. Configure Incident Types — developer.harness.io
  5. What tools do you use at your org? : r/sre — reddit.com
  6. Automated Incident Management & SRE — Material Plus — materialplus.io

Frequently Asked Questions

How do I install Incident Management Pack?

Run `npx quanta-skills install incident-management-pack` in your terminal. The skill will be installed to ~/.claude/skills/incident-management-pack/ and automatically available in Claude Code, Cursor, Copilot, and other AI coding agents.

Is Incident Management Pack free?

Incident Management Pack is a Pro skill — $29/mo Pro plan. You need a Pro subscription to access this skill. Browse 37,000+ free skills at quantaintelligence.ai/skills.

What AI coding agents work with Incident Management Pack?

Incident Management Pack works with Claude Code, Cursor, GitHub Copilot, Gemini CLI, Windsurf, Warp, and any AI coding agent that reads skill files. Once installed, the agent automatically gains the expertise defined in the skill.