SRE Golden Signals Playbook

Most Teams Measure the Wrong Things

Most teams build observability backwards. They start by installing a vendor's dashboard template, then bolt on whatever metrics their monitoring agent spits out. Within a week, you have a Grafana board with 40 panels showing CPU throttling, disk I/O, and container restart counts. But ask any engineer on call what the system's actual health is, and they have to guess. They don't know whether users are experiencing latency spikes. They don't know if error rates are creeping up against the SLO. They don't know if the database connection pool is saturated.

Install this skill

npx quanta-skills install sre-golden-signals-pack

Requires a Pro subscription. See pricing.

The Google SRE framework established the four golden signals of monitoring precisely because ad-hoc metrics lead to blind spots: latency, traffic, errors, and saturation [1]. If you can measure only four metrics of your user-facing system, these are the ones that tell you whether the service is healthy. Everything else is noise. Yet most engineering teams treat these signals as an afterthought. They instrument their code with custom counters, write PromQL queries that break when labels change, and rely on tribal knowledge to interpret what "normal" looks like. We built this skill so you don't have to do any of that. The SRE Golden Signals Playbook gives you a structured, technical workflow for implementing these four foundational metrics across your entire fleet, aligned with the Google SRE framework and CNCF observability guidance.

When you're designing a new service, the default assumption is often to just "add monitoring." But without a disciplined framework, you end up tracking system-level metrics that have zero correlation with user experience. You track how many requests hit the load balancer, but not how long the user waited for the response. You track how many bytes were written to disk, but not whether the database connection pool is exhausted. The difference between a healthy system and a failing one is often hidden in the tail latencies and saturation points that generic dashboards smooth over. By focusing on the golden signals, you force your team to define what success actually looks like for the end user, rather than just keeping the lights on in the data center.

The Cost of Unstructured Observability

When you ignore the golden signals, you pay for it in three ways: wasted engineering hours, degraded customer trust, and downstream incidents that could have been prevented.

First, the hours. Every time a new service ships, your SREs or backend engineers have to manually define what metrics to track. They write custom instrumentation code, configure the telemetry pipeline, and build dashboards. If you have 50 microservices, that's 50 separate implementations of the same four concepts. You're reinventing the wheel when your developers should be shipping features. If you're already drowning in dashboard sprawl, you might want to check out the Monitoring & Observability Pack to see how a unified stack reduces this overhead. The cognitive load of maintaining 50 different metric definitions is unsustainable. You'll spend more time debugging your monitoring stack than your actual application code.

Second, the customer trust. When you don't track latency percentiles, you miss the p99 tail that affects your enterprise users. When you don't track saturation, you don't see connection pool exhaustion until requests start queuing and timing out. You're flying blind until the pager goes off. The CNCF emphasizes that matching the right tools to the right tasks is critical for Kubernetes observability, and the golden signals are the universal language that every tool needs to speak [3]. Without them, your dashboards are just pretty pictures that don't help you make decisions. When you do monitor HTTP traffic, you need to track all response codes, even the ones that don't immediately indicate a failure, because they often point to upstream degradation [2].

Third, the incidents. Unstructured observability leads to high Mean Time To Detect (MTTD). When an outage hits, you spend the first twenty minutes figuring out what broke because you don't have a standardized view of traffic, errors, and latency. We've seen teams spend hours debugging a database connection leak when a simple saturation metric would have triggered an alert weeks ago. If you're looking to improve your database reliability, the Database Reliability Engineering pack provides a structured workflow for defining reliability requirements and implementing monitoring that catches these issues early. And when the outage does happen, you'll need to run a blameless postmortem to understand what went wrong. The Blameless Incident Postmortem skill provides a structured workflow that aligns with industry standards, ensuring you learn from the failure without burning out your team.

How loveholidays Standardized Observability Across a Service Mesh

Picture a travel-tech company scaling rapidly during peak booking seasons. They have a microservices architecture with hundreds of endpoints, and every service is written in a different language. Some use Java, some use Go, some use Python. Each team has their own way of exporting metrics: some use StatsD, some use custom HTTP endpoints, and some just log to a file.

When a latency spike hits during a flash sale, the on-call engineer has to query three different metric backends to piece together what's happening. They check the Java service's JMX metrics, the Go service's Prometheus endpoint, and the Python service's cloud provider dashboard. By the time they correlate the data, the error rate has already breached the SLO. The team is reacting, not observing.

This exact scenario played out at loveholidays, a major travel platform that needed to standardize observability across their infrastructure. They adopted a service mesh to handle the complexity of routing and telemetry. By implementing the "Golden Signals" (latency, throughput, errors) for HTTP traffic, they achieved a uniform view of service performance regardless of the underlying language or framework [4]. Using the service mesh, they immediately had a standardized set of metrics for all HTTP traffic, which drastically reduced their MTTD and gave them a consistent baseline for performance [5].

We modeled this playbook after that kind of standardization. You don't need a service mesh to benefit from the golden signals, but you do need a consistent way to define, instrument, and query them. The playbook provides that consistency. It gives you a single source of truth for what latency, traffic, errors, and saturation look like in your environment, and it provides the exact OpenTelemetry and Prometheus configurations to capture them. If you're deploying a service mesh yourself, the Service Mesh Implementation skill provides a structured technical workflow for deploying and configuring service meshes, which can be a great complement to this observability playbook.

What Changes Once the Playbook Is Active

Once you install the SRE Golden Signals Playbook, you stop guessing what to measure and start measuring what matters. The skill acts as an orchestrator, guiding your AI coding agent through the exact steps needed to implement production-grade observability.

You get a skill.md file that defines the framework, maps the four golden signals to OpenTelemetry SDK instruments (Counter, Histogram, Gauge), and instructs the agent on how to use the templates, validators, and examples. It's not just documentation; it's an active workflow that ensures every new service you ship includes the right metrics from day one. The agent will automatically suggest the correct instrument types based on the signal. For latency, it will recommend a Histogram to capture the distribution. For traffic, it will use a Counter to track request volumes. For errors, it will use a Counter with an error label. For saturation, it will use a Gauge to track the current utilization of the resource.
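
As a rough illustration, a per-service definition following this mapping might look like the sketch below. The field names here are hypothetical, not necessarily what the shipped schema enforces:

```yaml
# Illustrative golden-signals definition; field names are hypothetical
# and may differ from what the shipped signal-schema.json requires.
service: checkout-api
signals:
  latency:
    instrument: Histogram        # captures the full distribution, not just an average
    metric: http_server_duration_seconds
  traffic:
    instrument: Counter          # monotonically increasing request count
    metric: http_server_requests_total
  errors:
    instrument: Counter          # same counter family, narrowed by an error label
    metric: http_server_requests_total
    labels:
      status_class: "5xx"
  saturation:
    instrument: Gauge            # point-in-time utilization, e.g. a connection pool
    metric: db_pool_utilization_ratio
```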

You get a production-grade OpenTelemetry Collector configuration (otel-collector.yaml). It receives OTLP metrics, applies resource processors, and exports to Prometheus via the prometheus exporter. It includes real pipeline routing and metric filtering, so you're not drowning in untagged data. The resource processors automatically inject service name, version, and environment labels, ensuring that your metrics are queryable without manual intervention. If you're setting up your monitoring stack with Grafana, the Setting Up Monitoring With Grafana skill will help you visualize these metrics effectively.
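
For orientation, a minimal Collector pipeline of that shape looks roughly like this. It is a simplified sketch, not the shipped template, which layers filtering and routing on top:

```yaml
# Minimal OTLP-to-Prometheus Collector pipeline (simplified sketch).
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  resource:
    attributes:
      - key: deployment.environment
        value: production
        action: upsert           # stamps the environment label onto every metric

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"     # scrape target exposed to the Prometheus server

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [resource]
      exporters: [prometheus]
```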

You get canonical PromQL patterns in prometheus-rules.yaml. We've implemented recording and alerting rules for all four signals using histogram_quantile for latency percentiles, rate() for traffic and error rates, and info() for label enrichment. You don't have to memorize the PromQL syntax or worry about off-by-one errors in your window functions. The rules are pre-configured to alert on p95 and p99 latency, error budgets burning faster than expected, and resource saturation approaching critical thresholds.
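
To give a flavor of those patterns, recording and alerting rules for latency, traffic, and errors typically take the following shape (the metric names here are illustrative):

```yaml
groups:
  - name: golden-signals
    rules:
      # Latency: p99 per service, computed from histogram buckets
      - record: service:http_request_duration_seconds:p99
        expr: >
          histogram_quantile(0.99,
            sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))
      # Traffic: requests per second, per service
      - record: service:http_requests:rate5m
        expr: sum by (service) (rate(http_requests_total[5m]))
      # Errors: page when the 5xx ratio exceeds 1% for 10 minutes
      - alert: HighErrorRate
        expr: >
          sum by (service) (rate(http_requests_total{status_class="5xx"}[5m]))
            / sum by (service) (rate(http_requests_total[5m])) > 0.01
        for: 10m
        labels:
          severity: page
```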

You get a validation script (validate-golden-signals.sh) that parses a service's golden signals YAML definition, checks for coverage of all four signals, validates PromQL syntax structure, and exits non-zero if anything is missing. This is your CI/CD gatekeeper. If a developer tries to merge code that doesn't define the golden signals, the build fails. No more debates about whether a metric is "good enough." The script also validates the definition against the JSON Schema, so a malformed configuration file is caught long before it reaches the Prometheus server.
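
In CI, the gate can be as simple as running the script against each service's definition and failing the pipeline on a non-zero exit. The argument interface shown below is an assumption, so check the script's usage output:

```sh
# Hypothetical CI step; the script's argument interface is assumed here.
scripts/validate-golden-signals.sh services/checkout-api/golden-signals.yaml \
  || { echo "golden signals definition incomplete or invalid"; exit 1; }
```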

You get a JSON Schema (signal-schema.json) that strictly defines the structure of your golden signals configuration. It ensures required fields for latency, traffic, errors, and saturation are present and correctly typed. This is how you enforce standards at scale. The schema will catch typos, missing fields, and incorrect data types before they make it into production.
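
Schematically, the enforcement looks like the trimmed sketch below, continuing the hypothetical field names from the earlier example; the shipped schema is stricter and covers all four signals in full:

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "required": ["service", "signals"],
  "properties": {
    "signals": {
      "type": "object",
      "required": ["latency", "traffic", "errors", "saturation"],
      "properties": {
        "latency": {
          "type": "object",
          "required": ["instrument", "metric"],
          "properties": {
            "instrument": { "const": "Histogram" },
            "metric": { "type": "string" }
          }
        }
      }
    }
  }
}
```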

And you get a complete, production-ready worked example (worked-example.yaml) that shows exactly how a microservice defines its golden signals, maps them to an OpenTelemetry instrumentation scope, and links them to Prometheus recording rules. It's the reference implementation you can copy, adapt, and ship.

This workflow aligns perfectly with chaos engineering practices. Once you have your golden signals defined, you can use the Chaos Engineering skill to introduce failures and observe how your metrics respond, ensuring your alerts actually catch real-world incidents. If you're building an internal developer platform, the Internal Developer Platform skill provides a structured workflow for selecting core components and designing self-service capabilities, which can include this golden signals playbook as a default template for all new services.

What's in the SRE Golden Signals Playbook

  • skill.md — Orchestrator skill that defines the SRE Golden Signals framework, maps them to OpenTelemetry SDK instruments and Prometheus metric types, and instructs the agent on how to use the templates, validators, and examples to implement production-grade observability.
  • templates/otel-collector.yaml — Production-grade OpenTelemetry Collector configuration that receives OTLP metrics, applies resource processors, and exports to Prometheus via the prometheus exporter. Includes real pipeline routing and metric filtering.
  • templates/prometheus-rules.yaml — Prometheus recording and alerting rules implementing the four golden signals. Uses canonical PromQL patterns: histogram_quantile() for quantile calculation, rate() for traffic and error rates, and info() for label enrichment.
  • references/canonical-knowledge.md — Embedded authoritative knowledge covering Google SRE's four golden signals, OpenTelemetry SDK instrument types (Counter, Histogram, Gauge), Prometheus metric semantics, and PromQL query best practices. No external links.
  • scripts/validate-golden-signals.sh — Executable validation script that parses a service's golden signals YAML definition, checks for coverage of all four signals, validates PromQL syntax structure, and exits non-zero if any signal or rule is missing/invalid.
  • validators/signal-schema.json — JSON Schema that strictly defines the structure of a golden signals configuration file, ensuring required fields for latency, traffic, errors, and saturation are present and correctly typed.
  • examples/worked-example.yaml — Complete, production-ready example of a microservice defining its golden signals, corresponding OpenTelemetry instrumentation scope, and Prometheus recording rules. Serves as the reference implementation.
  • tests/test-validation.sh — Automated test suite that runs the validator against the worked example (expecting success) and a deliberately broken config (expecting failure), asserting correct exit codes to guarantee the toolchain works.

Install and Ship

Stop guessing what your system is doing. Start measuring the four signals that actually matter. Upgrade to Pro to install the SRE Golden Signals Playbook and ship services with production-grade observability from day one.

---

References

  1. Chapter 6: Monitoring Distributed Systems — sre.google
  2. Monitoring Systems with Advanced Analytics — sre.google
  3. Observability for Kubernetes Applications (Golden Signals) — community.cncf.io
  4. loveholidays used Linkerd to boost observability — cncf.io
  5. Reducing MTTD and increasing observability with Linkerd at loveholidays — cncf.io

Frequently Asked Questions

How do I install SRE Golden Signals Playbook?

Run `npx quanta-skills install sre-golden-signals-pack` in your terminal. The skill will be installed to ~/.claude/skills/sre-golden-signals-pack/ and automatically available in Claude Code, Cursor, Copilot, and other AI coding agents.

Is SRE Golden Signals Playbook free?

SRE Golden Signals Playbook is a Pro skill, available on the $29/mo Pro plan; a Pro subscription is required to access it. Browse 37,000+ free skills at quantaintelligence.ai/skills.

What AI coding agents work with SRE Golden Signals Playbook?

SRE Golden Signals Playbook works with Claude Code, Cursor, GitHub Copilot, Gemini CLI, Windsurf, Warp, and any AI coding agent that reads skill files. Once installed, the agent automatically gains the expertise defined in the skill.