This page looks best with JavaScript enabled

Ops I did it again: The SRE Extension is out!

 ·  ☕ 8 min read  ·  💛 Riccardo

Ops I did it again: The SRE Extension is out!

When I met Ramon in a cafeteria here in Google Zurich, I wouldn’t have known that in a few months I would have released an official Open Source package to help SREs around the world! But life is full of surprises, and here we are.

When we talk about SRE (Site Reliability Engineering), we are talking about keeping systems alive under heavy pressure. SREs are constantly fighting outages, config drift, and alert storms. It is a world of pager alerts at 3 AM and frantic Slack threads.

Hero Image

What on Earth is the SRE Extension?

That is why we created the SRE Extension for the Gemini CLI and the Antigravity CLI (gemini and agy).

It is a bundle of specialized, agentic skills that allow AI assistants to perform operations tasks safely.

And now, we have expanded this support so it works perfectly and seamlessly with both the Antigravity CLI (agy), Claude Code and Codex.

The SRE Extension exposes a suite of specialized skills; let’s examine them in a tyipical chronological order:

Dozens od supported products

  • Investigation. investigation-entrypoint. This is where your SRE CLI journey starts, in a safe way. This skill is aware of the others and knows what to call:

    • cloud-logging, cloud-monitoring, and gcp-slo-management for getting all those nice o11y signals you care about, including Service Level Objectives.
    • GCP Playbooks tries to be the place to record our (and your) Playbooks. Currnetly supports Cloud Build, resource managher, GKE and Cloud Run.
    • monitoring-graphs to start and orchestrate outage triage, generate incident charts, and keep SRE investigations safe.
    • Generic Mitigation from Jen Mace article allows the system to classify the outage and find the best remediation.
  • Post Mortem. postmortem-generator to Create a Post Mortem.

    • There’s also a PoMo aggregator skill to do some simple BI / data crunching from an aggregate of Post Mortems.

You can inspect the open-source codebase on the Main SRE Extensions GitHub Repository.

Riccardo’s favorites

Two hidden gemstones:

  • cloud-monitoring. In order to safeguard token expenditure, I’m particularly proud for scripting a python CSV extractor which takes a generic multi-var timeseries and then shows the user a simple, cost-effective Sparkline, like |█▇▆▇ ▂▃ ▂ ▂|. Costs 10 tokens vs thousands (if not millions) for a single CSV!
  • monitoring-graphs actually vibe-codes python scripts amd creates amazing PoMo graphs which would probably make Ben Treynor cry of joy!

See 2 examples here (more exmaples in this repo):

alt text

agy pomo 01

What Can It Do for You? (The Low-Hanging Fruit)

You don’t need to build complex pipelines to get value out of this. Here are some of the immediate, “wow” low-hanging fruits that the SRE Extension can do for you right out of the box:

  1. GCP Config (powered by gcp-playbooks): The agent can inspect your GKE clusters and GCP projects, immediately flagging mismatched configurations, outdated images, or unattached disks. Oftentimes MCP makes the analysis more token-friendly than with just gcloud/kubectl.
  2. Smart Log/Metric Pulling (powered by cloud-logging and anomaly-detection): The investigator calls subagents to retrieve logs and metrics efficiently without bloating the context window.
  3. Automated Post-Mortems (PoMos) (powered by investigation-entrypoint and monitoring-graphs): As the agent investigates, it records system state and diagnostic outputs. At the end of the incident, it automatically drafts a structured markdown PostMortem document and generates annotated telemetry graphs.

Check out the About SRE Extension Page & PostMortems repository to see real examples of how these reports look!

Lesson learnt (todo move somewhere else)

Things I’ve learnt playing with Google harnesses and GCP investigations:

  • Monitoring and Logs are BIG. Ingesting them in the agent results in “tipsiness”. That’s way we’ve instructed the Investigator skill to call those with subagents, and use grep, jq and the magic export_timeseries_to_csv.py script to evince stats from huge metrics via NumPy.

Let’s Break Things: The GKE Demo Outage

SRE Extension Installation & Setup Walkthrough Video

To show you exactly how this works under stress, we recorded some walkthroughs and demo videos where we intentionally broke a live GKE cluster:

  • Demo 01: SRE Extension Demo Outage Investigation Video: Watch us triage an active outage in real-time:
    Demo 01 Video Thumbnail
  • Demo 02: SRE Extension Installation & Setup Walkthrough Video: See how to get up and running from scratch:
    Demo 02 Video Thumbnail

In these walkthroughs, you will see how the agent logs in, securely audits the deployment configuration, correlates the service errors, and isolates the broken workload—saving what would normally be 20 minutes of hunting in under two minutes.

For background context on how Google SRE teams are thinking about these workflows, check out the Google Cloud Blog article: How Google SREs Use Gemini CLI to Solve Real-World Outages.

I want to try your extension on your GKE outage, can I?

Sure you can!

If you are the type of developer or SRE who needs to see it to believe it—or if you want to test your own SRE playbooks and agents in a safe sandbox—we have open-sourced our testing suites:

  • SRE Testing Suite: Madhavi and I got you covered! Find it on the SRE Testing Suite GitHub Repository and just follow the instructions. You can deploy this testing suite to spin up mock environments, simulate active GKE outages, and observe how your AI agents handle incident response.

We’re also working on an GKE Outage Investigation Codelab to get an easy step-by-step codelab to experience this at your own time on your computer.

What Customers Want: Auto-Remediation (And Why It’s Incredibly Risky)

When I talk to Google Cloud customers, one request comes up again and again: “We don’t just want an AI assistant that tells us what’s wrong. We want auto-remediation. We want a magic button that fixes the incident at 3 AM.”

Triage fatigue is real, and nobody wants to be paged just to run the same five command-line diagnostics or manual restarts. SREs want full, end-to-end automation of their runbooks.

So, why not just build a loop that automatically edits your GKE deployments or restarts your VMs?

Because I believe that is an incredibly risky thing to do. Letting an LLM agent run wild with write permissions on your production cluster is a recipe for catastrophic, self-amplifying outages. One bad hallucination or out-of-date context window, and your agent might delete the wrong database or apply a broken config.

This is why I am thinking of a three-way vetted, safe auto-remediation architecture for write operations:

  1. Read-Only Operator: By default, the agent operates strictly in a read-only mode to analyze status and suggest fixes.
  2. Vetted Write Commands: If a remediation is suggested, it goes through a multi-agent vetting pipeline:
    • It is only allowed if the risk classification is flagged as LOW.
    • A secondary, always-fresh agent (with a clean context window and no token/instruction fatigue) independently reviews and verifies the proposed command.
    • If approved, it is executed with higher latency, but with maximum safety.
  3. Out-of-Band Confirmation: For higher-risk remediations, the system halts and asks the user for explicit confirmation. This confirmation can be approved out-of-band—for example, by replying to a GitHub Issue or typing a command in a Telegram bot (very much like we do in OpenClaw!).

If you want to play with the early, simulated prototypes of this approach:

Benjamin 1.0.0 screenshot

Here’s the architecture diagram (.dot and .png) if you’re curious to understand (and maybe improve!) the agent interaction:

Particular of the graph with just Ops and Planning

Fun fact: I’ve vibecoded this idea over telegram with my agent while cycling around Zurich (then refined later on a computer)

Here is a quick recap of the most important resources if you want to explore the SRE Extension further:

Behind the Scenes: The Team

This launch wasn’t a solo effort. A huge shoutout to the SRE and engineering minds who made this possible:

Share Your Thoughts!

We’d love to hear your thoughts and feedback on agentic operations and auto-remediation! Please fill out our SRE Extension Feedback Form to help us shape the future of automated operations.

Share on

Riccardo
WRITTEN BY
Riccardo
Developer Advocate, Google Cloud