Rapid Recovery: Runbooks for Outages Caused by Third-Party CDN and DNS Providers
Practical SRE runbooks and automation playbooks to recover from CDN and DNS provider outages—checklists, scripts, and 2026 best practices.
When your CDN or DNS provider goes dark, traffic evaporates and SLAs break — but you can limit the blast radius and restore service in minutes with the right runbooks and automation. This guide gives SREs pragmatic, tested playbooks, checklists, and automation scripts for recovering from CDN and DNS outages in 2026.
Why this matters now (2026 context)
Large, public outages in late 2025 and early 2026 — involving major CDN and cybersecurity edge providers — demonstrated that even top-tier networks can fail at internet scale. These incidents reinforced three trends that shape recovery strategy today:
- Multi-CDN and multi-DNS are mainstream: Adoption of multi-vendor edge and DNS strategies rose sharply through 2024–2026 as teams demand resiliency and vendor portability.
- BGP and DNS security advances: Wider RPKI validation and DNS-over-HTTPS/TLS adoption reduce attack surface but also add integration complexity during failover.
- Automation-first SREs: Teams expect automated failover, verified by synthetic checks and CI/CD pipelines, with clear runbooks for human oversight.
Incident taxonomy: CDN outage vs DNS failures (quick primer for runbooks)
Recovery steps differ depending on whether the outage is a CDN outage (edge caching, WAF, route-to-origin) or a DNS failure (resolution failures or poisoning). Classify the incident quickly — it dictates your tools and order of operations; a quick classification sketch follows the two lists below.
CDN outage characteristics
- HTTP 5xx or connection errors served from edge
- Customers report content failing to load, while the origin remains reachable directly
- Provider status page indicates degraded edge POPs
DNS failure characteristics
- NXDOMAIN, SERVFAIL, or timeouts from public resolvers
- Some regions resolve, others do not (often due to anycast/DNS disruption)
- WHOIS/registrar-side changes or provider API errors
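A minimal classification sketch, assuming you already know a direct origin IP (example.com and 203.0.113.10 below are placeholders): if the name resolves but HTTP via the edge fails while the origin answers when pinned directly, suspect the CDN; if resolution itself fails across multiple public resolvers, suspect DNS.
#!/bin/bash
# Rough first-pass classifier: DNS incident vs CDN incident. Adapt the placeholders.
set -u
DOMAIN="example.com"
ORIGIN_IP="203.0.113.10"
if ! dig @1.1.1.1 +short "$DOMAIN" | grep -q . && ! dig @8.8.8.8 +short "$DOMAIN" | grep -q .; then
  echo "Resolution failing on multiple public resolvers -> treat as a DNS incident"
  exit 0
fi
# Name resolves: compare HTTP via the normal (edge) path with HTTP pinned to the origin.
# A status of 000 means the connection itself failed.
edge=$(curl -o /dev/null -s -w '%{http_code}' --connect-timeout 5 "https://$DOMAIN/")
origin=$(curl -o /dev/null -s -w '%{http_code}' --connect-timeout 5 \
  --resolve "$DOMAIN:443:$ORIGIN_IP" "https://$DOMAIN/")
echo "edge=$edge origin=$origin"
if [[ "$edge" != 2* && "$origin" == 2* ]]; then
  echo "Edge failing while origin is healthy -> treat as a CDN incident"
fi
Treat the output as a hint, not a verdict; confirm against provider status pages before acting.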
Core SRE runbook: high-level recovery flow
Use this inverted-pyramid flow during the first 30–90 minutes. Automation should handle routine steps; humans validate and communicate.
- Detect & Alert — synthetic monitors, DNS health checks, and CDN edge telemetry.
- Validate — confirm local vs global impact with dig, curl, and public resolvers.
- Mitigate — activate failover (alternate CDN or direct-to-origin routing), make temporary DNS changes, or steer traffic via BGP communities or load-balancer rules.
- Communicate — inform stakeholders and customers using pre-approved templates and status page updates.
- Stabilize — monitor until steady-state, then run post-incident analysis and remediation tasks.
Severity levels and goals
- P1 (sev-1): Global outage with revenue/critical user impact — target RTO 15–30 minutes.
- P2 (sev-2): Regional outage or degraded performance — target RTO 1–3 hours.
- P3 (sev-3): Minor impact or monitoring noise — investigate and remediate in next sprint.
Runbook: Immediate triage checklist (first 10 minutes)
Keep this checklist printed in your incident war room and as the first steps in your automated pager message.
- Record the time and incident ID in the incident tracker.
- Run basic diagnostics from multiple vantage points:
- dig +trace example.com @1.1.1.1
- dig +short NS example.com
- curl -v https://example.com/ --resolve example.com:443:origin-ip
- mtr or traceroute to origin and edge IPs
- Check provider status pages & public outage reports (Twitter/X, DownDetector) for correlated incidents.
- Switch to alternate monitoring baselines (synthetic checks from other regions).
- Engage on-call CDN/DNS escalation contacts if you have them.
Commands to run immediately (copy/paste)
# Resolve from Cloudflare public resolver
dig @1.1.1.1 +short example.com
# Trace resolution path
dig +trace example.com
# Check HTTP from a public vantage (use curl or httping)
curl -I -sS https://example.com/ --connect-timeout 5
# Test DNS from Google's resolver
dig @8.8.8.8 +short example.com
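The triage checklist also calls for mtr or traceroute; a path check toward the origin and a suspect edge IP rounds out the copy/paste set (both IPs are placeholders):
# Path quality to the origin and to a suspect edge IP
mtr --report --report-cycles 10 203.0.113.10
mtr --report --report-cycles 10 198.51.100.7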
Automated mitigation playbooks (scripts and patterns)
Automation reduces manual error and speeds recovery. Below are actionable playbooks for common failovers.
Playbook A — DNS provider outage: switch NS + A records to alternate provider (Route53 example)
Prerequisites: Alternative DNS zone pre-created in Route53 with health-checked A/AAAA/CNAME records. Registrar allows changing NS quickly.
Script (AWS CLI):
#!/bin/bash
# Usage: ./failover-dns.sh example.com
set -euo pipefail
DOMAIN="${1:?usage: $0 domain}"
ALT_ZONE_ID="ZALTEXAMPLE123"   # pre-provisioned standby hosted zone
FAILOVER_IP="203.0.113.10"     # health-checked failover target
# 1) Upsert the A record in the standby zone with a low TTL (if not already low)
aws route53 change-resource-record-sets \
  --hosted-zone-id "$ALT_ZONE_ID" \
  --change-batch '{"Changes":[{"Action":"UPSERT","ResourceRecordSet":{"Name":"'"$DOMAIN"'","Type":"A","TTL":60,"ResourceRecords":[{"Value":"'"$FAILOVER_IP"'"}]}}]}'
# 2) Update NS records at the registrar to the alternate provider's nameservers
#    This step is registrar-specific — use their API or console; template for the manual change:
echo "Registrar: update NS for $DOMAIN to ns1.alt-dns.example, ns2.alt-dns.example"
Notes: Registrar NS changes propagate at TLD resolvers and can take time. Pre-provisioned dual-zone approach or glue records at registrar can reduce latency.
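After the registrar change, you can watch the new delegation propagate by querying the TLD's own nameservers rather than recursive resolvers. A small verification sketch (the domain and polling interval are placeholders):
#!/bin/bash
# Poll the parent TLD's nameservers until they return the updated NS delegation.
set -u
DOMAIN="example.com"
TLD="${DOMAIN##*.}"
PARENT_NS=$(dig +short NS "$TLD." | head -1)
while true; do
  echo "--- $(date -u +%H:%M:%SZ) asking $PARENT_NS"
  dig @"$PARENT_NS" NS "$DOMAIN" +noall +authority
  sleep 30
done
Stop the loop once both alternate nameservers appear consistently in the delegation.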
Playbook B — CDN outage: fail traffic to alternate CDN or direct-to-origin
Two patterns: DNS-based failover (fast for domain-level switching) or reverse-proxy/LoadBalancer routing (faster if you control edge LB). If your CDN supports origin-pull bypass, use it to route traffic directly to origin.
Example: toggling Cloudflare's proxy off via the API so the record points straight at the origin (the same pattern applies to other providers):
# Toggle proxy off (bounce to origin)
curl -X PATCH "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records/$RECORD_ID" \
-H "Authorization: Bearer $CF_API_TOKEN" \
-H "Content-Type: application/json" \
--data '{"proxied":false}'
For multi-CDN setups, use your control plane (Fastly/NS1/Gcore proprietary APIs) or your own traffic manager to steer away from failing POPs. Keep an automated fallback that reduces TTLs and updates DNS records quickly.
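A minimal DNS-based steer away from a failing CDN, reusing Route53 as the control plane from Playbook A (the zone ID and CNAME target are placeholders):
#!/bin/bash
# Point www at the standby CDN's hostname with a short TTL.
set -euo pipefail
ZONE_ID="ZEXAMPLE123"
RECORD="www.example.com"
STANDBY_CDN_TARGET="www.example.com.standby-cdn.example.net"
aws route53 change-resource-record-sets \
  --hosted-zone-id "$ZONE_ID" \
  --change-batch '{"Changes":[{"Action":"UPSERT","ResourceRecordSet":{"Name":"'"$RECORD"'","Type":"CNAME","TTL":60,"ResourceRecords":[{"Value":"'"$STANDBY_CDN_TARGET"'"}]}}]}'
Roll back by re-upserting the original record once the primary CDN recovers.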
Playbook C — BGP-level traffic steering (for direct IP failovers)
If you operate your own ASN or have BGP controls with cloud providers, pre-seed alternative prefixes and use community tags to steer traffic. This is advanced and needs pre-authorization and tested IRR/RPKI state.
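If your edge routers run FRR, one common pre-authorized pattern is prepending your AS on announcements toward the transit you want traffic to move away from. A heavily hedged sketch (ASN 64500, peer 192.0.2.1, and the route-map name are placeholders; use only within a pre-approved change coordinated with NetOps):
#!/bin/bash
# Apply an AS-path prepend on announcements to one peer, then soft-reset outbound.
set -euo pipefail
vtysh \
  -c 'configure terminal' \
  -c 'route-map PREPEND-PRIMARY permit 10' \
  -c 'set as-path prepend 64500 64500 64500' \
  -c 'exit' \
  -c 'router bgp 64500' \
  -c 'address-family ipv4 unicast' \
  -c 'neighbor 192.0.2.1 route-map PREPEND-PRIMARY out' \
  -c 'exit-address-family' \
  -c 'end'
# Re-advertise without dropping the session
vtysh -c 'clear ip bgp 192.0.2.1 soft out'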
Kubernetes and cloud-native considerations
For SREs running services on Kubernetes, ensure your ingress and DNS operators support rapid flipping:
- Use ExternalDNS with multiple providers configured; prefer dry-run validation and pre-staged, least-privilege credentials for failover.
- Keep readiness probes and canary deployments to detect CDN/DNS impacts early.
- Use service meshes to route around edge caching failures by bypassing CDN for critical API paths.
Example: ExternalDNS annotation to switch providers
apiVersion: v1
kind: Service
metadata:
  name: web
  annotations:
    external-dns.alpha.kubernetes.io/hostname: example.com
    external-dns.alpha.kubernetes.io/provider: route53
spec:
  type: LoadBalancer
  ports:
    - port: 80
Communication templates (SRE–Ops cadence)
Clear communication reduces duplicate work and customer frustration. Use pre-approved templates in your runbook for speed.
Status update template (first 15 minutes)
Summary: We are investigating connectivity issues to example.com affecting web access. Initial triage indicates CDN provider may be experiencing edge disruptions in the US and EU. We are validating and will publish an update in 15 minutes. Incident ID: INC-2026-001.
Customer-facing update (30–60 minutes)
We’re currently routing critical traffic to an alternate path to restore service. Some cached content may be stale. We estimate partial restoration within X minutes. For status, see: https://status.example.com/INC-2026-001
Testing your runbooks (don’t wait for a real outage)
Runbooks are only useful when practiced. Create a regular game-day schedule that includes:
- Simulated DNS provider outage: perform registrar NS swaps in a private test TLD or subdomain.
- CDN POP failure simulation: use traffic shaping and IP blocking to emulate edge loss (see the sketch after this list).
- Failure drills for BGP and RPKI edge cases (coordinate with NetOps).
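A minimal IP-blocking sketch for the POP failure drill above, intended for test hosts only (the edge IPs are placeholders):
#!/bin/bash
# Game-day helper: emulate loss of specific edge/provider IPs by dropping traffic to them.
set -euo pipefail
EDGE_IPS=("198.51.100.7" "198.51.100.8")
for ip in "${EDGE_IPS[@]}"; do
  sudo iptables -A OUTPUT -d "$ip" -j DROP
done
echo "Blocked ${#EDGE_IPS[@]} edge IPs; run synthetic checks now."
read -r -p "Press Enter to restore connectivity... "
for ip in "${EDGE_IPS[@]}"; do
  sudo iptables -D OUTPUT -d "$ip" -j DROP
done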
Automation test checklist
- Automated playbooks complete within the target RTO (a timing sketch follows this checklist).
- Monitoring and synthetic checks detect the change and validate recovery.
- Rollback procedures tested: revert DNS and CDN toggles safely.
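A simple timing harness for the first two items, assuming a playbook script like the failover-dns.sh sketch in Playbook A and a known failover IP (both placeholders):
#!/bin/bash
# Time the playbook, then poll until a public resolver returns the failover IP.
set -euo pipefail
DOMAIN="example.com"
EXPECTED_IP="203.0.113.10"
TARGET_RTO_SECONDS=900   # 15 minutes for a P1
start=$(date +%s)
./failover-dns.sh "$DOMAIN"
until dig @1.1.1.1 +short "$DOMAIN" | grep -q "$EXPECTED_IP"; do
  sleep 10
done
elapsed=$(( $(date +%s) - start ))
echo "Failover visible after ${elapsed}s (target ${TARGET_RTO_SECONDS}s)"
[ "$elapsed" -le "$TARGET_RTO_SECONDS" ] || { echo "FAILED: exceeded target RTO"; exit 1; }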
Observability & pre-incident configuration
Build observability that differentiates between DNS and CDN problems to avoid chasing the wrong provider. Key signals include:
- DNS: Increased SERVFAILs and timeouts, inconsistent resolution across resolvers.
- CDN: Edge 5xx spikes correlated with specific POPs, origin reachability still OK.
- Active synthetic checks from multiple ASN vantage points and RUM sampling for real-user impact.
Suggested checks to deploy
- DNS: Global DNS resolution checks (1.1.1.1, 8.8.8.8, 9.9.9.9) every 30s
- HTTP synthetic checks through multiple CDNs and direct-to-origin (a cron-able sketch follows this list)
- Edge POP health metrics with anomaly detection
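A cron-able sketch of the DNS and HTTP checks above, emitting one line per probe for your metrics pipeline (the domain, origin URL, and output format are placeholders; most teams will run these through a dedicated synthetic-monitoring tool instead):
#!/bin/bash
# Probe public resolvers and HTTP endpoints; print timestamped results.
set -u
DOMAIN="example.com"
ORIGIN_URL="https://origin.example.com/healthz"
for resolver in 1.1.1.1 8.8.8.8 9.9.9.9; do
  answer=$(dig @"$resolver" +short +time=2 +tries=1 "$DOMAIN" | head -1)
  echo "$(date -u +%FT%TZ) dns resolver=$resolver answer=${answer:-FAIL}"
done
for url in "https://$DOMAIN/" "$ORIGIN_URL"; do
  # 000 means the connection itself failed
  code=$(curl -o /dev/null -s -w '%{http_code}' --connect-timeout 5 "$url")
  echo "$(date -u +%FT%TZ) http url=$url status=$code"
done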
Security, compliance, and auditability
Outages can be caused by misconfigurations and security controls. Ensure your runbooks also capture:
- Authentication steps for provider API calls (rotate keys, use least privilege, log all actions — a logging sketch follows this list).
- Audit trails for DNS/registrar changes — registrar API calls and DNSSEC keys must be logged and stored in a tamper-evident system.
- Post-incident attestations: record who executed automated playbooks and why.
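A minimal sketch of the "log all actions" point: wrap provider API calls so every invocation records operator, timestamp, command, and exit code (the log path is a placeholder; in production, ship these entries to tamper-evident storage):
#!/bin/bash
# audited-call.sh: run any provider CLI/API command and append an audit record.
set -euo pipefail
AUDIT_LOG="/var/log/incident-actions.log"
printf '%s user=%s cmd=%q\n' "$(date -u +%FT%TZ)" "$(whoami)" "$*" | sudo tee -a "$AUDIT_LOG" >/dev/null
rc=0
"$@" || rc=$?
printf '%s user=%s exit=%d\n' "$(date -u +%FT%TZ)" "$(whoami)" "$rc" | sudo tee -a "$AUDIT_LOG" >/dev/null
exit "$rc"
Usage: ./audited-call.sh aws route53 change-resource-record-sets ... so the audit trail and the action cannot drift apart.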
Post-incident: root cause and remediation runbook
After restoring service, follow a structured postmortem process:
- Collect logs and API call history from CDN, DNS, registrar, and your automation systems.
- Map timeline and cross-check synthetic checks and RUM data.
- Identify single points of failure and prioritize fixes: shorten TTLs, add alternate DNS providers, or update BGP policies.
- Implement change requests in a canaryed CI/CD flow; schedule a follow-up game day.
Sample post-incident action items
- Reduce authoritative TTLs for critical records to 60s during business hours for faster failover.
- Fully provision multi-DNS and multi-CDN with documented rollback plans.
- Automate status-page updates from incident manager to reduce manual latency.
Advanced patterns & future-proofing (2026-forward)
Plan for resilience beyond basic failover:
- Edge compute portability: Containerize logic at the edge so you can run the same app across CDNs.
- Control-plane abstraction: Use a multi-CDN control plane to centralize policies, telemetry, and programmatic failover.
- Infrastructure as Code for DNS: Keep DNS zones as code (OctoDNS, Terraform) and version-controlled to accelerate fixes and audits; a CLI sketch follows this list.
- Carbon-aware caching: Factor emissions-aware cache placement and purge scheduling into capacity and failover planning so resilience changes don't undercut sustainability goals.
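A hedged example of the zones-as-code workflow using OctoDNS's CLI (the config path is a placeholder): plan first, apply only after review.
# Dry run: show the plan (diff between zone files in git and the live providers)
octodns-sync --config-file ./dns/config.yaml
# Apply after review/approval
octodns-sync --config-file ./dns/config.yaml --doit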
Checklist: Runbook readiness self-audit
- Do we have multi-DNS and multi-CDN pre-provisioned and health-checked?
- Are TTLs and DNSSEC configured for rapid and secure failover?
- Are automation playbooks stored in a secure repo with runbook traceability and approvals?
- Do we have escalations documented with provider support SLAs and contacts?
- Have we practiced failover in the last 90 days?
Appendix: Useful scripts and templates
1) Quick DNS check script
#!/bin/bash
DOMAIN=$1
echo "Checking DNS for $DOMAIN"
echo "1.1.1.1:"; dig @1.1.1.1 +short $DOMAIN
echo "8.8.8.8:"; dig @8.8.8.8 +short $DOMAIN
echo "Trace:"; dig +trace $DOMAIN
2) Toggle CDN proxy (generic API template)
# Replace provider, zone, and API token
curl -X PATCH "https://api.provider.example/v1/zones/$ZONE/dns/$RECORD" \
-H "Authorization: Bearer $API_TOKEN" \
-H "Content-Type: application/json" \
--data '{"proxied":false}'
3) Incident communication checklist (for war room)
- Declare severity and assign incident commander
- Run triage commands and paste outputs into incident timeline
- Execute automation playbooks (or run dry-run first if unsure)
- Publish status page and update every 15 minutes
- Prepare postmortem owner and deadline
Final takeaways
Outages in 2025–2026 proved that even trusted CDN and DNS providers can fail at scale. The difference between a prolonged outage and a contained incident is how prepared your SRE team is — with tested runbooks, automated playbooks, and practiced communications. Build for multi-provider resilience, automate safe failovers, and treat runbook drills as part of your CI/CD pipeline.
Call to action
Start a game-day this week: provision a subdomain with a second DNS provider, script a DNS toggle, and test synthetic checks from three ASNs. If you want a ready-made checklist and CI-ready automation templates, download our free SRE CDN & DNS Failover kit of edge-first developer playbooks and templates, and run a guided drill with your team.
Related Reading
- Edge Containers & Low-Latency Architectures for Cloud Testbeds — Evolution and Advanced Strategies (2026)
- Edge Auditability & Decision Planes: An Operational Playbook for Cloud Teams in 2026
- Product Review: ByteCache Edge Cache Appliance — 90‑Day Field Test (2026)
- Carbon‑Aware Caching: Reducing Emissions Without Sacrificing Speed (2026 Playbook)
- SLA Scorecard: How to Compare Hosting Providers by Real‑World Outage Performance