May 31, 2025

GitHub Database Outage – November 27, 2020

Incident Overview

On November 27, 2020, GitHub, the world’s largest code repository and DevOps platform, experienced a major service outage that lasted over 14 hours. For millions of developers, software builds stalled, pull requests were blocked, and CI/CD pipelines failed.

At the heart of the failure was a database storage capacity issue, which led to performance degradation and eventual unavailability of several core services.


Timeline of the Incident

  • 15:45 UTC: GitHub monitoring detects latency in the mysql1 cluster for GitHub Actions.

  • 16:10 UTC: Elevated error rates reported in GitHub Actions workflows and webhooks.

  • 17:30 UTC: Multiple services impacted, including Pull Requests, GitHub Actions, and Webhooks.

  • 21:00 UTC: Mitigation attempts begin: throttling, replication tuning, and offloading reads.

  • 02:15 UTC (next day): Recovery operations start restoring functionality.

  • 05:30 UTC: Services fully restored; RCA initiated.

 

Technical Breakdown

Root Cause: Database Storage Exhaustion

GitHub’s internal infrastructure relies heavily on MySQL clusters with distributed storage and replication. On this day, a critical mysql1 storage node reached 95%+ disk utilization — far exceeding safe thresholds.
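
To make the numbers concrete, here is a minimal sketch of the kind of capacity check involved. The 90% warning and 95% critical thresholds and the MySQL data-directory path are illustrative assumptions, not GitHub’s actual configuration.

```python
import shutil

# Hypothetical thresholds; GitHub's real values are not public.
WARN_PCT = 0.90
CRIT_PCT = 0.95

def check_disk(path="/var/lib/mysql"):
    """Return (used_fraction, level) for the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    used = usage.used / usage.total
    if used >= CRIT_PCT:
        level = "CRITICAL"
    elif used >= WARN_PCT:
        level = "WARNING"
    else:
        level = "OK"
    return used, level

if __name__ == "__main__":
    used, level = check_disk()
    print(f"{level}: data volume is {used:.1%} full")
```

Static checks like this are necessary, but as the lessons later in this post argue, a fixed threshold alone says nothing about how fast the remaining headroom is disappearing.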

What Went Wrong Technically?

  • Replication Lag: Write-heavy loads triggered increased replication delays between the master and replicas (a lag check along these lines is sketched after this list).

  • Lock Contention: High disk I/O and lag caused InnoDB locks, leading to blocked queries and timeouts.

  • GitHub Actions Queues: Task runners for CI/CD workflows got backed up, resulting in failed or delayed actions.

  • Monitoring Blind Spot: Alerting thresholds for storage and replication lag were set too leniently, delaying escalation.
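
As referenced in the first bullet above, here is a rough sketch of a replication-lag check against a MySQL replica using the pymysql client. The host, credentials, and the 30-second threshold are placeholders; it reads the classic SHOW SLAVE STATUS output, whose Seconds_Behind_Master field is the usual (if imperfect) lag signal.

```python
import pymysql

LAG_THRESHOLD_SECONDS = 30  # placeholder threshold, not GitHub's

def replica_lag_seconds(host, user, password):
    """Return Seconds_Behind_Master for a replica, or None if replication is stopped."""
    conn = pymysql.connect(host=host, user=user, password=password,
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW SLAVE STATUS")
            row = cur.fetchone()
            return row["Seconds_Behind_Master"] if row else None
    finally:
        conn.close()

if __name__ == "__main__":
    # Hypothetical replica host and monitoring credentials.
    lag = replica_lag_seconds("replica-1.example.internal", "monitor", "secret")
    if lag is None:
        print("ALERT: replication is not running")
    elif lag > LAG_THRESHOLD_SECONDS:
        print(f"ALERT: replica is {lag}s behind")
    else:
        print(f"OK: replica is {lag}s behind")
```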


Incident Management Breakdown

Detection

  • Alerting from Prometheus and Grafana dashboards flagged replication lag in the mysql1 cluster (a sample query against the Prometheus HTTP API follows this list).

  • User complaints and Twitter reports about GitHub Actions failures reinforced the internal indicators.
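
To illustrate the detection side, here is a sketch of polling the Prometheus HTTP API for a lag metric. The endpoint, the mysql_slave_status_seconds_behind_master metric, and the cluster label are assumptions about a typical mysqld_exporter setup, not details taken from GitHub’s report.

```python
import requests

PROM_URL = "http://prometheus.example.internal:9090"  # placeholder endpoint
QUERY = 'mysql_slave_status_seconds_behind_master{cluster="mysql1"}'  # assumed metric and label

def current_lag_by_instance():
    """Run an instant query against Prometheus and return {instance: lag_seconds}."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=5)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return {r["metric"].get("instance", "unknown"): float(r["value"][1]) for r in result}

if __name__ == "__main__":
    for instance, lag in current_lag_by_instance().items():
        print(f"{instance}: {lag:.0f}s behind")
```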

Triage

  • Incident declared SEV-1 by GitHub's Site Reliability Engineering (SRE) team.

  • Database teams, CI/CD pipeline owners, and the Platform Infrastructure group joined the bridge.

  • Initial focus was isolating whether the issue was localized to one cluster or cascading across services.

Investigation

  • Engineers found that the primary storage node of mysql1 was nearing disk exhaustion, causing IO waits and deadlocks (a query for spotting blocked transactions is sketched after this list).

  • Concurrent background jobs and backup operations worsened IOPS saturation.

  • Job queues (for GitHub Actions, webhooks) piled up due to failed writes and slow query responses.
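
In an investigation like this, one quick way to confirm lock contention is to list the transactions stuck in lock wait via information_schema.innodb_trx. A short sketch, reusing the placeholder connection details from the earlier lag check:

```python
import pymysql

BLOCKED_TRX_QUERY = """
SELECT trx_id, trx_state, trx_started, trx_wait_started, trx_query
FROM information_schema.innodb_trx
WHERE trx_state = 'LOCK WAIT'
ORDER BY trx_wait_started
"""

def blocked_transactions(host, user, password):
    """Return InnoDB transactions currently waiting on locks."""
    conn = pymysql.connect(host=host, user=user, password=password,
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute(BLOCKED_TRX_QUERY)
            return cur.fetchall()
    finally:
        conn.close()

if __name__ == "__main__":
    # Hypothetical primary host and monitoring credentials.
    for trx in blocked_transactions("mysql1-primary.example.internal", "monitor", "secret"):
        print(f"{trx['trx_id']} waiting since {trx['trx_wait_started']}: {trx['trx_query']}")
```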

Mitigation & Recovery

  • Read traffic rerouted to healthier replicas (a simple replica-selection sketch follows this list).

  • Automated jobs paused to reduce write traffic.

  • Cold storage offloading for old logs and backup files started to free disk space.

  • Live replication rebalancing done to shift workload off impacted nodes.

  • Services were gradually restored after enough performance headroom was achieved.
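
Steering reads away from struggling nodes, the first step above, ultimately comes down to a routing policy. An illustrative sketch of one such policy follows; the replica names and the 10-second lag cutoff are invented for illustration.

```python
from typing import Optional

MAX_ACCEPTABLE_LAG = 10  # seconds; placeholder cutoff

def pick_read_replica(lag_by_replica: dict[str, Optional[float]]) -> Optional[str]:
    """Pick the least-lagged healthy replica, or None if no replica is safe to read from.

    `lag_by_replica` maps replica name -> lag in seconds (None means replication is broken).
    """
    healthy = {name: lag for name, lag in lag_by_replica.items()
               if lag is not None and lag <= MAX_ACCEPTABLE_LAG}
    if not healthy:
        return None  # fall back to the primary, or shed load
    return min(healthy, key=healthy.get)

if __name__ == "__main__":
    # Hypothetical lag readings, e.g. from the Prometheus query shown earlier.
    lags = {"replica-1": 2.0, "replica-2": 840.0, "replica-3": None}
    print(pick_read_replica(lags))  # -> replica-1
```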

Closure

  • GitHub issued a transparent and detailed post-incident report the next day.

  • Action items included improved storage alerting, automation in failover decisions, and better isolation of GitHub Actions from backend bottlenecks.


Business & Developer Impact

  • Services Affected: GitHub Actions, Pull Requests, Webhooks, API usage.

  • User Impact: Failed CI builds, blocked PR merges, delayed deployments across major enterprises.

  • Enterprise Effect: Multiple DevOps teams missed release windows due to failed builds.


Lessons for Incident Managers & SREs

Always Monitor Disk Utilization Trends

Storage capacity needs forecasting models, not just threshold alerts. GitHub’s incident emphasized predictive alerts instead of reactive ones.
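
Here is a small sketch of what "predictive" can mean in practice: fit a trend to recent disk-usage samples and estimate how many days remain before the volume is full. The sample data and the simple linear model are illustrative only.

```python
import numpy as np

def days_until_full(days, used_fraction):
    """Fit a linear trend to (day, used_fraction) samples and extrapolate to 100% usage."""
    slope, intercept = np.polyfit(days, used_fraction, deg=1)
    if slope <= 0:
        return float("inf")  # usage is flat or shrinking
    return (1.0 - intercept) / slope - days[-1]

if __name__ == "__main__":
    # Hypothetical daily samples: 78% -> 88% over a week.
    days = [0, 1, 2, 3, 4, 5, 6]
    usage = [0.78, 0.80, 0.81, 0.83, 0.85, 0.86, 0.88]
    print(f"Estimated days until full: {days_until_full(days, usage):.1f}")
```

A forecast like this can drive an alert such as "volume full in under 14 days", which fires long before a 95% usage threshold ever would.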

Build Decoupled Systems

GitHub Actions was too tightly coupled with the core MySQL cluster. A queueing buffer with retry mechanisms could’ve prevented build failures.
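
As a sketch of that decoupling idea, the worker below pulls CI jobs from an in-memory buffer and retries the database write with exponential backoff instead of failing the build the moment the database hesitates. The queue and the record_run function are stand-ins, not GitHub internals.

```python
import queue
import random
import time

def with_backoff(fn, attempts=5, base_delay=0.5):
    """Call fn(), retrying on failure with exponential backoff plus jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.random() * 0.1)

def record_run(job_id):
    """Stand-in for the database write that persists a workflow run."""
    print(f"persisted job {job_id}")

def worker(jobs):
    """Drain the buffer; a transient DB failure delays a job instead of failing it."""
    while not jobs.empty():
        job_id = jobs.get()
        with_backoff(lambda: record_run(job_id))
        jobs.task_done()

if __name__ == "__main__":
    jobs = queue.Queue()
    for i in range(3):
        jobs.put(f"workflow-{i}")
    worker(jobs)
```

In production the buffer would be a durable queue rather than an in-process one, so queued work survives a restart of the worker itself.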

Automate Recovery Playbooks

Manual rerouting and read-shifting cost GitHub hours. Having automated failover and replica scaling policies would have shortened MTTR.
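
Below is a sketch of codifying the failover decision that was made manually during the incident. The health signals and thresholds are assumptions, and production failover tooling (orchestrator for MySQL, for example) also handles fencing, topology discovery, and many edge cases this skips.

```python
from dataclasses import dataclass

# Placeholder policy limits, not GitHub's actual values.
MAX_DISK_USED = 0.95
MAX_LAG_SECONDS = 300

@dataclass
class NodeHealth:
    name: str
    disk_used: float          # fraction of the data volume in use
    lag_seconds: float        # replication lag (0 for the primary)
    replication_running: bool

def should_fail_over(primary, candidates):
    """Return the replica to promote if the primary is unhealthy, else None."""
    if primary.disk_used < MAX_DISK_USED:
        return None  # primary still has headroom; no action needed
    viable = [c for c in candidates
              if c.replication_running
              and c.lag_seconds <= MAX_LAG_SECONDS
              and c.disk_used < MAX_DISK_USED]
    if not viable:
        return None  # nothing safe to promote; page a human instead
    return min(viable, key=lambda c: c.lag_seconds)

if __name__ == "__main__":
    primary = NodeHealth("mysql1-primary", disk_used=0.97, lag_seconds=0, replication_running=True)
    replicas = [NodeHealth("replica-1", 0.70, 4, True), NodeHealth("replica-2", 0.72, 900, True)]
    choice = should_fail_over(primary, replicas)
    print(f"promote: {choice.name}" if choice else "no failover")
```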

Test Storage Failure Scenarios

Include storage IOPS starvation in chaos testing to see how services degrade and recover — a known blind spot in many DR drills.
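
One rough way to exercise that blind spot during a game day is to drive sustained fsync-heavy random writes against a scratch file on the database volume and watch how replication lag and queue depth respond. The path and sizing below are placeholders; a purpose-built tool such as fio gives far finer control.

```python
import os
import time

SCRATCH_FILE = "/var/lib/mysql-scratch/io_pressure.bin"  # placeholder path on the DB volume
BLOCK_SIZE = 4 * 1024          # 4 KiB writes stress IOPS rather than throughput
FILE_SIZE = 256 * 1024 * 1024  # keep the scratch file modest
DURATION_SECONDS = 60

def generate_io_pressure():
    """Issue fsync'd writes at random offsets for DURATION_SECONDS to saturate disk IOPS."""
    block = os.urandom(BLOCK_SIZE)
    deadline = time.time() + DURATION_SECONDS
    writes = 0
    with open(SCRATCH_FILE, "wb") as f:
        f.truncate(FILE_SIZE)
        while time.time() < deadline:
            f.seek(int.from_bytes(os.urandom(4), "big") % (FILE_SIZE - BLOCK_SIZE))
            f.write(block)
            f.flush()
            os.fsync(f.fileno())
            writes += 1
    os.remove(SCRATCH_FILE)
    print(f"issued {writes} fsync'd writes in {DURATION_SECONDS}s")

if __name__ == "__main__":
    generate_io_pressure()
```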


GitHub's Post-Outage Improvements

  • Tightened early-warning alert thresholds for disk usage and replication lag so alerts fire sooner.

  • Introduced automated offloading systems for stale data to secondary storage.

  • Separated GitHub Actions infrastructure to run on dedicated clusters.

  • Enhanced incident drill documentation with specific DB-recovery SOPs.


Career Cracker Insight

This incident teaches an essential truth: the best engineers are not those who prevent all problems — but those who know how to lead when systems collapse.

Want to lead major bridges and talk like a pro to DBAs, SREs, and product heads?

Join our Service Transition & Operations Management Program — backed by real-world incidents and hands-on case studies.

Free demo + 100% placement guarantee. Pay after you’re hired!