Ensuring Resilient Search: How GitHub Enterprise Server Achieved High Availability
Search is the backbone of GitHub's user experience. It powers not only the obvious search bars and filters on the Issues page but also the Releases page, Projects page, and even the counts for issues and pull requests. Recognizing search's critical role, GitHub invested the past year in making it more durable for GitHub Enterprise Server instances. The goal: reduce administrator maintenance burden and increase uptime so teams can focus on what matters most to customers.
The Critical Role of Search in GitHub Enterprise Server
When you interact with GitHub Enterprise Server, search is silently at work behind nearly every interface. From browsing repositories to tracking project milestones, every filtered view and count depends on a healthy search index. In the past, however, maintaining that health required careful manual steps. Administrators had to follow maintenance and upgrade procedures in a precise order; any deviation could corrupt search indexes or lock them during upgrades.

The Old Architecture and Its Challenges
GitHub Enterprise Server deployments often rely on High Availability (HA) setups to ensure continuous operation even if a component fails. In a typical HA configuration, there is a primary node handling all writes and traffic, and one or more replica nodes that stay synchronized and can take over if the primary becomes unavailable.
The Elasticsearch Cluster Setup
GitHub uses Elasticsearch as its search database. In earlier versions, integrating Elasticsearch into the HA pattern proved difficult. Elasticsearch didn’t natively support a leader/follower model where only the primary accepts writes and replicas are read-only. To work around this, GitHub engineering created a single Elasticsearch cluster that spanned the primary and replica nodes. This made data replication straightforward and provided performance benefits, because each node could serve search requests locally.
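To make the old topology concrete, a spanning cluster of this kind is roughly what the following `elasticsearch.yml` fragment describes. This is an illustrative sketch only; the hostnames, node names, and cluster name are hypothetical, not GitHub Enterprise Server's actual configuration.

```yaml
# Hypothetical sketch of the old topology: one Elasticsearch cluster
# spanning both appliances, so shards can live on either node.
cluster.name: ghes-search            # hypothetical name
node.name: ghes-primary              # hypothetical name
discovery.seed_hosts:                # nodes that discover each other
  - ghes-primary.example.com
  - ghes-replica.example.com
cluster.initial_master_nodes:
  - ghes-primary
  - ghes-replica
```

With a configuration like this, Elasticsearch itself decides shard placement across both machines, which is exactly what made replication easy and maintenance fragile.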
The Deadlock Problem
The cross-server clustering soon introduced serious instability. Elasticsearch could move a primary shard—responsible for receiving and validating writes—to any node. If that shard ended up on a replica node that was taken down for maintenance, a deadlock occurred. The replica would wait for Elasticsearch to become healthy before starting, but Elasticsearch could not become healthy until the replica rejoined. This left the entire system in a locked state, requiring manual intervention.
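The circular dependency can be sketched as a few lines of Python. This is a simplified model of the failure mode described above, not GitHub's actual startup code: the replica's boot sequence blocks on cluster health, while cluster health requires the replica to have rejoined.

```python
# Minimal model of the startup deadlock (illustrative, not GHES code).

def cluster_healthy(joined_nodes: set, required_nodes: set) -> bool:
    """The spanning cluster reports healthy only when every node has rejoined."""
    return required_nodes.issubset(joined_nodes)

def replica_can_start(joined_nodes: set, required_nodes: set) -> bool:
    """The replica's boot sequence waits for the cluster to be healthy."""
    return cluster_healthy(joined_nodes, required_nodes)

required = {"primary", "replica"}
joined = {"primary"}  # replica is down for maintenance

# The replica cannot finish booting because the cluster is unhealthy...
assert not replica_can_start(joined, required)
# ...and the cluster stays unhealthy precisely because the replica never
# rejoins. Neither side can make progress without manual intervention.
```

If the primary shard happened to live only on the node that was taken down, no amount of waiting resolves the cycle; an administrator had to break it by hand.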
The Journey to a Better Solution
For several releases, GitHub engineers attempted to stabilize the clustered mode. They added checks to ensure Elasticsearch was healthy and implemented processes to correct drifting states. They even attempted to build a “search mirroring” system to move away from clustering entirely. However, replicating a search database at scale is hard to get right, and these early mirroring attempts could not keep the indexes consistent enough to replace clustering.

Early Efforts and Their Limitations
Despite significant work, the fundamental issues persisted. The cluster approach forced a trade-off between simplicity and reliability. Administrators had to remain vigilant, and any deviation from the upgrade script could trigger downtime. The need for a more durable architecture became clear.
The New High-Availability Search Architecture
After years of dedicated work, GitHub engineering rebuilt the search architecture for GitHub Enterprise Server. The new design eliminates the clustering deadlock by removing the need for the search database to span the primary and replica nodes at all. The core improvement is that search indexes are now managed independently on each node, with replication handled at the application level rather than inside the search database itself. Maintenance tasks can therefore be performed on replicas without risking a global lockup.
How It Works
In the rebuilt system, each node runs its own Elasticsearch instance in a dedicated, non-clustered mode. Search data is replicated from the primary to replicas using GitHub’s own robust synchronization mechanisms. This separation ensures that a replica going down for maintenance does not affect the primary’s ability to serve writes or handle search queries. The primary remains in full control, and replicas can be brought up and down independently. This architecture provides the high availability that administrators need without the fragility of cross-server clustering.
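A standalone-per-node setup of this kind corresponds roughly to the following `elasticsearch.yml` fragment. Again, this is an illustrative sketch under stated assumptions; the names are hypothetical and GitHub has not published its exact configuration. `discovery.type: single-node` is the standard Elasticsearch setting that disables cluster formation entirely.

```yaml
# Hypothetical sketch of the new topology: each appliance runs its own
# standalone Elasticsearch instance; nothing in the search database
# spans nodes, so no cross-node shard placement can occur.
cluster.name: ghes-search-local      # hypothetical name
node.name: ghes-node                 # hypothetical name
discovery.type: single-node          # never joins or forms a cluster
network.host: 127.0.0.1              # only the local appliance talks to it
```

Because each instance knows nothing about the others, taking a replica offline is invisible to the primary's Elasticsearch; keeping the replica's index up to date becomes the job of the application-level synchronization described above.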
Conclusion
GitHub’s investment in search durability pays off by reducing unplanned downtime and simplifying operations. Administrators no longer need to treat index maintenance as a high-risk activity. The new architecture allows teams to focus on their core work, confident that the search functionality behind GitHub Enterprise Server will remain available even during maintenance actions.