Analysis results of the stepping stone Ltd infrastructure failure of 14 June 2022

The failure of stepping stone Ltd's infrastructure on Tuesday 14 June 2022 was caused by a malfunction of the core switches in one of the two data centres.

What was the reason for the outage of the infrastructure?

Two redundant core switches run in each of stepping stone Ltd's two data centres and connect the cloud infrastructure to the Internet via a router. The core switches in one of the two data centres stopped processing network traffic, but their Layer 2 connections to the surrounding systems remained up. The router therefore perceived the connection to the faulty switches as active and continued to accept traffic from the Internet and forward it to them. As a result, the entire infrastructure lost the connection to the Internet.

How did we solve the problem?

Once the Border Gateway Protocol (BGP) session on the router had been manually deactivated, which activated the failover mechanism, the first data centre was accessible from the Internet again. Once the malfunctioning core switches had been restarted, the second data centre was accessible from the Internet as well. The stoney cloud itself worked as planned and we did not experience any data loss due to the failure.

Why did the router failover mechanism not work?

Normally, the router failover mechanism is triggered when a Layer 2 connection becomes inactive. In that case, the router automatically deactivates the BGP session and the second router takes over all traffic coming from the Internet. Because the Layer 2 connections remained active during this incident, the failover mechanism was not triggered.
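
The difference between the usual failover case and this incident can be illustrated with a small sketch. It is a simplified model for explanation only, not the routers' actual configuration or software; the names UplinkState, bgp_session_stays_active, traffic_is_delivered and is_blackholed are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class UplinkState:
    """Hypothetical snapshot of a router's connection to its core switches."""
    layer2_link_up: bool            # link state as seen by the router
    switch_forwards_traffic: bool   # whether the switches actually process traffic


def bgp_session_stays_active(state: UplinkState) -> bool:
    """Link-state-only trigger: the BGP session is deactivated
    only when the Layer 2 connection becomes inactive."""
    return state.layer2_link_up


def traffic_is_delivered(state: UplinkState) -> bool:
    """Traffic reaches the infrastructure only if the link is up
    and the switches behind it are forwarding."""
    return state.layer2_link_up and state.switch_forwards_traffic


def is_blackholed(state: UplinkState) -> bool:
    """The router keeps attracting Internet traffic that is never delivered."""
    return bgp_session_stays_active(state) and not traffic_is_delivered(state)


normal = UplinkState(layer2_link_up=True, switch_forwards_traffic=True)
link_down = UplinkState(layer2_link_up=False, switch_forwards_traffic=False)
incident = UplinkState(layer2_link_up=True, switch_forwards_traffic=False)

for name, state in [("normal", normal), ("link down", link_down), ("incident", incident)]:
    print(f"{name}: BGP session active={bgp_session_stays_active(state)}, "
          f"traffic blackholed={is_blackholed(state)}")
```

In this model, only the "incident" state keeps the BGP session active while traffic is lost, which is why the session had to be deactivated manually.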

What short-term countermeasures have we taken?

Together with the data centre operator, stepping stone Ltd has drawn up an emergency plan which ensures that the standby-duty services of both the data centre operator and stepping stone Ltd are alerted if the connection to the Internet is lost again. In such a case, the BGP session can be shut down directly so that the failover mechanism takes effect earlier. Any further network interruption would thus be much shorter.

What medium-term countermeasures are we taking?

From mid-August, the switching infrastructure of stepping stone Ltd will be modernised and all of the hardware will be replaced. This conversion will provide more redundancy. In addition, the modern Layer 3 protocol BGP Ethernet VPN (EVPN) will be used instead of Layer 2 switching, which makes the detection and handling of faulty connections easier and more reliable. The hardware required for this is already available and initial tests have been carried out successfully. We will communicate a migration plan with the corresponding dates by mid-July.