While performing routine maintenance involving a short, temporary shutdown of a redundant power circuit on our core network rack, two different servers hosting the Private network routing system (known as route reflectors) went down at the same time. Both servers feature redundant power supplies connected to redundant power circuits.
The impact was the Private network feature being unavailable. Many services such as Compute, SOS, SKS, NLB, and DBaaS rely on Private networks and were also impacted during the outage.
At the same time, we experienced a host kernel crash on a single hypervisor, which impacted a small number of customer instances.
All times are UTC unless specified otherwise.
At around 09h30 the servers running the Private network route reflection went down and the Private network became unavailable for the whole zone.
At around 09h37 a single hypervisor host experienced a kernel crash, taking down a small number of customer instances along with the host.
Between 09h48 and 09h55 the Private network recovered as the service was brought back up by restarting the impacted hosts. By 09h55 most of the Private network functionality was back up.
At 10h17 the crashed hypervisor was brought back online along with the instances that were hosted on it.
Private networks rely on a central system to exchange routes across the hypervisor hosts in a zone. Such a system is commonly known as a route reflector. Two different servers run this system to provide a redundant service; the loss of a single server would not impact the service. Both servers are hosted in our dedicated core network infrastructure rack within the datacenter. These servers feature redundant power supplies, disk storage, and ECC memory to sustain the loss of any single one of those components.
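To illustrate the concept, a BGP route reflector of this kind is often set up with a configuration along the lines of the following FRR-style sketch. The AS number, cluster ID, and addresses are hypothetical, and this post does not describe Exoscale's actual implementation:

```
! Minimal route reflector sketch (hypothetical values)
router bgp 65000
 bgp cluster-id 10.0.0.254
 ! Each hypervisor host peers with the reflector as an iBGP client;
 ! the reflector re-advertises routes learned from one client to the others.
 neighbor 10.0.1.10 remote-as 65000
 neighbor 10.0.1.10 route-reflector-client
 neighbor 10.0.1.11 remote-as 65000
 neighbor 10.0.1.11 route-reflector-client
```

In such a design, each host typically peers with both reflectors, so losing one reflector is harmless; but if both go down at once, as happened here, no new routes can be exchanged across the zone.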
During the morning of 17th January 2023, routine maintenance was scheduled to migrate one of the redundant power circuits to a new PDU (Power Distribution Unit) outlet. This operation was considered low risk, as everything within this core network rack is redundant and should tolerate a short outage of one of the power circuits. The same maintenance had previously been performed in the other zones without any trouble.
Unfortunately, shortly after the maintenance began, both servers unexpectedly went down at the same time and experienced a power cycle. The result was a total loss of Private network functionality until one of the servers was brought back up. Our investigations confirmed that:
- The hosts' power supplies were in a nominal state prior to the maintenance
- No power supplies were reported defective before, during, or after the maintenance
- All power supplies were correctly connected to their respective redundant (different) power circuits
- There was no exception in the execution of the maintenance procedure
At this stage we do not know the exact root cause of why these power supplies failed to survive a single power circuit outage and resulted in a power cycle. Additional, controlled testing will take place to track down the root cause.
At the same time, one of our hypervisors running customer instances experienced a kernel crash, resulting in a small number of instances being unavailable. This host resides in a distinct rack which was not impacted by the physical maintenance. Our investigations tracked the issue down to one of the running software daemons, which caused a page fault, resulting in a kernel panic. We suspect that the race condition is indirectly related to the Private network event. This daemon runs on every Exoscale host, but only this single server experienced the issue.
Lessons learned and improvements
Despite all the measures taken, a routine operation resulted in a major outage.
To prevent a similar scenario from happening again, we will review our operational and risk qualification procedures to take into account less likely failure scenarios like this one. We will also take the opportunity to review parts of our core infrastructure design in order to increase the resilience of our core systems.
We are sorry for the inconvenience this outage has caused.
Should you have any questions feel free to get in touch with our support.
The Exoscale Team