CVJointOps outage - Post Mortem

Background:
On Sep 30, 2022, at epoch 125067, the CVJointOps node went offline and stopped attesting as part of the sync committee.

Investigation:
After a Discord notification on Oct 15, 2022 that the CVJointOps node was offline, an investigation began into what had stopped the node from attesting. Logging into the node showed no issues with the machine itself: it was online and operating normally. The client logs showed that the EL client (Besu) had failed and was no longer keeping up with the chain. The failure of the EL client caused the CL client (Teku) and the SSV container to go down and stop attesting.

Root Cause:
The EL client software was updated many times post-merge to improve EL operation. The most recent of these updates had not been applied to the node, which is believed to have caused the failure.

Resolution:
All clients were updated to their latest releases, and the EL and CL databases were removed to rule out data corruption. Once the databases had re-synced, the SSV container resumed attesting properly. By Oct 17, 2022, the node was fully operational.

Prevent Recurrence:
The operator did not know the node had gone down because alerting had failed. New alerts have been set up to notify the operator when the EL client, CL client, or SSV container stops responding.
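
As an illustration only, a liveness check of this kind could look like the sketch below. It is not the alerting that was actually deployed. It assumes Besu's JSON-RPC is enabled on its default port 8545 and Teku's REST API is enabled on its default port 5051; the SSV health URL, the function names, and the notify callback are hypothetical placeholders to be replaced with whatever the operator really uses.

```python
import json
import urllib.error
import urllib.request
from typing import Callable, Dict, List


def el_responding(rpc_url: str, timeout: float = 5.0) -> bool:
    """True if the EL client answers a standard eth_syncing JSON-RPC call."""
    payload = json.dumps(
        {"jsonrpc": "2.0", "method": "eth_syncing", "params": [], "id": 1}
    ).encode()
    req = urllib.request.Request(
        rpc_url, data=payload, headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return "result" in json.load(resp)
    except (OSError, ValueError):  # unreachable, timed out, or bad JSON
        return False


def http_responding(url: str, timeout: float = 5.0) -> bool:
    """True if a GET on the URL returns any 2xx status.

    For the Beacon API health endpoint, 200 means ready and 206 means syncing.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:  # covers URLError, HTTPError, and timeouts
        return False


def find_unresponsive(probes: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run each probe and return the names of services that are down."""
    down = []
    for name, probe in probes.items():
        try:
            ok = probe()
        except Exception:  # a crashing probe counts as a failed service
            ok = False
        if not ok:
            down.append(name)
    return down


def alert_if_down(probes: Dict[str, Callable[[], bool]],
                  notify: Callable[[str], None] = print) -> List[str]:
    """Push one alert listing every unresponsive service, if there are any."""
    down = find_unresponsive(probes)
    if down:
        notify("Not responding: " + ", ".join(down))
    return down


if __name__ == "__main__":
    alert_if_down({
        "EL (Besu)": lambda: el_responding("http://127.0.0.1:8545"),
        "CL (Teku)": lambda: http_responding(
            "http://127.0.0.1:5051/eth/v1/node/health"),
        # Hypothetical URL: substitute the SSV container's real health check.
        "SSV": lambda: http_responding("http://127.0.0.1:16000/health"),
    })
```

Run from cron or a systemd timer, with notify swapped for a function that posts to the operator's alert channel, a check like this would flag an EL failure within minutes. Checking each service separately also shows which component failed first.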

Thank you for the update