Incident Report

Saturday, Jan 12, 2019 by GoodBlock

GoodBlock block producer node missed blocks January 12, 2019

By Douglas Horn

Today at 17:02 UTC, the GoodBlock block producer node was kicked from the Telos mainnet after it had failed to produce 15% of the blocks in the current BP rotation schedule. This followed approximately 140 minutes of failed production.

The failure stemmed from an automatic operating system security update that triggered a restart of the network stack. The restart corrupted the block producer's routing table, which cut the node off from the network and left it unresponsive. The issue was easily corrected once discovered. It was not discovered sooner because our network monitors were offline for service upgrades and migrations related to the expansion of our IPFS development testbed. The temporary monitors in use at the time failed to detect the situation before our block producer node was kicked.

Following the mandatory two-hour kick period, GoodBlock re-registered its block producer node and began producing blocks again at 19:40 UTC. GoodBlock is now deploying a decentralized monitoring cluster to resolve all monitoring issues. Completion of the current phase of the IPFS expansion is expected by January 16th. We do not expect this issue to recur.
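For readers who want to line up the times reported above, the following is a rough sketch of the arithmetic only. It assumes the two-hour kick period starts at the moment of the kick and treats the "approximately 140 minutes" figure as exact, so the derived times are approximate; the variable names are illustrative and not part of any Telos tooling.

```python
from datetime import datetime, timedelta, timezone

# Times stated in this report (UTC).
kicked_at = datetime(2019, 1, 12, 17, 2, tzinfo=timezone.utc)
resumed_at = datetime(2019, 1, 12, 19, 40, tzinfo=timezone.utc)

# Assumption: the ~140 minutes of failed production ended at the kick.
failure_started = kicked_at - timedelta(minutes=140)

# Assumption: the mandatory two-hour kick period is counted from the kick.
earliest_reregistration = kicked_at + timedelta(hours=2)

# Approximate total time without block production.
total_gap = resumed_at - failure_started

print("Approximate start of failed production:", failure_started.strftime("%H:%M UTC"))
print("Earliest possible re-registration:", earliest_reregistration.strftime("%H:%M UTC"))
print("Approximate minutes without production:", int(total_gap.total_seconds() // 60))
```

Under these assumptions, failed production began at roughly 14:42 UTC and the kick period ended at 19:02 UTC, about 38 minutes before production resumed.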

Commentary

GoodBlock strives to consistently provide the highest levels of block production node services. Our node is regularly within the top five fastest on the Telos network, consistently running at about 1.2 ms on the Aloha EOS benchmark tests. GoodBlock also has a highly accomplished and respected system administration staff. This incident was not a failure of ability or infrastructure; it was a mistake in priorities. In addition to our block producer duties, GoodBlock is also leading Telos development, including IPFS development as a general feature for Telos. This is a high priority item due to its ability to draw new apps to develop on Telos. We have been in the process of a system expansion to allow more nodes and much more storage in order to increase our IPFS development and testing capacity. I allowed this priority to overtake our block producing responsibilities, which, combined with an unlikely cascade of failures, led to our missed blocks.

As an organization that aims to be a leader in Telos, this is a bit of an embarrassment. However, there are some really good takeaways from this experience. First, we see in the live mainnet environment what we have frequently tested on our testnet and stagenet, which is that when a block producer misses 15% of their blocks, they are kicked from their duties so that a fully functioning BP can step in. This happened exactly as planned. Second, as much as I wish that GoodBlock had not blazed this particular trail, it serves as a reminder that the Telos BP rules apply equally to everyone on the network. There are no special privileges — if you miss blocks you get kicked so the network remains resilient.
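The kick rule described above can be illustrated with a minimal sketch. This is not the Telos implementation: the function name, its parameters, and the choice to treat exactly 15% as a kick are assumptions made for illustration, based only on the 15% figure stated in this report.

```python
def should_kick(missed_blocks: int, scheduled_blocks: int, threshold_percent: int = 15) -> bool:
    """Return True if a producer has missed at least threshold_percent of its scheduled blocks.

    Hypothetical helper: the 15% default comes from this report; whether the
    real rule uses ">=" or ">" at the boundary is an assumption here.
    """
    if scheduled_blocks <= 0:
        # No blocks scheduled in this rotation, so nothing can have been missed.
        return False
    # Integer comparison avoids floating-point rounding at the boundary.
    return missed_blocks * 100 >= scheduled_blocks * threshold_percent


# Example: a producer scheduled for 1,000 blocks in a rotation.
print(should_kick(149, 1000))  # below the threshold
print(should_kick(150, 1000))  # at the threshold
```

The same check applies to every producer regardless of who operates it, which is the point made above about the rules applying equally across the network.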

Everyone at GoodBlock values the trust that so many Telos users have placed in us. We offer this incident report in the interest of maximum transparency and in the hope that Telos users will continue to trust us to help operate the network.