Witness Update: 15 Hours In Front of a Computer in Puerto Rico

Yesterday, September 17th, 2018, was our last full day in Puerto Rico, and unfortunately it wasn't spent enjoying the beach or visiting some cool places (I'll be posting about our adventures in the future). Instead, I was glued to my computer for about 15 hours straight.

As you may have noticed, the Steem blockchain had some issues yesterday. To explain a little of what happened, I'll first explain what witnesses do (you can read more about that here). Witnesses on the Steem blockchain are block producers elected by the token holders to run the network. Steem uses a consensus mechanism (fancy way of saying "process we all agree to") known as Delegated Proof of Stake or DPoS which means the top 20 witnesses by vote (along with a 21st witness rotated in from all the backups) determine what the Steem blockchain actually is. If 2/3+1 of them agree, the software and protocols which define the blockchain can be changed.

Yesterday there were mainly three different versions of the software running. Version 0.19.6 uses the code prior to AppBase which was announced 7 months ago. Version 0.19.12 uses AppBase. Version 0.20.0 is the recently announced Hardfork 20 code which has been in development for quite some time and is running on a testnet. The plan is (was?) to launch Hardfork 20 (HF20) on September 25th. A blockchain "Hardfork" is when new code is released as part of consensus that is not compatible with the old code which means everyone who wants to participate in the network (applications, exchanges, websites, etc) is required to upgrade.

Hardforks are scary things. Most blockchains avoid them, and rightfully so. At the same time, most blockchains quickly become archaic compared to their new competitors in terms of functionality, speed, usability, and more. DPoS allows Steem to innovate quickly and get consensus from the network for upgrades. This is why Hardfork 20 is called "20." This isn't our first rodeo.

Prior to the hardfork launch date, all network participants are encouraged to test and upgrade their systems. The code has checks like this to ensure code that only works with Hardfork 20 doesn't run before everyone is ready for it:

if( has_hardfork( STEEM_HARDFORK_0_20 ) )

and this:

if( a.voting_manabar.last_update_time <= STEEM_HARDFORK_0_20_TIME )

This is important because if a consensus-breaking change happens it can cause a fork which means two different versions of the Steem blockchain exist simultaneously. A fork is bad. You can read all about it here. Forks happen. I remember watching one on the Bitcoin network in real time in March of 2013. Recovering from them is very, very tricky and involves shutting down one version of the chain and moving forward with the other while working through any specific safeguards built into the system to avoid old forks being considered valid (such as Steem's last irreversible block number).
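To make the last-irreversible-block safeguard a little more concrete, here is a minimal sketch of the idea. It is not steemd's fork database: `ForkCandidate` and `can_switch_to` are hypothetical names, and the rule is simplified to "prefer the longer chain, but never switch to a fork that would undo a block already treated as irreversible."

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified view of a competing fork as seen by one node.
struct ForkCandidate {
    std::uint32_t branch_point;  // height of the last block shared with our chain
    std::uint32_t head;          // height of the competing fork's newest block
};

// Returns true if the node may switch to the candidate fork.
// Assumption: every block at or below last_irreversible is final for this node.
bool can_switch_to(const ForkCandidate& fork,
                   std::uint32_t our_head,
                   std::uint32_t last_irreversible) {
    assert(last_irreversible <= our_head);

    // Switching undoes every block above the branch point. If the branch
    // point sits below the irreversible height, the switch would undo a
    // final block, so the fork is rejected no matter how long it is.
    if (fork.branch_point < last_irreversible) {
        return false;
    }

    // Otherwise apply a plain longest-chain preference (an assumption here).
    return fork.head > our_head;
}
```

With our head at block 110 and block 95 irreversible, a fork that branched at 100 and reached 120 is acceptable, while a fork that branched at 90 is refused even if it is far longer. That refusal is part of what makes moving everyone back onto one chain so tricky.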

So What Happened?

Now that we've set the groundwork for how this stuff works, I'll try to explain (to the best of my ability) what I think happened yesterday. Please understand, I'm not a blockchain core developer so some of this may be incorrect. Keep an eye on the steemitblog account for further details.

One of the features of Hardfork 20 is the Upvote Lockout Period:

In Hardfork 17, a change was implemented to prevent upvote abuse by creating a twelve-hour lock-out period at the end of a post’s payout period. During this time, users are no longer allowed to upvote the post.

You can read about how Hardfork 20 makes some really good changes there to prevent downvote abuse in the last 12 hours. As someone who experienced that for months, I personally really appreciate this change. Based on this commit, my hunch is the change here caused a problem around block number 26037589 when one version of the code tried to allow a late upvote and the other version did not:

10 assert_exception: Assert Exception
_db.head_block_time() < comment.cashout_time - STEEM_UPVOTE_LOCKOUT_HF17: Cannot increase payout within last twelve hours before payout.
    {}
    steem_evaluator.cpp:1383 do_apply
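A rough sketch of how a missing activation check could make two versions disagree about the same late upvote follows. Everything in it is illustrative: `Node`, `Rules`, `accepts_upvote`, and `node_accepts` are hypothetical names, the twelve-hour constant comes from the Hardfork 17 description quoted above, and the "new" rule is simply assumed to accept late upvotes. It does not model what the Hardfork 20 code actually does.

```cpp
#include <cassert>
#include <cstdint>

// Twelve-hour lockout, per the Hardfork 17 description quoted above.
constexpr std::int64_t kLockoutSeconds = 12 * 60 * 60;

enum class Rules { PreHF20, HF20 };

// Pre-HF20: reject upvotes inside the lockout window before cashout.
// HF20: ASSUMED here to accept them up to cashout (illustration only).
bool accepts_upvote(std::int64_t now, std::int64_t cashout_time, Rules rules) {
    if (rules == Rules::PreHF20) {
        return now < cashout_time - kLockoutSeconds;
    }
    return now < cashout_time;
}

// Hypothetical description of one node's software.
struct Node {
    bool has_hf20_code;      // running a binary that contains the new rule
    bool gates_on_hardfork;  // checks hardfork activation before using it
};

// Which rule a node applies depends on its code AND on whether it waits
// for the hardfork to activate before switching behavior.
bool node_accepts(const Node& node, bool hf20_activated,
                  std::int64_t now, std::int64_t cashout_time) {
    const bool use_new_rule =
        node.has_hf20_code && (!node.gates_on_hardfork || hf20_activated);
    return accepts_upvote(now, cashout_time,
                          use_new_rule ? Rules::HF20 : Rules::PreHF20);
}
```

For an upvote cast six hours before cashout while the hardfork is not yet active, an old node and a correctly gated new node both reject it, but a new node that skips the gate accepts it. A block containing that vote is then valid on one version and invalid on the other, and the two chains part ways.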

That mismatch caused a fork as v20 and v19 diverged. I'm not sure, but I wonder if STEEM_UPVOTE_LOCKOUT_SECONDS needed to be paired with another check to see whether HF20 was activated yet. Once the problem was discovered, a larger and more complicated problem came up: how to get the v19 fork active again. Most of the witnesses, in preparation for the September 25th HF20 switchover, had already upgraded to v0.20.0, but most exchanges were probably on v0.19.12 or older. The number one priority of a blockchain is protecting your funds and, as witnesses, we didn't want to move ahead with v0.20.0 and force everyone to upgrade because of a bug. Hardfork changes must happen based on consensus, which requires users to vote for witnesses who will support the changes they want. This means we had to fix v19 and get that chain running again. The Steem blockchain is consistently one of the most active blockchains on the planet, averaging more than a million transactions a week.

Because of that volume, the v19 fork got out of sync enough to make it quite difficult to activate again. Modifications to the fork database size were needed. Not only did we have to shut down v20 nodes, we had to update v19 nodes with some code changes and ensure the blockchain peers could talk to each other well without getting flooded with invalid blocks from the v20 chain. This involved coordinating one witness to restart their server with enable-stale-production = true (something you otherwise want to avoid) and required-participation = 0, which lets the blockchain move forward even when fewer witnesses are participating than the normal minimum needed to produce blocks (33%). We also had to configure a checkpoint, which ensures our nodes will follow the correct fork.
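To show why those two settings matter when restarting a stalled chain, here is a small sketch of the decision a block producer makes before signing a block. The option names mirror the config settings mentioned above, but `ProducerConfig`, `may_produce`, and the way participation is counted are assumptions for illustration, not steemd's actual witness plugin code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mirror of the two config.ini settings discussed above.
struct ProducerConfig {
    bool enable_stale_production;          // enable-stale-production
    std::uint32_t required_participation;  // required-participation, in percent
};

// Decide whether this node may produce a block right now.
// slots_filled / slots_total: how many recent block slots actually
// received a block (an assumed way of measuring participation).
bool may_produce(const ProducerConfig& cfg, bool chain_is_synced,
                 std::uint32_t slots_filled, std::uint32_t slots_total) {
    assert(slots_total > 0 && slots_filled <= slots_total);

    // A node whose chain looks stale normally refuses to produce,
    // unless stale production has been explicitly enabled.
    if (!chain_is_synced && !cfg.enable_stale_production) {
        return false;
    }

    // Refuse to produce when too few witnesses have been filling slots.
    const std::uint32_t participation = slots_filled * 100 / slots_total;
    return participation >= cfg.required_participation;
}
```

With the normal settings (stale production off, 33% required), a node on a dead chain where nobody has produced recently will never restart it. With stale production on and the requirement set to 0, a single witness can produce the first block so the others have something to sync to, which is also why you don't leave those settings on afterward.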

Many of the top witnesses (myself included) have backup servers and in situations like this are careful to ensure their backup servers are still running the old version (v19) in case there's a problem with the new code. Unfortunately for many of us, rebuilding the code, adjusting config.ini settings, and restarting our nodes resulted in a forced, unplanned full replay of the blockchain which takes many hours. My nodes took 4.39 hours to replay and sync up. Witnesses running v20 had to download, uncompress, and replay the entire v19 block_log to get running again as well.

All told, it was a 15-hour day in front of my computer. Many witnesses went without sleep to help contribute to this process and provide logs to Steemit, Inc. and community developers (I'll let the heroes who contributed tell their own stories). Communication and coordination happened on Steem.chat and Slack among top witnesses during the entire time. Though I've advocated for more real-time transparency in the past, in situations like this, it's helpful to keep the noise down and avoid too many cooks in the kitchen. Even with only about 50 people in a chat room, it's challenging to sift through all the different code patch proposals and ideas circulating about the best path forward. It's also important to ensure information shared is kept private so the chain can't be easily attacked while it's in a vulnerable state.

Securing a distributed consensus blockchain with on-chain governance is really hard. Some who are, for lack of a better word, ignorant of the work involved like to gain attention by criticizing what they don't understand and finding fault in things they didn't build. Every time a complicated software project has a bug, they'll ask why more testing wasn't done or how this could have happened. In my 20+ years of software development experience, I've learned that some bugs just happen. They make up the perfect storm of untestable, edge-case scenarios. Could this have been avoided? Maybe, but it would have required a specific set of circumstances on the testnet, including multiple versions running simultaneously with an upvote in the last 12 hours as consensus flips from v19 to v20. Given all the possible permutations of how the Steem blockchain can be used with tens of thousands of active accounts in any given month, testing for all scenarios is currently impossible.

In situations like this, all you can do is the best you can do. Recover quickly, recover securely, protect user funds, learn from any mistakes made, and improve for the future. I think the top witnesses are doing exactly that. We all worked together like I've never seen before and learned some new things along the way. I won't include the details here because in moments like this, privacy is important to ensure bad actors don't exploit forks or code changes for their own gain at the expense of the network. If you want to know more, ask the witnesses you support how they were involved.

Conclusion

Though the process was a bit rough and took longer than any of us would have liked, I'm proud of how the team came together with a good combination of patience, exploration, and support to get this chain going again. I don't know what this means for Hardfork 20's release date or if this is the only obscure bug lurking in that code. I do know it was a frustrating way to spend my last full day in Puerto Rico (the beach would have been so much nicer!), but I would do it again because I love this blockchain and the community that supports it.

Please stay tuned for more official announcements from Steemit, Inc. and other witnesses. If I've misrepresented something here, please comment below as a correction, and I'll get to it when I can (we're flying back to Nashville today). Both my primary and backup witness nodes are running on v0.19.12 again and my seed node should finish replaying soon. As always, if you have questions, please feel free to ask them, and I'll do my best to answer them or direct you to someone who can.


Luke Stokes is a father, husband, programmer, STEEM witness, DAC launcher, and voluntaryist who wants to help create a world we all want to live in. Learn about cryptocurrency at UnderstandingBlockchainFreedom.com

I'm a Witness! Please vote for @lukestokes.mhth
