A preview of Hivemind Improvements

One area of Hive code where BlockTrades has spent a lot of time recently is optimization of hivemind. Hivemind servers supply most of the data used by web sites such as hive.blog and peakd.com. So optimizing the performance of Hivemind helps speed up almost all frontends for Hive, and also allows the network to operate reliably as we get more users.

Data flow in the Hive network

[Image: Hive construct.png — diagram of data flow in the Hive network]

In the diagram above, hived creates the Hive blockchain from user transaction data. While hived does serve some data directly, the majority of the data it produces is sent to a hivemind server.

Why was Hivemind created?

User interfaces (e.g. websites and wallets) get most of their data from hivemind instead of hived. Hivemind is designed to make it easy to access blockchain data in many different ways.
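To make this concrete, a frontend typically fetches data from a hivemind node with a JSON-RPC call over HTTP. Below is a minimal sketch of such a request in Python, assuming the public api.hive.blog endpoint and the condenser_api.get_discussions_by_trending method (real frontends use client libraries and handle errors and pagination):

```python
import json
import urllib.request

# A minimal JSON-RPC 2.0 request to a hivemind-backed API node.
# The endpoint and method here are illustrative; any Hive API node
# exposes the same JSON-RPC interface.
request = {
    "jsonrpc": "2.0",
    "method": "condenser_api.get_discussions_by_trending",
    "params": [{"tag": "hive", "limit": 5}],
    "id": 1,
}

req = urllib.request.Request(
    "https://api.hive.blog",
    data=json.dumps(request).encode(),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as response:
    posts = json.loads(response.read())["result"]

for post in posts:
    print(post["author"], "/", post["permlink"])
```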

Originally, user interfaces got all their blockchain data directly from hived nodes. But this caused several problems: 1) many programmers were not skilled in the software language (C++) that hived is written in, making it more difficult to add new features to the Hive network, 2) changes to that code could result in bugs that broke the blockchain servers themselves, and 3) having the hive network nodes serve up this data directly to many users put unnecessary loading on the most critical software component in the network.

Hivemind was created to solve the aforementioned problems. Hivemind is written using technologies that are familiar to many programmers (Python and SQL). If a hivemind server fails due to a bug (this has happened three times already in the past few months), the blockchain continues to run smoothly. And finally, Hivemind servers run in a separate process (and sometimes on completely separate servers) from hived, tremendously reducing the data access load on hived.

Making Hivemind better

Below is a short summary of the work we’ve been doing to improve Hivemind in the past couple of months:

Making Hivemind faster

Hive-based web sites make calls to Hivemind to get lists of posts, lists of voters, etc. If one of these calls takes several seconds to complete, the user has to sit there waiting for the data to be delivered. On hive.blog, for example, this delay manifests as a “spinning wheel” to let the user know that the data is coming, but hasn’t arrived yet.

Every web user is familiar with delays of this type, but they really make web sites less fun to use, and if the delays get too long, users will get frustrated and leave a site.

To reduce these annoying slowdowns, one of the main goals of this round of hivemind improvements was to make hivemind much faster. For example, in benchmark measurements of the production hivemind installation at api.hive.blog, some API calls previously took as long as 6-30 seconds to complete (that wheel could occasionally spin for a long time). Now, with our new optimizations running on that same server, the slowest hivemind call in our benchmarks completes in less than 0.1 seconds on average. Overall, we reduced average API call execution time by a factor of 10 or more, and for some of the most time-consuming calls, where we spent extra optimization effort, we increased speed by a factor of 20 or more.
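For readers curious how numbers like these are gathered, a latency benchmark is conceptually just a timed loop of API calls. The snippet below is an illustrative sketch (not our actual benchmark harness), assuming the api.hive.blog endpoint and one representative API method:

```python
import json
import time
import urllib.request

def time_call(url, method, params, runs=20):
    """Return the average wall-clock latency of a JSON-RPC call."""
    payload = json.dumps({
        "jsonrpc": "2.0", "method": method, "params": params, "id": 1,
    }).encode()
    total = 0.0
    for _ in range(runs):
        req = urllib.request.Request(
            url, data=payload,
            headers={"Content-Type": "application/json"},
        )
        start = time.perf_counter()
        with urllib.request.urlopen(req) as response:
            response.read()  # include transfer time in the measurement
        total += time.perf_counter() - start
    return total / runs

avg = time_call(
    "https://api.hive.blog",
    "condenser_api.get_discussions_by_trending",
    [{"tag": "hive", "limit": 20}],
)
print(f"average latency: {avg:.3f}s")
```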

We’ll be publishing the full benchmarks in a separate post, as they may be useful to frontend developers interested in the details of the updated performance of individual API calls.

Improvements to Hivemind API

We’ve modified Hivemind’s API to allow the creation of decentralized blacklists. I’ll be posting the technical details of the new API call in the next few days, so that existing web sites can take advantage of this new feature.

Migrated more functionality from Hived to Hivemind

We were able to move a lot of the comment data that was previously stored as state information (stored in memory) from hived to hivemind. This dramatically reduced the RAM requirements for operating a full Hive API node, and it also sped up the response time of API calls for this data.

Most recently, we made a similar change to move inactive voting data from hived to hivemind, once again dramatically reducing RAM requirements for hived.

Note that in both of the above cases, there was no corresponding increase in RAM used by hivemind from this re-partitioning of functionality, because hivemind stores its information on disk in a database.
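Because hivemind’s data lives in an ordinary Postgres database on disk rather than in hived’s memory, it can be inspected with plain SQL. The sketch below uses psycopg2; the connection string is a placeholder, and the hive_posts table and column names are assumptions about hivemind’s schema, which can vary between versions:

```python
import psycopg2

# Placeholder connection string: point this at your hivemind database.
conn = psycopg2.connect("dbname=hivemind user=hive host=localhost")

with conn, conn.cursor() as cur:
    # Table and column names are assumptions about hivemind's schema.
    cur.execute(
        """
        SELECT author, permlink, created_at
        FROM hive_posts
        ORDER BY created_at DESC
        LIMIT %s
        """,
        (10,),
    )
    for author, permlink, created_at in cur.fetchall():
        print(created_at, author, permlink)
```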

As a result of the reduced memory requirements, an entire high-performance full API node can now be run on a single 64GB server, cutting costs by 50% or more (cloud service providers charge a significant premium for high-memory servers, and while it was previously possible to split a full API node across two lower-memory servers, that increased maintenance costs).

Also, because of the speedups in hivemind’s API performance, that same lower-cost server can now serve over 10 times the traffic that a single hivemind server could previously support. So in practice, between the reduced memory requirements and the increased ability to serve API traffic, the overall cost of running Hive servers has been reduced by a factor of 20 or more.

To highlight the importance of this change, my recollection is that Steemit Inc was at one point spending over $90K per month to pay for leased servers on AWS (Amazon’s cloud platform).

Faster hivemind sync times

When someone wants to set up a new hivemind API node, they first have to fill it with data from the blockchain. This process is known as syncing, because you are synchronizing the data in the hivemind database with the historical data already stored in the blockchain’s blocks.

Previously, syncing a hivemind database was a multi-day process, and that time only increased as we migrated more data from hived to hivemind, so we've also been working on increasing the speed at which hivemind syncs. We don't have final numbers yet, but we've already sped up syncing by at least a factor of 3 (despite the increased amount of data being synced).
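Conceptually, syncing walks the chain block by block: each block is fetched from hived and its operations are written into hivemind’s database. The loop below is a highly simplified sketch of that idea; block_api.get_block is a standard hived API call, while store_block is a hypothetical placeholder for hivemind’s actual processing:

```python
import json
import urllib.request

def get_block(num, url="https://api.hive.blog"):
    """Fetch one block from a hived node via block_api.get_block."""
    payload = json.dumps({
        "jsonrpc": "2.0",
        "method": "block_api.get_block",
        "params": {"block_num": num},
        "id": 1,
    }).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as response:
        return json.loads(response.read())["result"].get("block")

def store_block(block):
    # Hypothetical placeholder: real hivemind parses the block's
    # transactions and inserts rows into its SQL tables.
    pass

# Simplified sync loop over the first thousand blocks; a real
# implementation would fetch and write blocks in large batches.
for num in range(1, 1001):
    block = get_block(num)
    if block:
        store_block(block)
```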

Making Hivemind easier to improve

Added database migration software

Hivemind stores its data in a SQL database. As we improve hivemind, it’s often necessary to change the way data is organized into tables in its database (the database “schema”). By tracking changes in the way this data is stored, we can sometimes upgrade existing installations of hivemind without requiring the database tables to be reorganized and refilled with data from scratch.

We added support for Alembic as a means of tracking and automatically making changes to a hivemind installation’s database when hivemind is upgraded. Alembic also supports rollbacks, allowing a database to be downgraded to a previous version, if a new version of Hivemind has problems that require reverting to a previous version of the Hivemind software.
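For illustration, here’s what a small Alembic migration script looks like; the revision identifiers, table, and column below are made-up examples rather than hivemind’s actual schema:

```python
"""Example migration: add a flag column to a table (illustrative only)."""
from alembic import op
import sqlalchemy as sa

# Revision identifiers Alembic uses to order migrations.
revision = "a1b2c3d4e5f6"
down_revision = "f6e5d4c3b2a1"

def upgrade():
    # Applied by: alembic upgrade head
    op.add_column(
        "hive_posts",
        sa.Column("is_muted", sa.Boolean(), nullable=False,
                  server_default=sa.text("false")),
    )

def downgrade():
    # Applied by: alembic downgrade -1 (the rollback support noted above)
    op.drop_column("hive_posts", "is_muted")
```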

Refactored Hivemind code base

A lot of source code in Hivemind was repeated in several places, with only small changes in each place, so we spent some time doing code refactoring. Code refactoring is the process of identifying repeated blocks of code and consolidating that code into a single place. This makes future improvements easier, because when code is repeated in many places, it’s easy to forget to make a change in each place (which results in software bugs).
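As a generic illustration of the idea (this is not hivemind’s actual code, and the db interface here is hypothetical), the sketch below consolidates two near-duplicate query functions into one parameterized helper:

```python
# Before: two near-duplicate functions, easy to update inconsistently.
def get_trending_posts(db, limit):
    return db.query(
        "SELECT author, permlink FROM hive_posts "
        "ORDER BY sc_trend DESC LIMIT %s", (limit,))

def get_hot_posts(db, limit):
    return db.query(
        "SELECT author, permlink FROM hive_posts "
        "ORDER BY sc_hot DESC LIMIT %s", (limit,))

# After: one helper, with the varying sort column as a parameter.
SORT_COLUMNS = {"trending": "sc_trend", "hot": "sc_hot"}

def get_ranked_posts(db, sort, limit):
    column = SORT_COLUMNS[sort]  # whitelist keeps the query injection-safe
    return db.query(
        "SELECT author, permlink FROM hive_posts "
        f"ORDER BY {column} DESC LIMIT %s", (limit,))
```

With the consolidated version, a fix (or a new sort order) only needs to be made in one place.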

Looking ahead

Now that we’ve moved more data over to hivemind, it becomes easier to add new API methods to look at that data in different ways. And since those API methods are written using Python and SQL, and the changes are isolated from the core blockchain code and can't impact the financial security of the network, it really expands the pool of available developers who can safely create custom API methods for Hive.

This ease of creating new API methods, combined with the ability to create “custom_json” transactions, opens a world of possibilities for Hive-based applications (as an example, Splinterlands gameplay is implemented using custom_json transactions). And the increased performance of hivemind means it will be easier than ever to create apps that scale to large numbers of users. I plan to write a followup post soon that explores some of these possibilities in more depth, as I think this capability will play a central role in the future of Hive.
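For reference, a custom_json operation is simply a signed transaction carrying an application-defined id and an arbitrary JSON payload; the blockchain records it, and applications interpret it. The sketch below builds one in Python with a made-up id and payload; actually broadcasting it would require a signing library such as beem:

```python
import json

# The "id" names the application protocol and the "json" field carries
# a payload that the application defines; both values here are made-up.
payload = {"action": "play_card", "card_id": 42}

custom_json_op = [
    "custom_json",
    {
        "required_auths": [],                 # active-key signers (none here)
        "required_posting_auths": ["alice"],  # posting-key signers
        "id": "my_game",                      # application identifier
        "json": json.dumps(payload),          # payload must be a JSON string
    },
]

print(json.dumps(custom_json_op, indent=2))
```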
