Progress update on the new Gridcoin BOINC stats collection system (...

I am excited to announce that the new, native C++ integrated statistics collection system was merged into the development branch on February 25 from my integrated_scraper branch. This amounted to over 250 hours of work by me, Tomáš Brada, and Marco Nilsson (@ravonn), starting with a early scraper prototype by Paul Jensen. This exciting work is scheduled for the Denise milestone release and will drastically improve Gridcoin's statistics scalability and collection stability.

I am including the executive summary (with a little rearrangement) of the official documentation I am writing for the scraper which provides a good explanation of the scraper's advantages over what it replaces...

The existing .net Visual Basic based statistics gathering system, referred to as the Gridcoin “Neural Network”, is not a neural network, and it is unfortunate that name was applied. It is actually a rules based system for gathering 3rd party statistics off blockchain, summarizing them, and then providing a mechanism for the nodes in the network to agree on the statistics. The research rewards are then calculated and generated for/by staking wallets that perform research via the distributed computing platform BOINC, confirmed by other nodes in accordance with blockchain protocols (referred to as Proof-of-Research).

This existing system has a number of serious defects and has been in need of replacement for some time. In October 2018 the author began a project which originally had the goal of fully implementing Paul Jensen’s prototype statistics proxy program (“scraper”). (See https://github.com/gridcoin-community/ScraperProxy). As this project progressed, it became apparent that this was more properly scoped as a complete rewrite of the existing “Neural Network” subsystem, and should be written entirely in C++ as part of the core wallet. The scraper has been developed on the integrated_scraper branch in the author’s Github repository fork of Gridcoin and was merged into the development branch of the official Gridcoin repository on February 25, 2019. (See https://github.com/jamescowens/Gridcoin-Research and https://github.com/gridcoin-community/Gridcoin-Research/commit/989665d699fb9753cd2d519c39ed347d4298652f).

The scraper consists of three major parts.

The actual scraper, contained in the src/scraper directory, handles the downloading of stats files from the BOINC projects, and the filtering, compression, and publishing (with hashes and signatures) of the stats files to the network. (“Scraper”)
The scraper networking code, contained in src/scraper_net.*, uses the wallet messaging system in an elegant fashion to automatically distribute the compressed, hashed, and signed stats files to all of the nodes. The author is grateful to Tomas Brada for writing a very elegant approach for this part. (“Scraper Net”)
The interface to the "neural network", contained in the directory src/neuralnet, which interfaces the core wallet to the scraper and together with the functions in main.cpp, provides the core “neural network” functionality. The author is grateful to Marco Nilsson for this contribution. (“NN”)

Some advantages of the new design:

Since a low (single digit) number of scrapers can effectively supply the network with statistics, solves the Gridcoin scalability problem, while drastically reducing the load on the BOINC project servers from the current levels, with near constant low load thereafter regardless of the scale of the Gridcoin network.
Entirely native C++ and cross platform. Allowing core development to use a simplified development platform, easier, more transparent debugging, and a smaller, simpler installation footprint. Both the scraper and normal (non-scraper NN) nodes will operate on all targets for the Gridcoin wallet. Linux (Intel, armhf, arm64), Windows, and Mac. This includes headless operation as a daemon without a GUI, since the GUI is not necessary.
Once the BOINC stats XML files are downloaded, they are filtered and reduced to csv files on the scrapers, which is far more appropriate and efficient for flat data structures like this. The files are gz compressed and uncompressed using boostio gzip compression filters.
Security is designed in from the outset, with multiple scrapers required for cross-verification so that the scrapers do not have to be trusted entities (see the more detailed discussion below).
The total data storage required in the data Scraper subdirectory for 48 hours of stats files for mainnet is on the order of 4 to 5 MB. (This is opposed to something like 2 GB on the existing NN VB code.)
The total data on-disk storage required for normal (non-scraper) nodes is zero. The scraper code has the ability to decompress and process compressed stats objects in the manifests directly in memory without going to disk first. The in memory requirement for the manifests is on the order of about 1 MB per manifest. (And if the manifests have parts in common they are not repeated, because they are referenced.) Also it is worth noting that normal nodes use a large portion of the scraper code for neural network operation. The scraper code itself was designed to be used in different modes by both the scraper nodes and normal nodes.
The total processing time to form the complete statistics map and create a SB contract on a normal node once a convergence is formed takes about 5 seconds for mainnet on an average Intel box. (This reduces the CPU usage of NN enabled nodes drastically from today.) This is for ~1900 active beacons, and ~15000 statistics elements across 23 whitelisted projects as of December 2018. This level of performance should easily support scaling the Gridcoin network by a factor of 10 or more without any trouble, and with NO additional load on the BOINC sites.
Gets rid of a lot of the old, bizarre snap to grid stuff and other oddities in the old NN code.
Has the core structures in place to support conversion to a TCD based approach, although that will require additional development.
Has the ability to support either no team filtering, or filtering by a team whitelist, based on the needs of the network.

The scraper code is already designed to dispense with the Gridcoin BOINC team requirement and instead use a "consensus beacon list" derived from the appcache list of active beacons to filter the stats by CPID. The code supports using a whitelist of teams allowed in addition to the filtering by CPID, if required by network protocol. Due to security concerns around the beacon advertisement and renewal, until CustomMiner’s Public-key cryptographic proof of account ownership PR #2965 is accepted by BOINC and implemented by the projects, the team requirement via a single entry to the team whitelist, “Gridcoin”, will remain in effect.

Security has been designed in from the outset…

The scrapers have two levels of authorization to operate. The first level, controlled by IsScraperAuthorized(), is whether any node can operate as a "scraper", in other words, download the stats files themselves. That does NOT give them the ability to publish those stats to other nodes on the network. The second level, which is the IsScraperAuthorizedToBroadcastManifests() function, is to authorize a particular node to actually be able to publish manifests to other nodes. The second function is intended to override the first, with the first being a network wide policy. So to be clear, if the network wide policy has IsScraperAuthorized() set to false then ONLY nodes that have IsScraperAuthorizedToBroadcastManifests() can download stats at all. If IsScraperAuthorized() is set to true, then you have two levels of operation allowed. Nodes can run -scraper and download stats for themselves. They will only be able to publish manifests if for that node IsScraperAuthorizedToBroadcastManifests() evaluates to true. This allows flexibility in network protocol without having to do a mandatory upgrade.

The networking code will not allow the acceptance or retention of manifests from nodes that are not authorized to publish, and will ratchet up banscore for those unauthorized nodes, quickly extinguishing them from the network. This prevents flooding the network with malicious scraper attacks and prevents the gaining of control of the consensus by a bad actor. Each scraper is intended to be authorized with a specific private/public key combination, that will be injected by signed administrative message of the corresponding address into the appcache. Only those manifests with a public key/signature that match the approved address will be accepted by any node. This security code operates by necessity on both the send and receive sides of the stats (manifest) messages. Normally a node that is not authorized will not attempt to send manifests, but we have to check on the receive side too (similar to the connectblock/acceptblock for blocks) in case someone maliciously modifies a node to send manifests without authorization. Nodes receiving unauthorized manifests by a malicious scraper will automatically delete those manifests and ban the sending node.

We also have to correctly deal with the deauthorization of a previously authorized scraper by the removal of the authorization key (address). In this case the nodes on the network will be in possession of manifests that were previously authorized and now need to be removed. This removal is accomplished during the housekeeping loop to automatically remove existing manifests from scrapers that have been deauthorized.

The scraper system has been designed to operate in a trustless environment. This means the following: a minimum of 2 independent scrapers must be up and publishing manifests for the stats to be accepted. The scraper implements the idea of a "convergence", which is similar to "quorum" that occurs at a later stage in the SB formation. Convergence on the stats means that a minimum of 2 scrapers or the ceiling of 60% of the scrapers that are actively publishing (whichever is greater), agree on the statistics. Ideally, three or more independent scrapers will be operating. (Five is probably the ideal number.) The statistics are downloaded in a loop on the scraper nodes that can be set for a specific start time time before the need of a SB (nominally 4 hours before a SB is due), and will run continuously in a loop with a nominal 5 min sleep between runs until a SB is formed. The scrapers check the Etags of both team and user statistics files and do not redownload files already downloaded. Since the team IDs corresponding to the whitelisted teams assigned by each project server do not change once assigned, the team IDs are stored on disk to eliminate checking the team files entirely for projects where the team IDs have already been determined. There is a sophisticated file download and retention mechanism, with a default retention period of 48 hours. The scrapers can use a password file which contains usernames and passwords to access sites needing authentication for stats downloads. (This is the solution to the problem of BOINC project statistics sites that need authentication for access, such as Einstein@home.)

Each scraper forms manifests from the downloaded set of compressed scraper files. This is a compound object consisting of an inventory map, and pointers to an independent parts collection, where each part is essentially a compressed stats file turned into a binary object (BLOB). Each part, and the manifest as a whole, is hashed using the native (double) SHA256 hashing algorithm, and the manifest is signed by the scraper using a designated key in the wallet (authorized by the network protocol) when published to the network. The networking code AUTOMATICALLY propagates the manifests to all nodes using the normal wallet messaging infrastructure. The receiving nodes then deserialize the received manifest inventory, check the signature and form of the manifest for validity and authorization, and then request the requisite compressed part objects from the sending node. Once the receiving node receives all compressed parts for the manifest, the integrity of the parts is verified by comparing the hash of the part with the part hash in the signed, verified manifest. This manifest and its referred to parts are now useable on the receiving node to form a convergence once the manifests and parts from multiple scrapers have been received sufficient to meet the convergence rules. This ensures absolute integrity in the delivery of the stats between the scrapers and the nodes.

The nodes overcome the need to "trust" the scrapers by the exertion of the "convergence" requirement to be able to construct a converged set of stats using the convergence rules mentioned earlier. This means each node will cross compare, using the native hashes, the imprint of the stats objects from each scraper and make sure they agree before using them. This drastically reduces the probability of there being a man in the middle problem or source corruption problem with the scraper stats, since the intent is for each scraper in production to be hosted by an independent host, and the probability of a simultaneous attack that would result in the identical corruption of 3 or more authorized, independent scrapers (if there are 5 running) in such a way to make the hashes match is extremely unlikely.

Once a node has received the appropriate manifests, and done the cross compare and determined that a "convergence" exists, the nodes will then form a SB contract upon demand by the appropriate functions in main.cpp (or the NN loop in the scraper) as appropriate.

This contract then operates essentially as the existing NN does, with the hash going in the quorum popularity, etc. Pretty much from this point on, the operation is identical to the existing NN, without all of the baggage.

Progress update on the new Gridcoin BOINC stats collection system (the "Scraper" which replaces the old "Neural Network")