Money is one of the most human incentives, so it's not surprising that many people start blogging on Steem - a blockchain platform that rewards social activity. But as with many other things, getting rich on Steem is not easy. Although everyone can be inspired by top authors who enjoy success, real life is much more prosaic, and it may take a long time to get to the top.
So one of the most exciting questions is how success is distributed across the platform. How many users receive a lot of upvotes and steems for their activity? Are there workarounds that make the way to the top easier? Or do all of the goods actually go to a small clique?
Sometimes, all of these questions can be answered by a simple statistical analysis that, for example, computes the distribution of important variables or searches for correlations between factors such as the number of a user's outgoing comments and the number of likes. Unfortunately, most of the data about steemians' activity varies greatly. For example, let's look at the distribution of the number of upvotes per post:
As we can see, most posts receive a minimal number of upvotes, but some get thousands. The same thing happens if we plot the distribution of post payouts:
And since these variables are not always correlated, it would be difficult to divide the data into subgroups manually.
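To get a feel for how skewed such a variable is, one option is to count posts in power-of-ten buckets. The snippet below is only a sketch: the helper names `log10_bucket` and `bucket_counts` are hypothetical, and the upvote counts are made up for illustration rather than taken from the dataset.

```python
from collections import Counter


def log10_bucket(value):
    """Return a power-of-ten bucket label for a non-negative integer count."""
    if value <= 0:
        return "0"
    # The number of digits gives the order of magnitude: 42 -> bucket "10-99".
    lower = 10 ** (len(str(value)) - 1)
    return f"{lower}-{lower * 10 - 1}"


def bucket_counts(values):
    """Count how many values fall into each power-of-ten bucket."""
    return Counter(log10_bucket(v) for v in values)


# Made-up upvote counts, for illustration only.
upvotes = [0, 1, 2, 2, 3, 5, 8, 12, 40, 150, 2300]
for bucket, count in bucket_counts(upvotes).items():
    print(bucket, count)
```

With data like this, most posts land in the lowest buckets while a handful sit several orders of magnitude higher, which is the pattern described above.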
Fortunately, there are a lot of well-established machine learning methods that can divide tangled data into meaningful subgroups. So I've decided to run one of them - the KMeans clustering algorithm - on a dataset of posts published between 1 and 30 April and share the results with you.
The data used in this research consists of 628,633 rows, each representing a post, with the following columns: author, total payout value (author + curator rewards), votes count, comments count and author reputation. After collecting the data, I calculated statistics for each author in the dataset. These statistics include the median comments and votes counts as well as the median payout value across each author's posts.
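The per-author aggregation described above can be sketched with the Python standard library. The dictionary keys and the `author_stats` helper are hypothetical names chosen for this example, and the sample rows are invented. How reputation was aggregated per author is not stated in the post, so taking its median here is an assumption.

```python
from collections import defaultdict
from statistics import median


def author_stats(posts):
    """Aggregate post rows into per-author median statistics.

    Each post is a dict with the (hypothetical) keys
    'author', 'payout', 'votes', 'comments' and 'reputation'.
    """
    by_author = defaultdict(list)
    for post in posts:
        by_author[post["author"]].append(post)

    stats = {}
    for author, rows in by_author.items():
        stats[author] = {
            "median_payout": median(r["payout"] for r in rows),
            "median_votes": median(r["votes"] for r in rows),
            "median_comments": median(r["comments"] for r in rows),
            # Assumption: reputation is summarised by its median as well.
            "reputation": median(r["reputation"] for r in rows),
        }
    return stats


# Invented sample rows, for illustration only.
posts = [
    {"author": "alice", "payout": 1.0, "votes": 3, "comments": 0, "reputation": 40},
    {"author": "alice", "payout": 3.0, "votes": 9, "comments": 2, "reputation": 40},
    {"author": "alice", "payout": 100.0, "votes": 500, "comments": 4, "reputation": 41},
    {"author": "bob", "payout": 0.0, "votes": 1, "comments": 0, "reputation": 25},
]
print(author_stats(posts)["alice"])
```

The median is a reasonable choice for data like this because a single viral post (such as the third row above) does not shift an author's typical values.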
Overall, there are 59,204 authors in the dataset. It has the following structure:
Finally, the resulting author data was divided into 7 clusters by the KMeans algorithm.
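For readers unfamiliar with the method, here is a minimal sketch of the k-means idea (Lloyd's algorithm) in plain Python. It is not the code used for this post: the `kmeans` function, its parameters, the random initialisation and the toy points are all assumptions made for illustration, and the post does not say which implementation was used or whether the features were scaled beforehand.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's algorithm. Returns (labels, centroids)."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k randomly chosen data points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids


# Toy (median votes, median payout) pairs, invented for illustration.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (100.0, 120.0), (110.0, 100.0), (104.0, 111.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

Because k-means relies on Euclidean distance, a feature with a much larger range (such as votes count) dominates the result unless the features are rescaled first.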
After running KMeans, I got the following results:
As we can see from this plot, two subgroups (highlighted in blue and grey) can be easily separated from the other authors due to their very high median upvotes count. The other subgroups form a dense 'core' and have no visibly separable boundaries.
But if we look at the following table, we will see that these subgroups have clear differences, so let's examine them closely:
The groups that have a normal count of upvotes fall into three categories. Cluster #3 most likely represents newbies, as users in this group have a low reputation. Cluster #5 looks more intriguing, as users in this group have a higher reputation score but relatively small payouts. It would be interesting to hear your suggestions as to why these users can't get more attention from steemians.
Clusters #0 and #2 form the "middle class" of the platform. They have average values for all of the variables. The only difference between these clusters is in the votes count. Interestingly, although the median votes count in the second group is twice that of the first, the median payouts differ only insignificantly.
Cluster #4 represents the "rich and famous" of Steem. These authors have much higher median payouts and upvotes counts, although the latter do not exceed a few hundred. They also have the highest median comments count. This may indicate that these users rely more on readers' sympathy than on upvote services.
The remaining two groups have a common feature: a very high number of upvotes, which may be related to the use of bots and paid upvote services. After a closer look at cluster #6, we can see that this small subgroup of 11 users consists mostly of users related to the service named shadowbot, such as @jonbit.
Finally, the last cluster (#1) also consists mostly of upvote service users, but their achievements are more modest than in the previous group. Another interesting feature of this group is that it includes a lot of Russian-speaking users:
Although this analysis is very simple, I think we can draw some interesting conclusions from it.
Steem society has relatively well-defined subgroups: a lower class, a middle class, and an upper class
These subgroups are diverse too and have several subclusters with different behavior strategies
We can see that more than two-thirds of users don't have much success on Steem. One-third of users look like minnows, and 50% look like more experienced users but have low median payouts
15% of users have average values. And only one percent of users have top stats.
Some subgroups of users that use paid upvote services can be easily separated from other users. Moreover, a larger number of clusters might reveal more such subclusters in the overall community.
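Shares like the ones quoted in this list can be derived directly from the cluster labels. The sketch below assumes a plain list of labels with one entry per author. The `cluster_shares` helper is a hypothetical name and the labels are made up, so the printed numbers do not reproduce the figures above.

```python
from collections import Counter


def cluster_shares(labels):
    """Return {cluster_id: percentage of authors assigned to that cluster}."""
    counts = Counter(labels)
    total = len(labels)
    return {c: 100.0 * n / total for c, n in sorted(counts.items())}


# Made-up labels, for illustration only.
labels = [3, 3, 3, 3, 5, 5, 5, 5, 5, 5, 0, 2]
for cluster, share in cluster_shares(labels).items():
    print(f"cluster #{cluster}: {share:.1f}% of authors")
```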
I believe that this kind of analysis gives us a good starting point for more detailed research on the Steem community. It would be very interesting to try to find possible correlations between community subgroups and variables that define social activity (e.g., number of posts in a week or voting patterns).
I should also mention the limitations of the presented analysis: