[ENG/ITA] Python and Hive: A Tool to Simplify Curation | Work in Progress!





Python and Hive: A Tool to Simplify Curation | Work in Progress!

My first Python project was a small bot that upvotes and comments on posts under which I have previously left a comment containing a certain keyword. Its purpose is to let me use a single account to decide whether, and by how much, to upvote a post with my secondary account.

Sometimes I might want to upvote only with my main account, sometimes only with my secondary account, sometimes with both but at different percentages. Setting up a curation trail with hive.vote would be too restrictive in these cases, while a custom bot that lets me choose what to do each time is much more flexible and avoids wasting precious upvotes.

Compared with the code I shared last time, I have made some small improvements: some were suggested by other users, others make the code more robust and less likely to crash unexpectedly.

I am still polishing a few last details, but in the meantime you can already find the script on GitHub... or at least you will be able to find it as soon as I set the repository to “public” 😂 so if you click on the link shortly after this post is published, you will sadly see nothing yet.


Now let's move on to a new project!

Having (almost) finished the first project, it's time to move on to something different, in an effort to learn new stuff!

This time the idea comes from a suggestion by @stewie.wieno, who asked me whether Python could be used to build something that would make it easier to set up a sort of curation trail to support Italian users on Hive.

Therefore, my idea was to design a script that had the following features:

  • find posts with a particular tag (e.g., ita);
  • check if the post is written in Italian;
  • check if the post has at least 500 words (or 1000 if the post is written in two languages).

If these requirements are met, the post is added to a special list.
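
The three requirements above can be sketched as a single yes/no check. This is only a minimal illustration: the function name `is_eligible` and its parameters are made up for the example, and it assumes the tag list, the language result and the word count have already been computed elsewhere.

```python
# Thresholds taken from the requirements listed above
MIN_WORDS_SINGLE_LANGUAGE = 500
MIN_WORDS_MULTI_LANGUAGE = 1000


def is_eligible(tags, is_italian, num_languages, word_count, wanted_tag="ita"):
    """Return True if a post passes the tag, language and length checks.

    All names here are illustrative, they do not come from any library.
    """
    if wanted_tag not in tags:
        return False
    if not is_italian:
        return False
    # Posts written in more than one language need twice as many words
    if num_languages <= 1:
        min_words = MIN_WORDS_SINGLE_LANGUAGE
    else:
        min_words = MIN_WORDS_MULTI_LANGUAGE
    return word_count >= min_words


# Same post, judged as single-language and as bilingual
print(is_eligible(["ita", "python"], True, 1, 620))
print(is_eligible(["ita", "python"], True, 2, 620))
```

Keeping the rule in one small function would make it easy to change the thresholds or the tag later without touching the streaming code.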

Here the task of this first script ends.

The list can then be checked manually by one or more curators, who make sure that the posts are good quality, are not spam and do not violate any Hive rules.

After that I would like to create a second script that takes the cleaned-up list and upvotes the selected posts, also leaving an informative comment on each one.
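
The second script does not exist yet, so what follows is only a rough sketch of its first step: reading the reviewed list and turning each URL back into an author and a permlink. The function names are invented for the example, and `vote_and_comment` is a hypothetical placeholder for whatever will actually cast the vote and post the comment.

```python
from urllib.parse import urlparse


def parse_post_url(url):
    """Split 'https://peakd.com/@author/permlink' into (author, permlink)."""
    path = urlparse(url.strip()).path  # e.g. "/@author/permlink"
    parts = [part for part in path.split("/") if part]
    if len(parts) < 2 or not parts[-2].startswith("@"):
        raise ValueError(f"Not a post URL: {url!r}")
    return parts[-2][1:], parts[-1]


def process_reviewed_list(lines, vote_and_comment):
    """Call vote_and_comment(author, permlink) for every non-empty line.

    vote_and_comment is a hypothetical callback supplied by the caller.
    """
    handled = []
    for line in lines:
        if not line.strip():
            continue
        author, permlink = parse_post_url(line)
        vote_and_comment(author, permlink)
        handled.append((author, permlink))
    return handled


# Dry run: print instead of voting
reviewed = ["https://peakd.com/@alice/my-post\n", "\n"]
process_reviewed_list(reviewed, lambda author, permlink: print(author, permlink))
```

Passing the voting logic in as a callback means the list handling can be tried out with a dry run before any real vote is cast.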

This would greatly simplify and speed up the curators' work, with the two scripts taking care of almost the entire process automatically.

Of course this is only the beginning, but building such a tool seemed like an interesting exercise, so I wanted to try this little experiment :)


And here's the code!

Below is the code for the first of the two scripts I am working on, already working and ready to be polished:


#!/usr/bin/env python3
"""A script to simplify curation on Hive"""
from beem import Hive
from beem.blockchain import Blockchain
import beem.instance
import os
import json
import markdown
from bs4 import BeautifulSoup
import re
from langdetect import detect_langs, LangDetectException as lang_e

# Instantiate Hive
HIVE_API_NODE = "https://api.deathwing.me"
HIVE = Hive(node=[HIVE_API_NODE])

beem.instance.set_shared_blockchain_instance(HIVE)


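def get_block_number():
    """Return the last processed block number, or None on the first run."""
    if not os.path.exists("last_block.txt"):
        return None

    with open("last_block.txt", "r") as infile:
        block_num = infile.read().strip()

    # An empty or corrupted file is treated like a missing one
    return int(block_num) if block_num.isdigit() else None


def set_block_number(block_num):
    """Store the last processed block number so the stream can resume later."""
    with open("last_block.txt", "w") as outfile:
        outfile.write(f"{block_num}")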


def convert_and_count_words(md_text):
    # Convert text from markdown to HTML
    html = markdown.markdown(md_text)

    # Get text
    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text()

    # Count text words
    words = re.findall(r"\b\w+\b", text)
    return len(words)


def text_language(text):
    # Detect languages
    try:
        languages = detect_langs(text)
    except lang_e:
        return False, 0

    # Count languages
    num_languages = len(languages)

    # Sort languages from more to less probable
    languages_sorted = sorted(languages, key=lambda x: x.prob, reverse=True)

    # Keep the most probable languages (up to 2); slicing handles short lists
    top_languages = languages_sorted[:2]

    # Check if target language is among the top languages
    contains_target_lang = any(lang.lang == "it" for lang in top_languages)

    # Return True/False and number of languages detected
    return contains_target_lang, num_languages


def hive_comments_stream():

    blockchain = Blockchain(node=[HIVE_API_NODE])

    start_block = get_block_number()

    for op in blockchain.stream(
        opNames=["comment"], start=start_block, threading=False, thread_num=1
    ):
        set_block_number(op["block_num"])

        # Skip comments
        if op.get("parent_author") != "":
            continue

        # Check if there's the key "json_metadata"
        if "json_metadata" not in op:
            continue

        # Deserialize 'json_metadata' (skip the post if it is empty or malformed)
        try:
            json_metadata = json.loads(op["json_metadata"])
        except (TypeError, ValueError):
            continue

        # Check if there's the key "tags"
        if not isinstance(json_metadata, dict) or "tags" not in json_metadata:
            continue

        # Check if there's the tag we are looking for
        if "ita" not in json_metadata["tags"]:
            continue

        post_text = op.get("body", "")

        # Check post language
        is_valid_language, languages_num = text_language(post_text)

        if not is_valid_language:
            continue

        # Check post length: 500 words for one language, 1000 for more
        word_count = convert_and_count_words(post_text)
        min_words = 500 if languages_num == 1 else 1000

        if word_count < min_words:
            print("Post is too short")
            continue

        # data of the post
        post_author = op["author"]
        post_permlink = op["permlink"]
        post_url = f"https://peakd.com/@{post_author}/{post_permlink}"
        terminal_message = (
            f"Found eligible post: " f"{post_url} " f"in block {op['block_num']}"
        )
        print(terminal_message)

        with open("urls", "a", encoding="utf-8") as file:
            file.write(post_url + "\n")


if __name__ == "__main__":

    hive_comments_stream()
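
One fragile spot in the script above is the handling of `json_metadata`. The sketch below shows one possible way to pull the tags out defensively. It assumes the field can arrive empty, malformed, or with an unexpected shape, and `extract_tags` is a name invented for the example.

```python
import json


def extract_tags(raw_metadata):
    """Return the list of tags found in a raw json_metadata value.

    An empty list is returned whenever the value cannot be used.
    """
    if isinstance(raw_metadata, dict):
        metadata = raw_metadata
    else:
        try:
            metadata = json.loads(raw_metadata or "{}")
        except (TypeError, ValueError):
            return []

    if not isinstance(metadata, dict):
        return []

    tags = metadata.get("tags", [])
    # Assumption: some clients may store tags as one space-separated string
    if isinstance(tags, str):
        tags = tags.split()
    if not isinstance(tags, list):
        return []
    return [tag for tag in tags if isinstance(tag, str)]


print(extract_tags('{"tags": ["ita", "python"]}'))
print(extract_tags("this is not JSON"))
```

With a helper like this, the loop only needs a single `if "ita" not in extract_tags(...)` check and never raises on a bad post.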



This time there are no templates or configuration files.

Like last time I would be very happy to receive suggestions and advice on how to make the code even more efficient and correct :)
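
One robustness tweak I am considering myself concerns `last_block.txt`: if the script is interrupted while writing the file, the checkpoint could be left empty. A minimal sketch of an atomic write follows, with function names invented for the example. It writes to a temporary file first and then renames it over the real one.

```python
import os
import tempfile


def save_checkpoint(path, block_num):
    """Write block_num to path atomically: temp file first, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(str(block_num))
        # os.replace swaps the file in a single step on the same filesystem
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    """Return the saved block number, or None if missing or unreadable."""
    try:
        with open(path, "r") as infile:
            return int(infile.read().strip())
    except (FileNotFoundError, ValueError):
        return None


# Try it out in a throwaway directory
with tempfile.TemporaryDirectory() as tmp_dir:
    checkpoint = os.path.join(tmp_dir, "last_block.txt")
    save_checkpoint(checkpoint, 12345)
    print(load_checkpoint(checkpoint))
```

Since the script saves the block number for every operation in the stream, this would add some overhead, so it is a trade-off rather than an obvious win.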


Images are the property of their respective owners

to support the #OliodiBalena community, @balaenoptera is 3% beneficiary of this post


If you've read this far, thank you! If you want to leave an upvote, a reblog, a follow, a comment... well, any sign of life is very much appreciated!


