
How to parse steemit (scraping a dynamically generated frontend)

Hello! 

Today's story is about scraping dynamically generated frontends. Modern web technologies run part of the code on the client (browser) side. This makes websites more flexible, reduces server-side load, and allows content to be loaded dynamically.

If you have heard of ReactJS, Angular, Ember, or Backbone, this is all about dynamically generated frontends. Steemit, for example, is written in ReactJS. But advantages for users are disadvantages for scrapers. In steemit's case only a few articles of a user's blog are loaded initially; to see more articles you must scroll down the page. An action that is simple for a user is not so simple for a spider.
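To make the problem concrete, here is a small sketch: the static HTML that such a server returns is a nearly empty shell, and a naive scraper finds none of the rendered articles in it. The HTML string below is a made-up stand-in for a React-rendered page, not actual steemit markup.

```python
from html.parser import HTMLParser

# Hypothetical initial payload of a React page: an empty container, a JSON
# state blob, and a script bundle. The visible articles appear only after
# the browser executes the bundled JavaScript.
INITIAL_HTML = """
<html><body>
  <div id="content"></div>
  <script>window.__STATE__ = {"posts": ["post-1", "post-2"]};</script>
  <script src="/bundle.js"></script>
</body></html>
"""

class ArticleCounter(HTMLParser):
    """Count <article> tags, the way a naive scraper would look for posts."""
    def __init__(self):
        super().__init__()
        self.articles = 0

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.articles += 1

parser = ArticleCounter()
parser.feed(INITIAL_HTML)
# A plain HTTP fetch sees zero rendered articles, even though the post
# data is already present in the inlined JSON state.
print(parser.articles)  # -> 0
```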

To deal with this problem, scraping frameworks interact with browser automation tools. The most famous tool of this kind is Selenium, whose primary task is automated testing of web pages. But this method needs an active browser. An alternative is Splash: a browser without a GUI, wrapped in a docker container and controlled through an HTTP API. One more important thing is that Scrapy has a plugin for Splash. The example below from the Scrapy-Splash documentation shows how to integrate Splash.
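Before looking at the Scrapy plugin, it helps to see how bare the Splash HTTP API is: you GET its render.html endpoint with the target url and optional arguments such as wait. A minimal sketch of building such a request URL, assuming Splash runs locally with the default `docker run -p 8050:8050 scrapinghub/splash`:

```python
from urllib.parse import urlencode

# Default address of a locally running Splash container.
SPLASH = "http://localhost:8050/render.html"

def splash_url(target, wait=0.5):
    """Build a render.html request URL for a given target page."""
    return SPLASH + "?" + urlencode({"url": target, "wait": wait})

# GET-ting this URL returns the HTML *after* the browser has run the
# page's JavaScript and waited 0.5 seconds.
print(splash_url("http://example.com"))
# -> http://localhost:8050/render.html?url=http%3A%2F%2Fexample.com&wait=0.5
```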


import scrapy
from scrapy_splash import SplashRequest

class MySpider(scrapy.Spider):
    start_urls = ["http://example.com", "http://example.com/foo"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        # response.body is a result of render.html call; it
        # contains HTML processed by a browser.
        # ...
        pass
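The example above only works once Splash is wired into the Scrapy project settings. Per the scrapy-splash documentation the wiring looks roughly like this (SPLASH_URL assumes a Splash container running locally on port 8050):

```python
# settings.py -- scrapy-splash wiring, as described in the plugin's docs.
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Make request fingerprinting aware of Splash arguments.
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```

With these settings in place, every SplashRequest is routed through the container instead of being fetched directly.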

Looks great. 

In the next article we will try to parse steemit itself using Scrapy-Splash.
