Installing Scrapy in Docker

Hi! Today we will install Scrapy and run a simple spider. You can find many articles about installing Scrapy in a virtualenv, but here we will install it in a Docker container. First of all, install Docker. I will write the commands for Ubuntu; if there is interest, I can repeat this article for Windows. Although I already have Docker installed, the information below comes from the Docker documentation:


  • Install packages to allow apt to use a repository over HTTPS:

$ sudo apt-get install apt-transport-https ca-certificates curl software-properties-common

  • Add Docker’s official GPG key:

$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
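  • (Optional) Verify that the key was added. The Docker documentation suggests searching for the last 8 characters of its fingerprint:

$ sudo apt-key fingerprint 0EBFCD88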

  • Use the following command to set up the stable repository. You always need the stable repository, even if you want to install edge builds as well.

$ sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"

  • Update the apt package index.

$ sudo apt-get update

  • Install the latest version of Docker CE.

$ sudo apt-get install docker-ce
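  • You can check which version was installed:

$ docker --version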

  • Verify that Docker CE is installed correctly by running the hello-world image.

$ sudo docker run hello-world

This command downloads a test image and runs it in a container. When the container runs, it prints an informational message and exits.
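Note: I run all docker commands with sudo in this article. As an optional convenience (this is the standard Docker post-install step, not required here), you can add your user to the docker group and then log out and back in, after which sudo is no longer needed:

$ sudo usermod -aG docker $USER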


Now let’s make a simple Scrapy container.

  • Make a directory for the container in your home directory:

$ mkdir ~/Scrapy

$ cd ~/Scrapy

  • Write the instructions to build the image:

$ nano Dockerfile


# Version: 0.0.1

# Start from the official Python base image
FROM python

# Refresh the system packages inside the image
RUN apt-get update && apt-get upgrade -y

# Install the latest pip, then Scrapy
RUN pip install --upgrade pip

RUN pip install scrapy
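If you want reproducible builds, one option is to pin the base image tag and the Scrapy version instead of always taking the latest. A minimal sketch; the tag and version number below are only examples, adjust them to whatever you need:

# Version: 0.0.2 (pinned variant, versions are examples)
FROM python:3.11-slim

RUN pip install --upgrade pip && \
    pip install scrapy==2.11.0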


Now let’s create a directory on the host that will be mounted into the container as a shared volume, and place a test spider in it:

$ mkdir ~/Scrapy/scrapy-data

$ cd ~/Scrapy/scrapy-data

$ nano quotes_spider.py


import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        # Extract the text and author from every quote block on the page
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.xpath('span/small/text()').extract_first(),
            }

        # Follow the "Next" pagination link, if there is one
        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
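Each dictionary the spider yields becomes one item in the output feed. With -o /scrapy/quotes.json the result is a single JSON array of objects of this shape (the values below are illustrative placeholders, not real scraped data):

[
    {"text": "“...first quote text...”", "author": "Author Name"},
    {"text": "“...second quote text...”", "author": "Another Author"}
]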


  • Build the image:

$ sudo docker build -t scrapy .
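  • You can confirm that the image was created by listing the local images:

$ sudo docker images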

  • And run it:

$ sudo docker run -v ~/Scrapy/scrapy-data:/scrapy scrapy scrapy runspider /scrapy/quotes_spider.py -o /scrapy/quotes.json


-v ~/Scrapy/scrapy-data:/scrapy means that the host directory ~/Scrapy/scrapy-data is mounted into the container at /scrapy as a shared volume; the general form is -v <host-path>:<container-path>.

scrapy runspider /scrapy/quotes_spider.py -o /scrapy/quotes.json is the command that will be run inside the container (the first scrapy after the options is the image name; everything after it is the command).


The file ~/Scrapy/scrapy-data/quotes.json will contain the result of executing our spider. As a result, we now have an environment for writing our spiders for steemit, which will be the topic of the next article.
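To sanity-check the output from the host, you can load the file with a few lines of Python. This is just a minimal sketch: it assumes the run above completed and that the .json feed is a single JSON array (Scrapy's default for that extension); the path and field names match the spider above.

import json
from pathlib import Path

# Feed file written by the containerized spider run above
path = Path.home() / "Scrapy" / "scrapy-data" / "quotes.json"

quotes = json.loads(path.read_text(encoding="utf-8"))
print(f"Scraped {len(quotes)} quotes")

# Show the first few items
for q in quotes[:3]:
    print(q["author"], "-", q["text"][:60])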

Thank you.

 

