Hi! Today we will install Scrapy and run a simple spider. You can find many articles about installing Scrapy in a virtualenv, but here we will install Scrapy in a Docker container. First of all, install Docker. I will give the commands for Ubuntu; if there is interest, I can repeat this article for Windows. Although I already have Docker installed, the following steps come from the Docker documentation:
- Install packages to allow apt to use a repository over HTTPS:
$ sudo apt-get install apt-transport-https ca-certificates curl software-properties-common
- Add Docker’s official GPG key:
$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
- Use the following command to set up the stable repository. You always need the stable repository, even if you want to install edge builds as well.
$ sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
- Update the apt package index.
$ sudo apt-get update
- Install the latest version of Docker
$ sudo apt-get install docker-ce
- Verify that Docker CE is installed correctly by running the hello-world image.
$ sudo docker run hello-world
This command downloads a test image and runs it in a container. When the container runs, it prints an informational message and exits.
Now let’s make a simple Scrapy container.
- Make a container directory in your home directory
$ mkdir ~/Scrapy
$ cd ~/Scrapy
- Write the instructions for building the image
$ nano Dockerfile
# Version: 0.0.1
FROM python
RUN apt-get update && apt-get upgrade -y
RUN pip install --upgrade pip
RUN pip install scrapy
Now let’s create a shared directory that will be mounted into the container as a volume, and place a test spider there
$ mkdir ~/Scrapy/scrapy-data
$ cd ~/Scrapy/scrapy-data
$ nano quotes_spider.py
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.xpath('span/small/text()').extract_first(),
            }

        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
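The spider above needs Scrapy to run, but the shape of the items it yields is easy to illustrate with the standard library alone. This is only a sketch: it extracts the same two fields from a small, well-formed snippet whose markup is an assumption modeled on quotes.toscrape.com, not a captured response:

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed snippet modeled on quotes.toscrape.com markup
# (an assumption, not a real response) -- just enough for both fields.
page = """<html><body>
  <div class="quote">
    <span class="text">A day without sunshine is like, you know, night.</span>
    <span>by <small class="author">Steve Martin</small></span>
  </div>
</body></html>"""

root = ET.fromstring(page)
items = []
for quote in root.iter('div'):
    if quote.get('class') != 'quote':
        continue
    items.append({
        # Rough equivalent of quote.css('span.text::text').extract_first()
        'text': quote.find('./span[@class="text"]').text,
        # Rough equivalent of quote.xpath('span/small/text()').extract_first()
        'author': quote.find('.//small').text,
    })
print(items)
```

Each dict printed here has the same keys as the items the spider yields, which is exactly what ends up in the JSON output later.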
- Build image
$ sudo docker build -t scrapy .
- And run it
$ sudo docker run -v ~/Scrapy/scrapy-data:/scrapy scrapy scrapy runspider /scrapy/quotes_spider.py -o /scrapy/quotes.json
-v ~/Scrapy/scrapy-data:/scrapy
mounts the host directory ~/Scrapy/scrapy-data
into the container at /scrapy,
so the two share their contents.
scrapy runspider /scrapy/quotes_spider.py -o /scrapy/quotes.json
- the command that is run inside the container
The file ~/Scrapy/scrapy-data/quotes.json
will contain the result of running our spider. So now we have an environment for writing our spiders for Steemit, which will be the topic of the next article.
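Since the -o flag makes Scrapy serialize every yielded item into a JSON array, the result is easy to post-process with plain Python. A minimal sketch, using an inline sample shaped like the spider's output (the real file lives at ~/Scrapy/scrapy-data/quotes.json):

```python
import json

# Sample shaped like Scrapy's -o quotes.json feed:
# a JSON array of the dicts the spider yielded.
sample = '[{"text": "A day without sunshine is like, you know, night.", "author": "Steve Martin"}]'

quotes = json.loads(sample)
for q in quotes:
    print(f"{q['author']}: {q['text']}")
```

To run it against the real output, replace `json.loads(sample)` with `json.load(open('quotes.json'))` from inside the scrapy-data directory.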
Thank you.