Hi! Today we will install Scrapy and run a simple spider. You can find many articles about installing Scrapy in a virtualenv, but here we will install Scrapy in a Docker container. First of all, install Docker. I will give the commands for Ubuntu; if there is interest, I can repeat this article for Windows. Although I already have Docker installed, the following steps come from the Docker documentation:
- Install packages to allow apt to use a repository over HTTPS:
$ sudo apt-get install apt-transport-https ca-certificates curl software-properties-common
- Add Docker’s official GPG key:
$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
- Use the following command to set up the stable repository. You always need the stable repository, even if you want to install edge builds as well.
$ sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
- Update the apt package index.
$ sudo apt-get update
- Install the latest version of Docker
$ sudo apt-get install docker-ce
- Verify that Docker CE is installed correctly by running the hello-world image.
$ sudo docker run hello-world
This command downloads a test image and runs it in a container. When the container runs, it prints an informational message and exits.
Now let’s make a simple Scrapy container.
- Make a container directory in your home directory
$ mkdir ~/Scrapy
$ cd ~/Scrapy
- Write the instructions for building the image
$ nano Dockerfile
# Version: 0.0.1
FROM python
RUN apt-get update && apt-get upgrade -y
RUN pip install --upgrade pip
RUN pip install scrapy
Now let’s create a shared directory that will be mounted into the container as a volume, and place a test spider there
$ mkdir ~/Scrapy/scrapy-data
$ cd ~/Scrapy/scrapy-data
$ nano quotes_spider.py
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.xpath('span/small/text()').extract_first(),
            }

        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
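The spider above needs Scrapy to run, but the shape of the items it yields is easy to illustrate with the standard library alone. This is only a sketch: it extracts the same two fields from a small, well-formed snippet whose markup is an assumption modeled on quotes.toscrape.com, not a captured response:

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed snippet modeled on quotes.toscrape.com markup
# (an assumption, not a real response) -- just enough for both fields.
page = """<html><body>
  <div class="quote">
    <span class="text">A day without sunshine is like, you know, night.</span>
    <span>by <small class="author">Steve Martin</small></span>
  </div>
</body></html>"""

root = ET.fromstring(page)
items = []
for quote in root.iter('div'):
    if quote.get('class') != 'quote':
        continue
    items.append({
        # Rough equivalent of quote.css('span.text::text').extract_first()
        'text': quote.find('./span[@class="text"]').text,
        # Rough equivalent of quote.xpath('span/small/text()').extract_first()
        'author': quote.find('.//small').text,
    })
print(items)
```

Each dict printed here has the same keys as the items the spider yields, which is exactly what ends up in the JSON output later.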
- Build image
$ sudo docker build -t scrapy .
- And run it
$ sudo docker run -v ~/Scrapy/scrapy-data:/scrapy scrapy scrapy runspider /scrapy/quotes_spider.py -o /scrapy/quotes.json
-v ~/Scrapy/scrapy-data:/scrapy
mounts the host directory ~/Scrapy/scrapy-data
into the container at /scrapy,
so the two share their contents.
scrapy runspider /scrapy/quotes_spider.py -o /scrapy/quotes.json
- the command that is run inside the container
The file ~/Scrapy/scrapy-data/quotes.json
will contain the result of running our spider. So now we have an environment for writing our spiders for Steemit, which will be the topic of the next article.
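Since the -o flag makes Scrapy serialize every yielded item into a JSON array, the result is easy to post-process with plain Python. A minimal sketch, using an inline sample shaped like the spider's output (the real file lives at ~/Scrapy/scrapy-data/quotes.json):

```python
import json

# Sample shaped like Scrapy's -o quotes.json feed:
# a JSON array of the dicts the spider yielded.
sample = '[{"text": "A day without sunshine is like, you know, night.", "author": "Steve Martin"}]'

quotes = json.loads(sample)
for q in quotes:
    print(f"{q['author']}: {q['text']}")
```

To run it against the real output, replace `json.loads(sample)` with `json.load(open('quotes.json'))` from inside the scrapy-data directory.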
Thank you.