


What is Scrapy?

Scrapy is a Python framework for web scraping, maintained by Zyte.


Our View on Scrapy

Usage Rating of Scrapy


1. BEST CHOICE: This is among the preferred tools we use


Scrapy is our first choice for every website that is not protected by a dedicated anti-bot tool. It is the de facto industry standard for web scraping in Python.


Configurations

With sensible default request headers, a small number of concurrent requests and a delay between them, you can scrape many common websites.
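The default headers mentioned above are controlled by Scrapy's DEFAULT_REQUEST_HEADERS setting. The fragment below is a minimal sketch: the header values are our assumption, chosen to resemble what a desktop browser sends, and should be adapted to the target website.

```python
# settings.py (fragment)
# Headers added to every request unless a spider overrides them.
# The values below are illustrative assumptions, not a recommended standard.
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}
```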


Inside a standard settings.py file you will find the following entries:

USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36"

This option sets the User-Agent header sent with every request, so that the scraper presents itself as a regular browser, in this case Chrome.

ROBOTSTXT_OBEY = True

This option controls whether the scraper follows the rules written in the target website's robots.txt file. For fair web scraping practice, it should be set to True.
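To make the effect of robots.txt rules concrete, here is a small sketch using only Python's standard library. It is not how Scrapy implements ROBOTSTXT_OBEY internally (Scrapy applies the rules through its own middleware); the robots.txt content, the bot name "my-bot" and the example.com URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A product page is not covered by any Disallow rule, so it may be fetched.
print(is_allowed(ROBOTS_TXT, "my-bot", "https://example.com/products/1"))
# Anything under /private/ is disallowed for every user agent.
print(is_allowed(ROBOTS_TXT, "my-bot", "https://example.com/private/report"))
```

With ROBOTSTXT_OBEY = True, requests of the second kind are the ones Scrapy would skip.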


CONCURRENT_REQUESTS = 3

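Maximum number of concurrent requests Scrapy performs. The setting is global rather than per website, but for a spider that targets a single website it effectively caps the number of requests sent to that website at once. The right value depends on the size of the target, but in our opinion it should not exceed 10, to avoid overloading the target website's servers and triggering anti-bot protection systems.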


DOWNLOAD_DELAY = 1

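Minimum delay, in seconds, between consecutive requests to the same website. Scrapy is asynchronous rather than thread-based, so the delay is not applied per thread: it spaces out requests to the same website regardless of the CONCURRENT_REQUESTS value. By default Scrapy randomizes the actual wait between 0.5 and 1.5 times this value.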
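As a rough illustration of what DOWNLOAD_DELAY means for crawl duration, the sketch below estimates how long a single-website crawl takes. It assumes Scrapy's default randomization of the delay (between 0.5 and 1.5 times DOWNLOAD_DELAY, so DOWNLOAD_DELAY on average) and ignores response time and retries; the function names are ours, not part of Scrapy.

```python
import random


def sample_delay(download_delay: float) -> float:
    """Draw one randomized wait, mirroring Scrapy's default 0.5x to 1.5x range."""
    return random.uniform(0.5 * download_delay, 1.5 * download_delay)


def estimated_crawl_seconds(pages: int, download_delay: float) -> float:
    """Rough estimate: on average one request per download_delay seconds."""
    return pages * download_delay


# 3600 pages with DOWNLOAD_DELAY = 1 take roughly one hour at best.
print(estimated_crawl_seconds(3600, 1.0) / 3600, "hours (approx.)")
print(round(sample_delay(1.0), 2), "seconds for one sampled wait")
```

The estimate is a lower bound in practice: slow responses, retries and item processing all add time on top of the delay.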



A standard Scrapy installation can be extended with Python modules that add further capabilities:

  • scrapy_proxies: handles external lists of proxies, picking one at random for each request and removing those that stop working
  • scrapy_splash: renders the JavaScript in a web page through an API
  • selenium webdriver: for cases that need a full headless browser
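As an example of how such a module is wired in, the fragment below sketches a scrapy_proxies configuration. The setting names and middleware priorities are as we recall them from the scrapy_proxies README, so treat them as an assumption and check them against the version you install; the proxy list path is a placeholder.

```python
# settings.py (fragment), assuming the scrapy_proxies package is installed.

# Retry more often than usual, since individual proxies fail frequently.
RETRY_TIMES = 10
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

# Run the retry middleware first, then pick a random proxy,
# then let Scrapy's built-in proxy middleware apply it.
DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.retry.RetryMiddleware": 90,
    "scrapy_proxies.RandomProxy": 100,
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 110,
}

# Placeholder path to a text file with one proxy URL per line.
PROXY_LIST = "/path/to/proxy/list.txt"

# 0 = use a different random proxy for every request (as we recall the README).
PROXY_MODE = 0
```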


Our standards and best practices

Please read our standards and best practices for web scraping in Python before implementing a scraper for a new website with Scrapy.


When to use Scrapy

Whenever possible. The first attempt to scrape a website should always use a standard Scrapy configuration, unless our analysis has already shown that it is not enough.



Reference and documentation

Official project website: https://scrapy.org/

Short tutorial: https://towardsdatascience.com/a-minimalist-end-to-end-scrapy-tutorial-part-i-11e350bcdec0