Curated Resources

Articles and Blog Posts

Explains Scrapy from scratch, with examples that scrape Reddit, an XML site, and an e-commerce website (downloading images along with data).

Explains downloader middlewares through a pretty interesting use case.

Comprehensive article on how to bypass the most common anti-bot mechanisms. Demonstrates good practices by implementing reusable components, such as middlewares.

Written for Scrapy 1.0.x, so it uses older idioms (e.g. extract()[0] rather than extract_first()), but it shows an example of a custom MongoDB item pipeline.

Uses Scrapy 1.0 and Python 2, but still relevant.

Not the prettiest spider, but the article shows how to use scrapy-redis, scrapy-heroku and scrapyd to run a periodic crawler on Heroku.

Old but good. Uses [0].extract(); you should now use .extract_first().

Older articles, but still relevant, on how to configure Polipo as an HTTP proxy so that a crawler can route its traffic through the Tor network.
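Several of the articles above revolve around downloader middlewares. As a rough illustration of what such a component looks like, here is a minimal sketch of a middleware that sets a random User-Agent on each request. The class name and the User-Agent strings are hypothetical and not taken from any of the articles; the only Scrapy convention relied on is the process_request(request, spider) hook.

```python
import random


class RandomUserAgentMiddleware:
    """Hypothetical downloader middleware that picks a User-Agent per request."""

    # Illustrative placeholder values, not real browser strings.
    USER_AGENTS = [
        "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
        "Mozilla/5.0 (Macintosh) ExampleBrowser/2.0",
    ]

    def process_request(self, request, spider):
        # Scrapy calls this hook for each outgoing request that passes
        # through the downloader middleware chain.
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        # Returning None lets Scrapy continue handling the request as usual.
        return None
```

To try it, the class would be registered in the project's DOWNLOADER_MIDDLEWARES setting, for example under a path such as "myproject.middlewares.RandomUserAgentMiddleware" with a priority like 400; both the path and the number are placeholders to adapt to your project.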

Books

Very in-depth book on Scrapy. It shows Scrapy 1.0.x, and is Python 2 only.

It explains practically every component and setting to get you started with Scrapy, and lets you dive deeper if needed. There’s a very cool example of sending items asynchronously, using engine.download and inlineCallbacks. It also shows how to deploy a Scrapy project to Scrapinghub’s platform. The book even includes a quite intense (and very good) introduction to Twisted and nonblocking I/O programming.

The book has a companion website with videos for some chapters.

This book is not only about Scrapy, but it devotes a whole chapter to it: “Chapter 6. Heavyweight Scraping with Scrapy”.

It suggests using Anaconda, but make sure to use the conda-forge channel. There’s a nice introduction to XPath and how to use scrapy shell to test selectors. It also introduces ImagesPipeline to grab pictures of Nobel Prize winners, which is pretty cool, right?
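For readers who have not met ImagesPipeline yet, the fragment below is a minimal sketch of how it is typically switched on in a project's settings. It is not the book's code: the storage directory, the item contents and the example URL are made up for illustration. The pipeline looks for an image_urls field on each item by default, and it needs Pillow installed.

```python
# settings.py (sketch): enable Scrapy's built-in image pipeline.
ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,
}

# Directory where downloaded files are stored (hypothetical path).
IMAGES_STORE = "downloaded_images"

# Shape of an item the pipeline can act on: by default it reads the
# "image_urls" field and records download results under "images".
# The name and URL below are placeholders.
example_item = {
    "name": "Example Laureate",
    "image_urls": ["https://example.com/portrait.jpg"],
}
```

A spider would yield items shaped like example_item; the settings lines belong in the project's settings module.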

Courses

Python Scrapy Tutorial - Learn how to scrape websites and build a powerful web crawler using Scrapy and Python.

Use coupon code "SCRAPY" to get 90% discount, or just follow the link above.

Free and open source web crawling framework, written in Python.

Videos

Learn how to scrape the web using the Scrapy framework with this series of short videos. Companion code.

This workshop will provide an overview of Scrapy, starting from the fundamentals and working through each new topic with hands-on examples. Participants will come away with a good understanding of Scrapy, the principles behind its design, and how to apply the best practices encouraged by Scrapy to any scraping task.

Understand why it's necessary to Scrapy-ify early on, the anatomy of a Scrapy spider, using the interactive shell, what items are and how to use item loaders, examples of pipelines and middlewares, techniques to avoid getting banned, and how to deploy Scrapy projects.

This Scrapy tutorial video covers the following: what Scrapy is, why to use Scrapy and what the alternatives are, architecture, components and performance, and a quick demo.

Scrapy lets you straightforwardly pull data out of the web. It helps you retry if the site is down, extract content from pages using CSS selectors (or XPath), and cover your code with tests. It downloads asynchronously with high performance. You program to a simple model, and it’s good for web APIs, too.

Python has great tools like Django and Flask for taking your database and turning it into HTML pages, but what if you want to take somebody else’s HTML pages and build a database from them? Scrapy is a library for building web spiders that will simplify your web scraping tasks immensely. Friends don’t let friends use raw urllib2.

Slides

This talk presents two key technologies that can be used: Scrapy, an open source & scalable web crawling framework, and Mr. Schemato, a new, open source semantic web validator and distiller.

Crawling technologies are the basis for search engines, but they also have many applications for business and for fun.

In these slides, the author shares how to solve Big Data issues using Python open source tools.

Tutorial on how to scrape (crawl) a website’s content using Scrapy and Python.