An open source and collaborative framework for extracting the data you need from websites, in a fast, simple, yet extensible way.

Install the latest version of Scrapy

Scrapy 1.0

pip install scrapy

Terminal

 pip install scrapy
 cat > myspider.py <<EOF
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        for url in response.css('ul li a::attr("href")').re('.*/category/.*'):
            yield scrapy.Request(response.urljoin(url), self.parse_titles)

    def parse_titles(self, response):
        for post_title in response.css('div.entries > ul > li a::text').extract():
            yield {'title': post_title}
EOF
 scrapy runspider myspider.py

Build and run your web spiders

Terminal

 shub login
Insert your Scrapinghub API Key: <API_KEY>

# Deploy the spider to Scrapy Cloud
 shub deploy

# Schedule the spider for execution
 shub schedule blogspider 
Spider blogspider scheduled, watch it running here:
https://dash.scrapinghub.com/p/26731/job/1/8

# Retrieve the scraped data
 shub items 26731/1/8
{"title": "Black Friday, Cyber Monday: Are They Worth It?"}
{"title": "Tips for Creating a Cohesive Company Culture Remotely"}
...

Deploy them to Scrapy Cloud

or use Scrapyd to host the spiders on your own server
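A minimal self-hosted sketch, assuming Scrapyd runs locally on its default port (6800) and your project's scrapy.cfg has a [deploy] section pointing at http://localhost:6800/ with project = myproject (placeholder names):

Terminal

# Install and start Scrapyd (scrapyd-deploy comes from scrapyd-client)
 pip install scrapyd scrapyd-client
 scrapyd

# Deploy the project to the local Scrapyd server
 scrapyd-deploy

# Schedule the spider through Scrapyd's JSON API
 curl http://localhost:6800/schedule.json -d project=myproject -d spider=blogspider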

Fast and powerful

Write the rules to extract the data and let Scrapy do the rest.
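As a sketch of what those rules look like (the quotes.toscrape.com demo site and its selectors are just an example), the spider only declares what to extract and which links to follow; Scrapy schedules the requests and runs them concurrently:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        # The extraction rules: which elements to pull from each page
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }
        # Follow pagination; Scrapy downloads these pages concurrently
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), self.parse)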

Easily extensible

Extensible by design: plug in new functionality easily without having to touch the core.
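As one sketch of that extension model, a custom item pipeline (the class and project names below are hypothetical) is a plain Python class enabled through settings, with no changes to Scrapy's own code:

# pipelines.py -- a small, hypothetical item pipeline
class TitleCleanupPipeline:
    def process_item(self, item, spider):
        # Normalize whitespace in each scraped title before it is exported
        item['title'] = item['title'].strip()
        return item

# settings.py -- plug the pipeline in; the integer sets its order among pipelines
ITEM_PIPELINES = {
    'myproject.pipelines.TitleCleanupPipeline': 300,
}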

Portable, Python

Written in Python, Scrapy runs on Linux, Windows, macOS and BSD.

Healthy community