Meet Scrapy

An open source and collaborative framework for extracting the data you need from websites, in a fast, simple, yet extensible way.

Install the latest version:

Scrapy 1.0

pip install scrapy

Sample code:

Scrapy 1.0:

pip install scrapy
cat > myspider.py <<EOF
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://blog.scrapinghub.com']

    def parse(self, response):
        # Follow links to the monthly archive pages (URLs ending in /YYYY/MM/)
        for url in response.css('ul li a::attr("href")').re(r'.*/\d\d\d\d/\d\d/$'):
            yield scrapy.Request(response.urljoin(url), self.parse_titles)

    def parse_titles(self, response):
        # Extract the title of each post listed on an archive page
        for post_title in response.css('div.entries > ul > li a::text').extract():
            yield {'title': post_title}
EOF
scrapy runspider myspider.py
Scrapy 0.24 (old stable):

pip install scrapy
cat > myspider.py <<EOF
import scrapy
from urlparse import urljoin

class Post(scrapy.Item):
    title = scrapy.Field()

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://blog.scrapinghub.com']

    def parse(self, response):
        # Follow links to the monthly archive pages (URLs ending in /YYYY/MM/)
        for url in response.css('ul li a::attr("href")').re(r'.*/\d\d\d\d/\d\d/$'):
            yield scrapy.Request(urljoin(response.url, url), self.parse_titles)

    def parse_titles(self, response):
        # Extract each post title as a Post item
        for post_title in response.css('div.entries > ul > li a::text').extract():
            yield Post(title=post_title)
EOF
scrapy runspider myspider.py
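Either sample prints its scraped items to the log by default; Scrapy's built-in feed exports can save them to a file instead via the -o option (the filename here is just an illustration):

scrapy runspider myspider.py -o posts.json

Scrapy infers the export format (JSON in this case) from the file extension.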

Build your own web crawlers

Fast and powerful

Write the rules to extract the data and let Scrapy do the rest.
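As a sketch of what such rules look like, Scrapy's Selector lets you try extraction rules against raw HTML outside a spider (the HTML snippet below is made up for illustration):

from scrapy.selector import Selector

html = '<ul><li><a href="/2015/07/">July 2015</a></li></ul>'  # hypothetical markup
sel = Selector(text=html)
# The same kind of CSS rules used by the spiders above
print(sel.css('li a::text').extract())        # ['July 2015']
print(sel.css('li a::attr(href)').extract())  # ['/2015/07/']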

Easily extensible

Extensible by design: plug in new functionality easily without having to touch the core.
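One common extension point, for example, is an item pipeline: a plain class that Scrapy calls for every scraped item, enabled through the ITEM_PIPELINES setting rather than by changing Scrapy's core. A minimal sketch, with hypothetical class and module names:

# pipelines.py (hypothetical module in your project)
class StripTitlePipeline(object):
    # Called once for every item a spider yields
    def process_item(self, item, spider):
        item['title'] = item['title'].strip()
        return item

# Enabled in settings.py with a priority number:
# ITEM_PIPELINES = {'myproject.pipelines.StripTitlePipeline': 300}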

Portable, Python

Written in Python, it runs on Linux, Windows, Mac and BSD.

Healthy community: