An open source and collaborative framework for extracting the data you need from websites.
In a fast, simple, yet extensible way.
Terminal
pip install scrapy
cat > myspider.py <<EOF
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://blog.scrapinghub.com']

    def parse(self, response):
        for url in response.css('ul li a::attr("href")').re(r'.*/\d\d\d\d/\d\d/$'):
            yield scrapy.Request(response.urljoin(url), self.parse_titles)

    def parse_titles(self, response):
        for post_title in response.css('div.entries > ul > li a::text').extract():
            yield {'title': post_title}
EOF
scrapy runspider myspider.py
Build and run your
web spiders
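The same spider can also be driven from a Python script rather than the scrapy command. A minimal sketch, assuming the myspider.py file written in the terminal example above; the items.jl output path is only illustrative:

# Run BlogSpider from a script instead of `scrapy runspider`, exporting
# each scraped item as one JSON object per line. The output filename is
# an assumption for illustration.
from scrapy.crawler import CrawlerProcess

from myspider import BlogSpider  # the spider defined in the terminal example

process = CrawlerProcess(settings={
    'FEED_FORMAT': 'jsonlines',
    'FEED_URI': 'items.jl',
})
process.crawl(BlogSpider)
process.start()  # blocks until the crawl finishes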
Terminal
shub login
Insert your Scrapinghub API Key: <API_KEY>

# Deploy the spider to Scrapy Cloud
shub deploy

# Schedule the spider for execution
shub schedule blogspider
Spider blogspider scheduled, watch it running here:
https://dash.scrapinghub.com/p/26731/job/1/18

# Retrieve the scraped data
shub items 26731/1/8
{"title": "Black Friday, Cyber Monday: Are They Worth It?"}
{"title": "Tips for Creating a Cohesive Company Culture Remotely"}
...
Deploy them to
Scrapy Cloud
or use Scrapyd to host the spiders on your own server
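When hosting spiders on your own Scrapyd server, scheduling goes through Scrapyd's HTTP JSON API. A minimal sketch, assuming a Scrapyd instance listening on localhost:6800 and a project named 'myproject' already deployed (for example with scrapyd-deploy):

# Schedule blogspider on a self-hosted Scrapyd server via its HTTP API.
# The host, port and project name here are assumptions for illustration.
import requests

response = requests.post(
    'http://localhost:6800/schedule.json',
    data={'project': 'myproject', 'spider': 'blogspider'},
)
print(response.json())  # e.g. {'status': 'ok', 'jobid': '...'}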
Fast and powerful
write the rules to extract the data and let Scrapy do the rest
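One way this looks in practice is a CrawlSpider, where the link-following rules are declared up front and Scrapy handles scheduling, fetching and retries. A minimal sketch; the domain and CSS selectors are purely illustrative, not taken from the example above:

# A rule-driven crawl: declare which links to follow and which callback
# extracts data, then let Scrapy do the rest. Domain and selectors are
# illustrative assumptions.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ProductSpider(CrawlSpider):
    name = 'products'
    start_urls = ['http://example.com/catalog']

    rules = (
        # Follow category listing pages; no callback means keep crawling.
        Rule(LinkExtractor(allow=r'/category/')),
        # Hand product pages to parse_product for extraction.
        Rule(LinkExtractor(allow=r'/product/'), callback='parse_product'),
    )

    def parse_product(self, response):
        yield {
            'name': response.css('h1::text').extract_first(),
            'price': response.css('.price::text').extract_first(),
        }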
Easily extensible
extensible by design, plug new functionality easily without having to touch the core
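A common extension point is an item pipeline, which post-processes every item a spider yields and is enabled purely through settings. A minimal sketch; the pipeline class, field names and price logic are made up for illustration:

# Plugging in behaviour without touching Scrapy's core: an item pipeline
# that drops items with no price and normalises the rest.
from scrapy.exceptions import DropItem

class PriceCleanupPipeline(object):
    def process_item(self, item, spider):
        if not item.get('price'):
            raise DropItem('missing price in %s' % item)
        item['price'] = float(item['price'].strip().lstrip('$'))
        return item

# Enabled in settings.py; the number controls the order pipelines run in:
# ITEM_PIPELINES = {'myproject.pipelines.PriceCleanupPipeline': 300}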
Portable, Python
written in Python and runs on Linux, Windows, Mac and BSD
Healthy community
- 11k stars, 2.8k forks and 900 watchers on GitHub
- 2.2k followers on Twitter
- 4.5k questions on StackOverflow
- 2.5k members on the mailing list