An open source and collaborative framework for extracting the data you need from websites, in a fast, simple, yet extensible way.
Install the latest version of Scrapy (Scrapy 1.0):
pip install scrapy
Terminal
pip install scrapy
cat > myspider.py <<EOF
import scrapy


class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['https://blog.scrapinghub.com']

    def parse(self, response):
        # Follow only the links that lead to category pages
        for url in response.css('ul li a::attr("href")').re('.*/category/.*'):
            yield scrapy.Request(response.urljoin(url), self.parse_titles)

    def parse_titles(self, response):
        # Yield one item per post title listed on the category page
        for post_title in response.css('div.entries > ul > li a::text').extract():
            yield {'title': post_title}
EOF
scrapy runspider myspider.py
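The same spider can also be driven from a plain Python script instead of the scrapy runspider command, using Scrapy's CrawlerProcess API. A minimal sketch, assuming the myspider.py file from above is importable; run_blogspider.py and items.jl are illustrative names, not part of the example above:

# run_blogspider.py -- a minimal sketch: run BlogSpider programmatically.
from scrapy.crawler import CrawlerProcess

from myspider import BlogSpider  # assumes myspider.py is on the import path

process = CrawlerProcess({
    'FEED_FORMAT': 'jsonlines',  # write one JSON object per line
    'FEED_URI': 'items.jl',      # illustrative output file name
})
process.crawl(BlogSpider)
process.start()  # blocks until the crawl is finished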
Build and run your web spiders, then deploy them to Scrapy Cloud
Terminal
shub login
Insert your Scrapinghub API Key: <API_KEY>

# Deploy the spider to Scrapy Cloud
shub deploy

# Schedule the spider for execution
shub schedule blogspider
Spider blogspider scheduled, watch it running here:
https://dash.scrapinghub.com/p/26731/job/1/8

# Retrieve the scraped data
shub items 26731/1/8

{"title": "Black Friday, Cyber Monday: Are They Worth It?"}
{"title": "Tips for Creating a Cohesive Company Culture Remotely"}
...