Three introductory Scrapy use cases

mac · 2022-06-30

Note: this article follows the dmoz spider example from the official documentation.

That example is quite old, however, and the page structure of dmoz.org has since changed, so I have adjusted the XPath expressions accordingly.

Summary: this article presents three introductory Scrapy use cases:

1. Scrape a single page.
2. Start from a directory page and scrape every page it links to.
3. Scrape the first page, then follow its link to the next page, and so on until the end.

Scenarios two and three can both be regarded as forms of link following ("Following links").

The defining feature of link following is that, before the parse method finishes, it must yield a scrapy.Request instance with a callback.
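The request/callback mechanism behind link following can be illustrated without Scrapy at all. Below is a framework-free sketch (the site data and all function names are made up for illustration): a tiny scheduler pops pending "requests", calls the associated callback, enqueues any (url, callback) tuple the callback yields as a new request, and collects everything else as a scraped item.

```python
# A framework-free sketch of Scrapy's request/callback loop.
# FAKE_SITE and the parse_* functions are hypothetical.
FAKE_SITE = {
    "/index": ["/page1", "/page2"],  # directory page -> links
    "/page1": "content one",
    "/page2": "content two",
}

def parse_index(url):
    # Like parse(): extract links, yield follow-up "requests" with a callback.
    for link in FAKE_SITE[url]:
        yield (link, parse_page)

def parse_page(url):
    # Like a second-stage callback: yield a scraped item.
    yield {"url": url, "text": FAKE_SITE[url]}

def crawl(start_url, start_callback):
    items, pending = [], [(start_url, start_callback)]
    while pending:
        url, callback = pending.pop()
        for result in callback(url):
            if isinstance(result, tuple):   # a new request to follow
                pending.append(result)
            else:                           # a finished item
                items.append(result)
    return items

print(crawl("/index", parse_index))
```

In real Scrapy the scheduler, downloading, and deduplication are far more involved, but the control flow a spider author sees is exactly this: yield items, or yield requests that name their callback.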

This article is based on: Windows 7 (64-bit) + Python 3.5 (64-bit) + Scrapy 1.2

Scenario one

Description:

Scrape the content of a single page.

Example code:

```python
import scrapy

from tutorial.items import DmozItem


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        for div in response.xpath('//div[@class="title-and-desc"]'):
            item = DmozItem()
            item['title'] = div.xpath('a/div/text()').extract_first().strip()
            item['link'] = div.xpath('a/@href').extract_first()
            item['desc'] = div.xpath('div[@class="site-descr "]/text()').extract_first().strip()
            yield item
```
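To see what those three XPath expressions actually select, here is a standard-library-only demonstration. The HTML below is a simplified, hypothetical snapshot of the old dmoz listing markup (note the trailing space in `site-descr `, which matches the class name the spider queries):

```python
# Demonstrates, using only the standard library, what the spider's
# XPath expressions pick out. SAMPLE is hypothetical markup modeled
# on the dmoz listing structure the spider targets.
import xml.etree.ElementTree as ET

SAMPLE = """
<body>
  <div class="title-and-desc">
    <a href="http://example.com/book"><div>A Python Book </div></a>
    <div class="site-descr "> An example description. </div>
  </div>
</body>
"""

root = ET.fromstring(SAMPLE)
items = []
for div in root.findall(".//div[@class='title-and-desc']"):
    items.append({
        'title': div.find('a/div').text.strip(),   # like a/div/text()
        'link': div.find('a').get('href'),         # like a/@href
        'desc': div.find("div[@class='site-descr ']").text.strip(),
    })
print(items)
```

ElementTree only supports a small XPath subset, so `text()` becomes `.text` and `@href` becomes `.get('href')`, but the selection logic mirrors the spider's.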

Scenario two

Description:

① Visit the directory page and extract the links. ② Then scrape the content of each linked page. The callback of the scrapy.Request yielded in ① points to ②.

From the official documentation:

...extract the links for the pages you are interested, follow them and then extract the data you want for all of them.

Example code:

```python
import scrapy

from tutorial.items import DmozItem


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        'http://www.dmoz.org/Computers/Programming/Languages/Python/'  # the directory page
    ]

    def parse(self, response):
        for a in response.xpath('//section[@id="subcategories-section"]//div[@class="cat-item"]/a'):
            # split('/')[-2] takes the last path segment of the href,
            # which urljoin then resolves against the current page URL
            url = response.urljoin(a.xpath('@href').extract_first().split('/')[-2])
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        for div in response.xpath('//div[@class="title-and-desc"]'):
            item = DmozItem()
            item['title'] = div.xpath('a/div/text()').extract_first().strip()
            item['link'] = div.xpath('a/@href').extract_first()
            item['desc'] = div.xpath('div[@class="site-descr "]/text()').extract_first().strip()
            yield item
```
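A note on the URL handling above: an href like `/Computers/Programming/Languages/Python/Books/` splits on `/` into a list whose last element is an empty string, so `[-2]` yields the final segment, `Books`. Scrapy's `response.urljoin(href)` then behaves like the standard library's `urllib.parse.urljoin(response.url, href)`. The URLs below are illustrative:

```python
# response.urljoin(href) resolves href against response.url, just as
# urllib.parse.urljoin does. The URLs here are illustrative.
from urllib.parse import urljoin

base = 'http://www.dmoz.org/Computers/Programming/Languages/Python/'

# The split('/')[-2] trick: the trailing slash makes the last element '',
# so [-2] is the final real segment.
href = '/Computers/Programming/Languages/Python/Books/'
segment = href.split('/')[-2]
print(segment)                 # 'Books'

# A bare segment is resolved relative to the current directory:
print(urljoin(base, segment))  # .../Languages/Python/Books

# An absolute path replaces everything after the host:
print(urljoin(base, '/A/B/'))  # http://www.dmoz.org/A/B/
```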

Scenario three

Description:

① Visit a page, scrape its content, and extract the link to the next page. ② Then scrape the content of that next page. The callback of the scrapy.Request yielded in ① points back to ① itself.

From the official documentation:

A common pattern is a callback method that extracts some items, looks for a link to follow to the next page and then yields a Request with the same callback for it.

Example code:

```python
import scrapy

from myproject.items import MyItem


class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield MyItem(title=h3)

        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)
```
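The distinctive part of this pattern is that the request's callback is `self.parse`, the very method that yielded it, so one method handles every page in the chain. A framework-free sketch of that pagination flow (the pages and their contents are hypothetical):

```python
# A framework-free sketch of the "same callback" pagination pattern.
# PAGES is a hypothetical site: each page has items and maybe a next link.
PAGES = {
    '/1.html': {'items': ['a', 'b'], 'next': '/2.html'},
    '/2.html': {'items': ['c'], 'next': '/3.html'},
    '/3.html': {'items': ['d'], 'next': None},
}

def parse(url):
    page = PAGES[url]
    # First yield the items found on this page...
    for item in page['items']:
        yield ('item', item)
    # ...then, if there is a next page, yield a "request" handled by
    # parse itself -- the same function processes every page.
    if page['next']:
        yield ('request', page['next'])

def crawl(start):
    collected, pending = [], [start]
    while pending:
        for kind, value in parse(pending.pop()):
            if kind == 'request':
                pending.append(value)
            else:
                collected.append(value)
    return collected

print(crawl('/1.html'))  # ['a', 'b', 'c', 'd']
```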

Note: the third scenario has not been tested!

Reposted from: https://www.cnblogs.com/hhh5460/p/5821501.html
