Implementing a WeChat Mini Program Community Crawler with CrawlSpider

mac  2025-03-18

In the newly created package directory, create a scrapy project. In cmd: scrapy startproject wxapp

After it is created successfully: cd wxapp

Create the wxapp_spider spider: scrapy genspider -t crawl wxapp_spider "wxapp-union.com"

Write the following code in wxapp_spider.py:

```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class WxappSpiderSpider(CrawlSpider):
    name = 'wxapp_spider'
    allowed_domains = ['wxapp-union.com']
    start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=1&page=1']

    rules = (
        # List pages: keep following pagination links, no callback needed
        Rule(LinkExtractor(allow=r'.+mod=list&catid=1&page=\d'), follow=True),
        # Article pages: hand each one to parse_detail, don't follow further
        Rule(LinkExtractor(allow=r'.+article-.+\.html'), callback='parse_detail', follow=False),
    )

    def parse_detail(self, response):
        title = response.xpath("//h1[@class='ph']/text()").get()
        print(title)
```
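The two rules split the site's URLs into list pages (followed for more links, never parsed) and article pages (handed to parse_detail). A quick standalone check of the two regex patterns against sample URLs (the sample URLs are illustrative, not taken from the site):

```python
import re

list_pat = r'.+mod=list&catid=1&page=\d'
article_pat = r'.+article-.+\.html'

# A pagination URL matches the first rule, so it would be followed
print(bool(re.match(list_pat, 'http://www.wxapp-union.com/portal.php?mod=list&catid=1&page=2')))  # True

# A hypothetical article URL matches the second rule, so it would go to parse_detail
print(bool(re.match(article_pat, 'http://www.wxapp-union.com/article-8267-1.html')))  # True
```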

In settings.py, adjust the configuration: change ROBOTSTXT_OBEY from the default True to False.

Uncomment the DEFAULT_REQUEST_HEADERS block and add a User-Agent, which you can copy from your browser.
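Put together, the settings.py changes look roughly like this (the User-Agent string here is just an example; replace it with the one copied from your own browser):

```python
# settings.py

# Default is True; the site's robots.txt would otherwise block the crawl
ROBOTSTXT_OBEY = False

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    # Example value -- copy the real one from your browser's dev tools
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/120.0 Safari/537.36',
}
```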

Add a start_project.py file under the wxapp directory.

The code is as follows:

```python
from scrapy import cmdline

# Equivalent to running "scrapy crawl wxapp_spider" on the command line
cmdline.execute("scrapy crawl wxapp_spider".split())
```

Run it, and it prints the article titles, along with the URLs being requested in the scrapy log.
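The XPath in parse_detail pulls the text of the h1 element with class "ph". You can check the equivalent lookup outside scrapy with the standard library on a simplified snippet (the HTML below is a made-up stand-in for the real article page):

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for an article page (structure assumed from the spider's XPath)
html = "<html><body><h1 class='ph'>Example title</h1></body></html>"
root = ET.fromstring(html)

# scrapy's response.xpath("//h1[@class='ph']/text()").get() roughly corresponds to:
title = root.find(".//h1[@class='ph']").text
print(title)  # Example title
```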
