Create a new Scrapy project under your package directory. In cmd, run: scrapy startproject wxapp
Once the project is created, change into it: cd wxapp
Generate a crawler named wxapp_spider from the crawl template: scrapy genspider -t crawl wxapp_spider "wxapp-union.com"
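At this point the project tree should look roughly like the sketch below (exact files may vary slightly by Scrapy version); wxapp_spider.py, edited in the next step, lives under spiders/:

wxapp/
    scrapy.cfg
    wxapp/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            wxapp_spider.py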
Write the following code in wxapp_spider.py:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class WxappSpiderSpider(CrawlSpider):
    name = 'wxapp_spider'
    allowed_domains = ['wxapp-union.com']
    start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=1&page=1']

    rules = (
        # Follow the paginated list pages, but do not parse them directly
        Rule(LinkExtractor(allow=r'.+mod=list&catid=1&page=\d'), follow=True),
        # Send each article detail page to parse_detail; do not follow its links further
        Rule(LinkExtractor(allow=r'.+article-.+\.html'), callback='parse_detail', follow=False),
    )

    def parse_detail(self, response):
        title = response.xpath("//h1[@class='ph']/text()").get()
        print(title)

Then adjust the relevant configuration in settings.py: change the default ROBOTSTXT_OBEY = True to False.
Also uncomment the DEFAULT_REQUEST_HEADERS block and add a User-Agent header, which you can copy from your own browser.
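The relevant part of settings.py then looks roughly like this (the User-Agent string below is only a placeholder; substitute the one copied from your browser):

# settings.py
ROBOTSTXT_OBEY = False

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    # Placeholder: replace with the User-Agent copied from your browser
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
}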
Add a start_project.py file under the wxapp directory, with the following code:
from scrapy import cmdline

# Equivalent to running "scrapy crawl wxapp_spider" on the command line
cmdline.execute("scrapy crawl wxapp_spider".split())

Run it, and the article titles are printed, while Scrapy's log output shows each URL being visited.
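Printing is fine for a quick check, but a more idiomatic variant (a sketch of my own, not part of the original tutorial) yields each title as an item so the results flow through Scrapy's pipelines and feed exporters:

    def parse_detail(self, response):
        title = response.xpath("//h1[@class='ph']/text()").get()
        # Yield a dict item instead of printing, so exporters can collect it
        yield {'title': title, 'url': response.url}

With that change, running scrapy crawl wxapp_spider -o titles.json writes all scraped titles and URLs to a JSON file.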