Reference article: https://www.cnblogs.com/liuqingzheng/articles/10261760.html
1. Installation:
#Windows
1. pip3 install wheel
   # After this, packages can be installed from .whl files; wheel downloads: https://www.lfd.uci.edu/~gohlke/pythonlibs
2. pip3 install lxml
3. pip3 install pyopenssl
4. Download and install pywin32: https://sourceforge.net/projects/pywin32/files/pywin32/
5. Download the Twisted wheel file: http://www.lfd.uci.edu/~gohlke/pythonlibs/
   # Search for "twisted" and pick the wheel that matches your Python version
6. pip3 install <download dir>\Twisted-17.9.0-cp36-cp36m-win_amd64.whl
7. pip3 install scrapy

#Linux
1. pip3 install scrapy
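After installing, a quick optional check that Scrapy is importable (a minimal sketch, equivalent to running scrapy version from the command line):

# verify_install.py -- optional sanity check after installation
import scrapy
print(scrapy.__version__)   # prints the installed Scrapy version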
2. Command line
#1 View help
scrapy -h
scrapy <command> -h

#2 There are two kinds of commands: Project-only commands must be run from inside a project directory, while Global commands can be run anywhere
Global commands:
    startproject *   # create a project
    genspider *      # create a spider, e.g. cd myscrapy | scrapy genspider tmall www.tmall.com
    settings         # if run inside a project directory, shows that project's settings
    runspider        # run a standalone python spider file without creating a project (see the sketch after this list)
    shell            # scrapy shell <url>  interactive debugging, e.g. checking whether selector rules are correct
    fetch            # fetch a single page on its own, outside any project; useful for inspecting the request headers
    view             # download the page and open it in a browser, which helps tell which data comes from ajax requests
    version *        # scrapy version shows Scrapy's version; scrapy version -v also shows the versions of its dependencies
Project-only commands:
    crawl *          # scrapy crawl tmall  run a spider; requires a project, and make sure ROBOTSTXT_OBEY = False in the settings
    check            # check the project for syntax errors
    list             # list the spiders contained in the project
    edit             # open a spider in an editor; rarely used
    parse            # scrapy parse <url> --callback <callback>  verify that a callback function parses as expected
    bench            # scrapy bench  run a benchmark / stress test
#3 Official docs
https://docs.scrapy.org/en/latest/topics/commands.html
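As an illustration of the runspider command above, here is a minimal standalone spider (a sketch; the file and spider names are made up for this example, and the target URL is the same httpbin test URL used later in this article). It runs without creating a project:

# standalone_spider.py -- run with: scrapy runspider standalone_spider.py
import scrapy

class StandaloneSpider(scrapy.Spider):
    name = 'standalone'
    start_urls = ['http://httpbin.org/get']

    def parse(self, response):
        # print the status code and body to confirm the request went through
        print(response.status)
        print(response.text)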
3. Project structure and a brief spider example
project_name/
scrapy.cfg
project_name/
__init__.py
items.py
pipelines.py
settings.py
spiders/
__init__.py
spider1.py
spider2.py
spider3.py
File descriptions:
scrapy.cfg    the project's main configuration, used when deploying Scrapy; crawler-related settings live in settings.py.
items.py      data storage templates used to structure the scraped data, similar to Django's Model.
pipelines.py  data-processing behaviour, e.g. persisting the structured data (a minimal sketch follows below).
settings.py   configuration such as recursion depth, concurrency, download delay, etc. Note: option names must be UPPERCASE, otherwise they are ignored; correct form: USER_AGENT = 'xxxx'.
spiders       the spider directory; create files here and write the crawling rules.
Note: spider files are usually named after the target site's domain.
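Since pipelines.py is where structured items get persisted, here is a minimal pipeline sketch (an illustration only, not part of the original project; it assumes the MyprojectItem defined further below and writes each item to a JSON-lines file). To take effect it must be registered in ITEM_PIPELINES, as shown in the settings excerpt later in this article.

# pipelines.py -- a minimal persistence sketch (hypothetical, for illustration)
import json

class MyprojectPipeline(object):
    def open_spider(self, spider):
        # called once when the spider starts
        self.f = open('items.jl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # write each item as one JSON line
        self.f.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        # called once when the spider finishes
        self.f.close()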
# tmall.py
# -*- coding: utf-8 -*-
import scrapy
from urllib.parse import urlencode
from ..items import MyprojectItem


class TmallSpider(scrapy.Spider):
    name = 'tmall'
    allowed_domains = ['www.tmall.com', 'httpbin.org']
    # start URLs
    # By default the parent class's start_requests iterates start_urls; since start_requests
    # is overridden below, there is no need to define start_urls here.
    # start_urls = ['http://httpbin.org/get']

    # custom request headers
    custom_settings = {
        'DEFAULT_REQUEST_HEADERS': {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
        }
    }

    def __init__(self, pro=None, *args, **kwargs):
        super(TmallSpider, self).__init__(*args, **kwargs)
        self.params = {
            'q': pro,
            'totalPage': 1,
            'jumpto': 1,
        }
        # a single URL string is fine here because start_requests below uses it directly
        self.start_urls = 'http://list.tmall.com/search_product.htm?' + urlencode(self.params)

    # override the parent class's start_requests
    def start_requests(self):
        # for url in self.start_urls:
        # dont_filter=True disables deduplication for this request
        yield scrapy.Request(url=self.start_urls, callback=self.get_totallpage, dont_filter=True)

    # parse callback (by default the first responses would go to parse)
    def get_totallpage(self, response):
        # print('I was parsed')
        # print(response.text)
        res = response.css('[name="totalPage"]::attr(value)').extract_first()
        self.params['totalPage'] = int(res)
        for i in range(1, self.params['totalPage'] + 1):
            # for i in range(1,2):
            self.params['jumpto'] = i
            self.url = 'http://list.tmall.com/search_product.htm?' + urlencode(self.params)
            yield scrapy.Request(url=self.url, callback=self.get_info, dont_filter=True)

    def get_info(self, response):
        elements = response.css('[class="product "]')
        for element in elements:
            title = element.css('[class="productTitle"] a::attr(title)').extract_first()
            price = element.css('[class="productPrice"] em::attr(title)').extract_first()
            print(title, price)
            item = MyprojectItem()
            item['title'] = title   # Item fields are assigned like dict keys
            item['price'] = price
            yield item
# items.py
import scrapy

class MyprojectItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    price = scrapy.Field()
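For the spider above to run, the project settings need ROBOTSTXT_OBEY = False (as noted in the crawl command description), and the pipeline sketched earlier would be registered there as well. A minimal settings.py excerpt (the module path "myproject" is an assumption for illustration):

# settings.py (excerpt) -- illustrative values, assuming the class names used in this article
ROBOTSTXT_OBEY = False          # required so that scrapy crawl tmall actually fetches pages

ITEM_PIPELINES = {
    'myproject.pipelines.MyprojectPipeline': 300,   # enable the persistence pipeline sketched above
}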
Configuring proxy IPs:
# middlewares.py
import requests
from scrapy import signals


class MyprojectDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        # Attach a proxy fetched from the local proxy pool
        proxy = requests.get(url='http://127.0.0.1:5010/get').text
        request.meta['proxy'] = 'http://' + proxy
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        # If the proxy IP got banned, remove it from the pool, fetch a new
        # one, and retry the request with the new proxy
        old_ip = request.meta['proxy'].split('//')[1]
        requests.get('http://127.0.0.1:5010/delete/?proxy={}'.format(old_ip))
        proxy = requests.get(url='http://127.0.0.1:5010/get').text
        request.meta['proxy'] = 'http://' + proxy
        return request

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
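The middleware above assumes a local proxy-pool service serving /get and /delete at http://127.0.0.1:5010, and it only takes effect once it is enabled in the project settings. A minimal excerpt (the module path and priority value are assumptions for illustration):

# settings.py (excerpt) -- enabling the proxy middleware above (assumed module path)
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.MyprojectDownloaderMiddleware': 543,
}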
Run script:
# Create run.py in the project directory:
from scrapy.cmdline import execute

# -a pro=男装 passes pro (the search keyword, here "menswear") to TmallSpider.__init__; --nolog suppresses log output
execute(['scrapy', 'crawl', 'tmall', '-a', 'pro=男装', '--nolog'])
# execute(['scrapy', 'crawl', 'tmall', '-a', 'pro=男装'])
# execute(['scrapy', 'crawl', 'tmall'])
Reposted from: https://www.cnblogs.com/HZLS/p/11551299.html