Reference article: https://www.cnblogs.com/liuqingzheng/articles/10261760.html
1. Installation:
#Windows
1. pip3 install wheel
   # After this, packages can be installed from .whl files; wheel downloads: https://www.lfd.uci.edu/~gohlke/pythonlibs
2. pip3 install lxml
3. pip3 install pyopenssl
4. Download and install pywin32: https://sourceforge.net/projects/pywin32/files/pywin32/
5. Download the Twisted wheel file: http://www.lfd.uci.edu/~gohlke/pythonlibs/
   # Search for "twisted" and pick the wheel that matches your Python version
6. pip3 install <download dir>\Twisted-17.9.0-cp36-cp36m-win_amd64.whl
7. pip3 install scrapy

#Linux
1. pip3 install scrapy
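After installing, a quick optional check that Scrapy is importable (a minimal sketch, equivalent to running scrapy version from the command line):

# verify_install.py -- optional sanity check after installation
import scrapy
print(scrapy.__version__)   # prints the installed Scrapy version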
2. Command line
#1 View help
scrapy -h
scrapy <command> -h

#2 There are two kinds of commands: Project-only commands must be run from inside a project directory, while Global commands can be run anywhere
Global commands:
    startproject *   # create a project
    genspider *      # create a spider, e.g. cd myscrapy | scrapy genspider tmall www.tmall.com
    settings         # if run inside a project directory, shows that project's settings
    runspider        # run a standalone python spider file without creating a project (see the sketch after this list)
    shell            # scrapy shell <url>  interactive debugging, e.g. checking whether selector rules are correct
    fetch            # fetch a single page on its own, outside any project; useful for inspecting the request headers
    view             # download the page and open it in a browser, which helps tell which data comes from ajax requests
    version *        # scrapy version shows Scrapy's version; scrapy version -v also shows the versions of its dependencies
Project-only commands:
    crawl *          # scrapy crawl tmall  run a spider; requires a project, and make sure ROBOTSTXT_OBEY = False in the settings
    check            # check the project for syntax errors
    list             # list the spiders contained in the project
    edit             # open a spider in an editor; rarely used
    parse            # scrapy parse <url> --callback <callback>  verify that a callback function parses as expected
    bench            # scrapy bench  run a benchmark / stress test
#3 Official docs
https://docs.scrapy.org/en/latest/topics/commands.html
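As an illustration of the runspider command above, here is a minimal standalone spider (a sketch; the file and spider names are made up for this example, and the target URL is the same httpbin test URL used later in this article). It runs without creating a project:

# standalone_spider.py -- run with: scrapy runspider standalone_spider.py
import scrapy

class StandaloneSpider(scrapy.Spider):
    name = 'standalone'
    start_urls = ['http://httpbin.org/get']

    def parse(self, response):
        # print the status code and body to confirm the request went through
        print(response.status)
        print(response.text)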
3. Project structure and a brief spider example
project_name/
scrapy.cfg
project_name/
__init__.py
items.py
pipelines.py
settings.py
spiders/
__init__.py
spider1.py
spider2.py
spider3.py
File descriptions:
scrapy.cfg    the project's main configuration, used when deploying Scrapy; crawler-related settings live in settings.py.
items.py      data storage templates used to structure the scraped data, similar to Django's Model.
pipelines.py  data-processing behaviour, e.g. persisting the structured data (a minimal sketch follows below).
settings.py   configuration such as recursion depth, concurrency, download delay, etc. Note: option names must be UPPERCASE, otherwise they are ignored; correct form: USER_AGENT = 'xxxx'.
spiders       the spider directory; create files here and write the crawling rules.
Note: spider files are usually named after the target site's domain.
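Since pipelines.py is where structured items get persisted, here is a minimal pipeline sketch (an illustration only, not part of the original project; it assumes the MyprojectItem defined further below and writes each item to a JSON-lines file). To take effect it must be registered in ITEM_PIPELINES, as shown in the settings excerpt later in this article.

# pipelines.py -- a minimal persistence sketch (hypothetical, for illustration)
import json

class MyprojectPipeline(object):
    def open_spider(self, spider):
        # called once when the spider starts
        self.f = open('items.jl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # write each item as one JSON line
        self.f.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        # called once when the spider finishes
        self.f.close()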
# tmall.py
# -*- coding: utf-8 -*-
import scrapy
from urllib.parse import urlencode
from ..items import MyprojectItem


class TmallSpider(scrapy.Spider):
    name = 'tmall'
    allowed_domains = ['www.tmall.com', 'httpbin.org']
    # start URLs
    # By default the parent class's start_requests iterates start_urls; since start_requests
    # is overridden below, there is no need to define start_urls here.
    # start_urls = ['http://httpbin.org/get']

    # custom request headers
    custom_settings = {
        'DEFAULT_REQUEST_HEADERS': {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
        }
    }

    def __init__(self, pro=None, *args, **kwargs):
        super(TmallSpider, self).__init__(*args, **kwargs)
        self.params = {
            'q': pro,
            'totalPage': 1,
            'jumpto': 1,
        }
        # a single URL string is fine here because start_requests below uses it directly
        self.start_urls = 'http://list.tmall.com/search_product.htm?' + urlencode(self.params)

    # override the parent class's start_requests
    def start_requests(self):
        # for url in self.start_urls:
        # dont_filter=True disables deduplication for this request
        yield scrapy.Request(url=self.start_urls, callback=self.get_totallpage, dont_filter=True)

    # parse callback (by default the first responses would go to parse)
    def get_totallpage(self, response):
        # print('I was parsed')
        # print(response.text)
        res = response.css('[name="totalPage"]::attr(value)').extract_first()
        self.params['totalPage'] = int(res)
        for i in range(1, self.params['totalPage'] + 1):
            # for i in range(1,2):
            self.params['jumpto'] = i
            self.url = 'http://list.tmall.com/search_product.htm?' + urlencode(self.params)
            yield scrapy.Request(url=self.url, callback=self.get_info, dont_filter=True)

    def get_info(self, response):
        elements = response.css('[class="product "]')
        for element in elements:
            title = element.css('[class="productTitle"] a::attr(title)').extract_first()
            price = element.css('[class="productPrice"] em::attr(title)').extract_first()
            print(title, price)
            item = MyprojectItem()
            item['title'] = title   # Item fields are assigned like dict keys
            item['price'] = price
            yield item
# items.py
import scrapy

class MyprojectItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    price = scrapy.Field()
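For the spider above to run, the project settings need ROBOTSTXT_OBEY = False (as noted in the crawl command description), and the pipeline sketched earlier would be registered there as well. A minimal settings.py excerpt (the module path "myproject" is an assumption for illustration):

# settings.py (excerpt) -- illustrative values, assuming the class names used in this article
ROBOTSTXT_OBEY = False          # required so that scrapy crawl tmall actually fetches pages

ITEM_PIPELINES = {
    'myproject.pipelines.MyprojectPipeline': 300,   # enable the persistence pipeline sketched above
}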
Configuring proxy IPs:
# middlewares.py
import requests
from scrapy import signals


class MyprojectDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        # Attach a proxy fetched from the local proxy pool
        proxy = requests.get(url='http://127.0.0.1:5010/get').text
        request.meta['proxy'] = 'http://' + proxy
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        # If the proxy IP got banned, remove it from the pool, fetch a new
        # one, and retry the request with the new proxy
        old_ip = request.meta['proxy'].split('//')[1]
        requests.get('http://127.0.0.1:5010/delete/?proxy={}'.format(old_ip))
        proxy = requests.get(url='http://127.0.0.1:5010/get').text
        request.meta['proxy'] = 'http://' + proxy
        return request

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
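The middleware above assumes a local proxy-pool service serving /get and /delete at http://127.0.0.1:5010, and it only takes effect once it is enabled in the project settings. A minimal excerpt (the module path and priority value are assumptions for illustration):

# settings.py (excerpt) -- enabling the proxy middleware above (assumed module path)
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.MyprojectDownloaderMiddleware': 543,
}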
Run script:
# Create run.py in the project directory:
from scrapy.cmdline import execute

# -a pro=男装 passes pro (the search keyword, here "menswear") to TmallSpider.__init__; --nolog suppresses log output
execute(['scrapy', 'crawl', 'tmall', '-a', 'pro=男装', '--nolog'])
# execute(['scrapy', 'crawl', 'tmall', '-a', 'pro=男装'])
# execute(['scrapy', 'crawl', 'tmall'])
Reposted from: https://www.cnblogs.com/HZLS/p/11551299.html