Getting familiar with basic Scrapy operations
Start your project: scrapy startproject xxx
The generated project looks like this:
xxx                   project directory
-- scrapy.cfg         project deployment configuration
-- xxx                project module
   -- __init__.py
   -- items.py        item definitions: the structure of the data to scrape (multiple Item classes can be declared here to manage several structures)
   -- middlewares.py  crawl middleware definitions
   -- pipelines.py    data pipeline definitions
   -- settings.py     project settings
   -- spiders         folder holding the spiders, i.e. the concrete crawling code
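Most early adjustments happen in settings.py. A minimal sketch of a few options people commonly change first (the values below are illustrative assumptions, not required defaults):

```python
# xxx/settings.py -- a few commonly adjusted options (illustrative values)
BOT_NAME = 'xxx'

SPIDER_MODULES = ['xxx.spiders']
NEWSPIDER_MODULE = 'xxx.spiders'

# Whether to respect the target site's robots.txt
ROBOTSTXT_OBEY = True

# Wait between requests (seconds) to be polite to the server
DOWNLOAD_DELAY = 1
```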
Create your spider: scrapy genspider xxx_1 xxx_1.com
-- xxx_1: your spider's name, the unique identifier used when running the crawl
-- xxx_1.com: the domain of the site to scrape; it can be changed later in the code
The generated spider sits at xxx/xxx/spiders/xxx_1.py and looks like this:
import scrapy
from xxx_spider.items import XxxSpiderItem


class Xxx3cSpider(scrapy.Spider):
    name = 'xxx'
    allowed_domains = ['xxx.com']
    start_urls = ['https://xxx.com']

    def parse(self, response):
        pass
Define the item structure by editing xxx/items.py:
import scrapy


class Xxx_1Item(scrapy.Item):
    item_id = scrapy.Field()
    link = scrapy.Field()
    category_path = scrapy.Field()
    title = scrapy.Field()
    price_min = scrapy.Field()
    price_max = scrapy.Field()
    price_unit = scrapy.Field()
    main_image = scrapy.Field()
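A scrapy.Item behaves like a dict, except that it only accepts the declared fields: assigning an undeclared key raises a KeyError. A rough stdlib mimic of that behavior, for intuition only (this is not Scrapy's actual implementation):

```python
class FieldRestrictedDict(dict):
    """Dict that only accepts a fixed set of keys, like scrapy.Item does."""
    fields = {'item_id', 'link', 'title'}  # a subset of the fields above, for brevity

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f'Undeclared field: {key}')
        super().__setitem__(key, value)

data = FieldRestrictedDict()
data['title'] = 'example'       # accepted: a declared field
try:
    data['undeclared'] = 1      # rejected, like an undeclared scrapy.Field
except KeyError:
    pass
```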
Write the crawl logic by editing xxx/xxx/spiders/xxx_1.py:
import scrapy
from xxx_spider.items import XxxSpiderItem


class Xxx3cSpider(scrapy.Spider):
    name = 'xxx'
    allowed_domains = ['xxx.com']
    start_urls = ['https://xxx.com']

    def parse(self, response):
        # All listing cards on the page
        item_list = response.css('#mainContent ul.b-list__items_nofooter li.s-item')
        # Breadcrumb category path, shared by every item on this page
        category_path = response.css('div.pagecontainer nav ol li a.b-link--tertiary::text').extract()
        print(item_list)  # debug output
        for item_d in item_list:
            data = XxxSpiderItem()
            data['category_path'] = category_path
            # item_id reuses the listing link here; derive a real id from it if needed
            data['item_id'] = item_d.css('div.s-item__info a.s-item__link::attr(href)').extract_first()
            data['link'] = item_d.css('div.s-item__info a.s-item__link::attr(href)').extract_first()
            data['title'] = item_d.css('div.s-item__info h3.s-item__title::text').extract_first()
            # All three price fields start from the same raw price text
            data['price_min'] = item_d.css('div.s-item__details span.s-item__price span.ITALIC::text').extract_first()
            data['price_max'] = item_d.css('div.s-item__details span.s-item__price span.ITALIC::text').extract_first()
            data['price_unit'] = item_d.css('div.s-item__details span.s-item__price span.ITALIC::text').extract_first()
            data['main_image'] = item_d.css('div.s-item__image img.s-item__image-img::attr(src)').extract_first()
            yield data
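Note that price_min, price_max, and price_unit are all assigned the same raw extracted text, so the string still needs splitting afterwards. A hypothetical stdlib helper for that step; the 'US $10.99 to US $15.99' format is my assumption for illustration, not taken from the site:

```python
import re

def split_price_range(text):
    """Split a raw price string into (price_min, price_max, price_unit).

    Assumes formats like 'US $10.99' or 'US $10.99 to US $15.99'.
    Returns (None, None, None) for missing input.
    """
    if not text:
        return None, None, None
    # All numbers in the string; a single price yields min == max
    numbers = [float(n) for n in re.findall(r'\d+(?:\.\d+)?', text)]
    # The leading non-digit run is treated as the currency unit, e.g. 'US $'
    prefix = re.match(r'\D+', text)
    unit = prefix.group().strip() if prefix else None
    if not numbers:
        return None, None, unit
    return numbers[0], numbers[-1], unit
```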
Run the spider: scrapy crawl xxx_1 (your spider's name). The results are printed to the terminal: each scraped item, following the structure you defined, is shown as JSON. The next section covers how to store the scraped data.
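To picture what that terminal output looks like, here is a made-up item (all field values are hypothetical) serialized the way the crawl log displays it:

```python
import json

# A hypothetical scraped item matching the fields declared in items.py
item = {
    'item_id': 'https://xxx.com/itm/123',
    'link': 'https://xxx.com/itm/123',
    'title': 'Example listing',
    'price_min': 10.99,
    'price_max': 15.99,
    'price_unit': 'US $',
}
print(json.dumps(item, indent=2))
```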