Several pagination approaches in web crawlers

mac · 2024-03-15

1. The first approach: inspect the page structure and follow the URL found in the next-page `<a>` tag.

```python
if response.xpath('//a[text()="Next »"]/@href'):
    # Take the first matching href of the "Next »" link
    next_page = response.xpath('//a[text()="Next »"]/@href').extract()[0]
    print('next_page', next_page)
    # Resolve the (possibly relative) href against the current URL
    next_page = response.urljoin(next_page)
    yield Request(next_page, callback=self.parse_second, meta={'item': item})
```
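The key step here is resolving the extracted href, which is often relative, into an absolute URL before requesting it. A minimal standalone sketch with made-up placeholder values (Scrapy's `response.urljoin` delegates to the same stdlib function):

```python
from urllib.parse import urljoin

# Placeholder values standing in for response.url and the extracted href
current_url = "http://example.com/catalog/page1.html"
next_href = "page2.html"  # relative href taken from the "Next »" <a> tag

# response.urljoin(next_href) is equivalent to this stdlib call
next_page = urljoin(current_url, next_href)
print(next_page)  # http://example.com/catalog/page2.html
```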

2. The second type is an infinite-scroll page. By observing how the URL changes as the page loads more results, we can simply iterate over the page numbers and request each page directly.

```python
# Enumerate the page numbers directly instead of following links
for page in range(0, 166):
    next_page = 'http://tochka3evlj3sxdv.com/?page={}&city=0&category=0&sortby=popularity&account=all&shipping-to=&shipping-from=&query='.format(page)
    next_page = response.urljoin(next_page)
    yield Request(next_page, callback=self.parse_second, meta={'item': item})
```
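The enumeration itself is plain string formatting over a range; a standalone sketch with a made-up base URL:

```python
# Hypothetical listing URL pattern with a page-number placeholder
base = "http://example.com/?page={}&sortby=popularity"

# Generate the first three page URLs up front instead of following links
urls = [base.format(n) for n in range(0, 3)]
print(urls)
```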

3. In the third case, even the last page still shows a "next page" link, so blindly following that link would trap the spider in an infinite loop. Instead, read the total page count from the last item in the pagination bar and generate the page URLs ourselves.

```python
# The href of the last <li> in the pagination bar carries the total page count
next_pages = response.xpath('//ul[@class="pagination pagination-sm"]/li[last()]/a/@href').extract()[0]
pg = re.findall(r'pg=([\s\S]+)', next_pages)[0]
for i in range(0, int(pg)):
    next_page = response.url + '&pg={}'.format(i)
    print(next_page)
    yield Request(next_page, callback=self.parse_third, meta={'item': item})
```
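The trick is to read the total page count out of the last pagination href once, then generate every page URL. A standalone sketch with a hypothetical href, using `\d+` rather than the looser `[\s\S]+` so any trailing query parameters cannot leak into the match:

```python
import re

# Hypothetical href of the last <li> in the pagination bar
last_href = "/market/search?query=abc&pg=17"

# Extract the page count; \d+ stops at the end of the number
pg = int(re.findall(r'pg=(\d+)', last_href)[0])
urls = ['https://example.com/market/search?query=abc&pg={}'.format(i) for i in range(0, pg)]
print(pg, len(urls))  # 17 17
```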

4. The fourth type paginates by POSTing form parameters, which is more involved than the previous ones: the page source contains no pagination links at all, so we have to find the form parameters and submit them to turn the page.

```python
u = response.url
cat = re.findall(r'&cat=([\s\S]+)', u)[0]
print(cat)
# The value of the last pagination button holds the total page count
page = response.xpath('//button[@class="btn btn-primary"][last()]/@value').extract()[0]
print(page)
for i in range(1, int(page)):
    print(i)
    formdata = {
        'title': '',
        'search': '1',
        'searchcat': '1',
        'dator': '9Ac3^nYrdUjEUa8LRdV7RVHchwj6pC(u',
        'type': 'all',
        'payment': 'all',
        'priceMin': '',
        'priceMax': '',
        'shipsfrom': 'all',
        'shipsto': 'all',
        'field': 'all',
        'order': 'all',
        'displayname': '',
        'cat': cat,
        'page': str(i),
    }
    yield FormRequest(u, formdata=formdata, callback=self.parse_third, meta={'item': item})
```
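Since every page differs only in the `page` (and `cat`) fields, the form body can be factored into a small builder. A standalone sketch mirroring the keys above; the `cat` value and page count are made up, and the session-specific `dator` token from the spider is omitted:

```python
def build_formdata(cat, page):
    """Build the POST body for one result page; only 'cat' and 'page' vary."""
    return {
        'title': '', 'search': '1', 'searchcat': '1',
        'type': 'all', 'payment': 'all',
        'priceMin': '', 'priceMax': '',
        'shipsfrom': 'all', 'shipsto': 'all',
        'field': 'all', 'order': 'all', 'displayname': '',
        'cat': cat,
        'page': str(page),  # FormRequest expects string values
    }

# One body per page, as the loop in the spider would produce
bodies = [build_formdata('5', i) for i in range(1, 4)]
print([b['page'] for b in bodies])  # ['1', '2', '3']
```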

5. The fifth approach turns pages with Selenium.

```python
# Parse the rendered page source with lxml and follow each pagination link
res = etree.HTML(self.browser.page_source)
page = res.xpath('//div[@id="page"]/a/@href')
for u in page:
    next_url = 'https://www.baidu.com' + u
    print(next_url)
    self.browser.get(next_url)
```
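The XPath extraction does not need a live browser to be understood: a standalone sketch that feeds a small hand-written HTML snippet (standing in for `self.browser.page_source`) through the same `lxml` calls:

```python
from lxml import etree

# Hand-written snippet standing in for self.browser.page_source
page_source = '''
<div id="page">
  <a href="/s?pn=10">2</a>
  <a href="/s?pn=20">3</a>
</div>
'''

res = etree.HTML(page_source)
hrefs = res.xpath('//div[@id="page"]/a/@href')
next_urls = ['https://www.baidu.com' + h for h in hrefs]
print(next_urls)
```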