Python Crawler: Downloading Daomu Biji (盗墓笔记) by Parsing HTML with bs4

mac, 2022-06-30

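Preface:

A recent assignment needed a crawler. The site I crawled for it was Lagou (拉勾网), which returns JSON, so I simply read the data as dictionaries.

This time I took the chance to get familiar with using bs4 to parse HTML responses as well.

I crawled a simple site: http://www.seputu.com

I studied https://www.cnblogs.com/insane-Mr-Li/p/9117005.html and then started on my own version; the basic principle is much the same.

It also reminded me of the countless plot holes in Daomu Biji that were never filled...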

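The main usage, for the record:

Inspecting the page elements shows that each chapter's link and name are stored inside an <li></li>.

So the first step is to find these <li></li> elements with bs4.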

    import requests
    from bs4 import BeautifulSoup

    url = 'http://www.seputu.com'
    response = requests.get(url)
    req_parser = BeautifulSoup(response.text, features="html.parser")  # <class 'bs4.BeautifulSoup'>
    li = req_parser.find_all('li')  # <class 'bs4.element.ResultSet'>
    # li = req_parser.findAll('li')  # equivalent to the previous line
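To try this step without hitting the site, here is a minimal sketch that runs find_all on a small inline HTML string. SAMPLE_HTML and its example.com links are made up for illustration; the real page's markup may well differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the index page, so no network request is needed.
SAMPLE_HTML = """
<ul>
  <li><a href="http://example.com/ch1.html">Chapter 1</a></li>
  <li><a href="http://example.com/ch2.html">Chapter 2</a></li>
  <li>no link here</li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, features="html.parser")
items = soup.find_all("li")

# find_all returns a ResultSet (a list subclass) whose elements are Tag objects.
print(type(items).__name__)
print(type(items[0]).__name__)
print(len(items))
```

Note that the third <li> has no <a> inside it, which is the case the later extraction code has to cope with.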

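Next, get the links and names. There are two ways to do it, and they differ only slightly:

1. Use the find method. li is a <class 'bs4.element.ResultSet'>, which has no find_all method of its own, but each element i is a <class 'bs4.element.Tag'>, so find can be called on it.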

    name_list = []
    href_list = []
    for i in li:
        try:
            href = i.find('a')['href']
            name = i.find('a').text
            name_list.append(name)
            href_list.append(href)
        except (TypeError, KeyError):
            # TypeError: this <li> has no <a>; KeyError: the <a> has no href
            pass

2. Convert li to a <class 'bs4.BeautifulSoup'> by re-parsing its string form, then keep using find_all to search within the li results.

    # Re-parse the string form of the result set so that find_all can be used on it
    temp = BeautifulSoup(str(li), features="html.parser")
    a = temp.find_all('a')
    name_list = []
    href_list = []
    for i in a:
        name = i.string
        href = i['href']
        name_list.append(name)
        href_list.append(href)
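As a third option, which is not what the code in this post uses, a CSS selector can collect the same links in one call without re-parsing. This is only a sketch on made-up HTML: SAMPLE_HTML and its example.com links are assumptions, and select needs a reasonably recent bs4 with its soupsieve dependency installed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the index page.
SAMPLE_HTML = """
<ul>
  <li><a href="http://example.com/ch1.html">Chapter 1</a></li>
  <li><a href="http://example.com/ch2.html">Chapter 2</a></li>
  <li><a>anchor without href</a></li>
  <li>no link here</li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, features="html.parser")

# "li a[href]" matches only <a> tags that sit inside an <li> and carry an href,
# so items without a usable link are skipped and no try/except is needed.
links = soup.select("li a[href]")
name_list = [a.get_text().strip() for a in links]
href_list = [a["href"] for a in links]

print(name_list)
print(href_list)
```

The two lists stay aligned by index, which is what the download step relies on.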

Here the content between <a></a> is read through the text or string attribute.

It can also be read with the findChildren method:

i.find('a').findChildren(text=True)[0]
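The text and string attributes are not always interchangeable, which the following sketch tries to show. The HTML snippet is made up for illustration; the point is only that string is None when a tag has more than one child, while text joins all the nested strings.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first link contains a nested <b>, the second is plain text.
html = (
    '<li><a href="#">Chapter <b>1</b> Title</a></li>'
    '<li><a href="#">Plain</a></li>'
)
soup = BeautifulSoup(html, features="html.parser")
nested, plain = soup.find_all("a")

# With several children there is no single string to return, so .string is None.
print(nested.string)
# .text concatenates every string found under the tag.
print(nested.text)
# With exactly one text child, the two agree.
print(plain.string, plain.text)
```

This is why code that reads string usually needs a None check, while code that reads text does not.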

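With the names and links in hand, the next step is to pull the text out of each link:

Inspecting where the text sits in the page, in the same way as before, shows that the novel text is all in the <p></p> elements of <div class="content-body">.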

    # page is the index of a chapter in href_list
    response = requests.get(href_list[page])
    req_parser = BeautifulSoup(response.content.decode('utf-8'), features="html.parser")
    div = req_parser.find_all('div', class_="content-body")
    # div = req_parser.find_all('div', {"class": "content-body"})  # equivalent to the previous line

After that, finding the p elements inside div works the same way as before, so I won't repeat it.
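For completeness, here is a rough sketch of that last step on an inline page. SAMPLE_PAGE is a made-up stand-in for a chapter page, and it searches the div Tag directly rather than re-parsing it, which is a small departure from the approach used elsewhere in this post.

```python
from bs4 import BeautifulSoup

# Hypothetical chapter page; only the content-body div holds novel text.
SAMPLE_PAGE = """
<div class="content-body">
  <p>First paragraph.</p>
  <p>Second <em>paragraph</em>.</p>
  <p></p>
</div>
<div class="footer"><p>Not part of the novel.</p></div>
"""

soup = BeautifulSoup(SAMPLE_PAGE, features="html.parser")
body = soup.find("div", class_="content-body")

paragraphs = []
if body is not None:  # find returns None when the div is missing
    for p in body.find_all("p"):
        text = p.get_text().strip()
        if text:  # skip empty <p></p> elements
            paragraphs.append(text)

print(paragraphs)
```

Using get_text here keeps paragraphs that contain nested tags such as <em>, which would be dropped if only string were read.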

Full code:

    # -*- coding: utf-8 -*-
    import requests
    from bs4 import BeautifulSoup

    url = 'http://www.seputu.com'
    response = requests.get(url)
    req_parser = BeautifulSoup(response.content.decode('utf-8'), features="html.parser")
    li = req_parser.find_all('li')
    # Re-parse the string form of the result set so that find_all can be used on it
    temp = BeautifulSoup(str(li), features="html.parser")
    a = temp.find_all('a')
    name_list = []
    href_list = []
    for i in a:
        name = i.string
        href = i['href']
        name_list.append(name)
        href_list.append(href)


    def download(page):
        response = requests.get(href_list[page])
        req_parser = BeautifulSoup(response.content.decode('utf-8'), features="html.parser")
        div = req_parser.find_all('div', class_="content-body")
        temp = BeautifulSoup(str(div), features="html.parser")
        paragraphs = temp.find_all('p')
        text = []
        for p in paragraphs:
            line = p.string
            if line is not None:
                # Drop characters a GBK console cannot display before printing
                print(line.encode('gbk', 'ignore').decode('gbk', 'ignore'))
                text.append(line)
        # Append the chapter title, then one paragraph per line
        with open('novel.txt', 'a+', encoding='utf-8') as f:
            f.write(name_list[page])
            f.write('\n')
            for line in text:
                f.write(line)
                f.write('\n')


    for i in range(len(href_list)):
        try:
            download(i)
        except Exception:
            # Skip any chapter that fails to download or parse
            pass
        print('%d is over' % i)
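The file-writing part of the script can be tried on its own, without any network access. The sketch below is an assumption-laden illustration: append_chapter is a hypothetical helper name, and it writes into a temporary directory rather than novel.txt so that running it leaves nothing behind.

```python
import os
import tempfile


def append_chapter(path, title, paragraphs):
    """Append one chapter (title line, then one paragraph per line) to path."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(title + "\n")
        for paragraph in paragraphs:
            f.write(paragraph + "\n")


with tempfile.TemporaryDirectory() as tmp_dir:
    novel_path = os.path.join(tmp_dir, "novel.txt")
    # Two calls in a row: append mode keeps the first chapter in place.
    append_chapter(novel_path, "Chapter 1", ["First line.", "Second line."])
    append_chapter(novel_path, "Chapter 2", ["Third line."])
    with open(novel_path, encoding="utf-8") as f:
        saved = f.read()

print(saved)
```

Because the file is opened in append mode, running the crawler twice adds every chapter a second time, so the output file is best deleted before a fresh run.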

The txt file I ended up with has over 9,000 lines.
