Scraping Sina News with Python: New Cars as an Example

mac2025-09-05 63

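When scraping Sina News, the page layout differs from topic to topic, so this post picks "new cars" (新车) as the topic and scrapes each article's title, publication time, link, full text, and author.

Crawl URL: http://auto.sina.com.cn/newcar/index.d.html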

The scraping code is as follows:

    # Part 1: scrape news titles, publication times, and links
    import requests
    from bs4 import BeautifulSoup

    # Note: the old reload(sys) trick from Python 2 is not needed in Python 3;
    # setting res.encoding below is what prevents garbled text.
    for i in range(0, 2):  # the first two list pages
        url = "http://auto.sina.com.cn/newcar/?page=" + str(i + 1)
        res = requests.get(url)
        res.encoding = 'utf-8'  # decode the response as UTF-8
        soup = BeautifulSoup(res.text, 'html.parser')
        # select() takes a CSS selector (class names, tags, etc.) and
        # returns the matching nodes as a list
        for new in soup.select('.s-left.fL.clearfix'):
            if len(new.select('h3')) > 0:
                # select() returns a list, hence the [0]; .text is the tag's text
                date = new.select('.time.fL')[0].text
                title = new.select('h3')[0].text
                href = new.select('a')[0]['href']
                print(str(date) + " " + title + " " + href)

    # Part 2: scrape the body of one article, using
    # "名爵6十周年版车型上市售价12.48万元" as the example
    info = requests.get('http://auto.sina.com.cn/newcar/x/2019-11-01/detail-iicezzrr6503390.shtml')
    info.encoding = 'utf-8'
    html = BeautifulSoup(info.text, 'html.parser')
    main_title = html.select('.main-title')[0].text  # headline
    date1 = html.select('.date')[0].text  # publication time
    print(date1 + " " + main_title)
    print("_" * 86)
    article = []
    for v in html.select('.article p'):
        article.append(v.text.strip())  # collect each paragraph, trimming whitespace
    article_text = '\n'.join(article)  # join the paragraphs with newlines
    print(article_text)
    # Remove the "责任编辑：" label as a literal prefix; lstrip() would treat
    # it as a set of characters and could also eat part of the name
    editor = html.select('.show_author')[0].text.strip()
    label = '责任编辑：'
    if editor.startswith(label):
        editor = editor[len(label):]
    print(editor)  # editor's name
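
A side note on trimming the "责任编辑：" label from the editor line: str.lstrip takes a set of characters rather than a prefix, so it can also remove the first characters of a name that happens to start with one of them. Below is a minimal sketch of a prefix-only helper. The helper name strip_label and the sample name are made up for illustration and do not come from the Sina page.

```python
def strip_label(text, label="责任编辑："):
    """Remove `label` only when it is a literal prefix of `text`."""
    text = text.strip()
    if text.startswith(label):
        return text[len(label):]
    return text


# Hypothetical sample string, not real page content.
sample = "责任编辑：任小明"

# lstrip drops every leading character found in the set {责, 任, 编, 辑, ：},
# so the surname 任 disappears along with the label.
print(sample.lstrip("责任编辑："))  # 小明
print(strip_label(sample))          # 任小明
```

On Python 3.9 and later, str.removeprefix does the same job as the helper.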

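Scraping results:

A few things to note:

(1) Click the spot marked with red circle 1 to inspect the page elements. Red circle 2 shows selecting the whole block you want to extract, including the link, time, and title; then analyze the corresponding elements in the Elements panel. (PS: at first I selected only the title, and I could never find the right container box!)

(2) On garbled text with BeautifulSoup in Python 3: I added the two lines import sys and import importlib, and the output came out fine in my tests. Note, though, that these imports alone do not change any encoding; what actually controls decoding in the code above is setting res.encoding = 'utf-8' before reading res.text.

(3) The post at https://blog.csdn.net/qq_33722172/article/details/82469050 explains things in great detail, including how to analyze the page step by step. However, I noticed that many bloggers use "world" as the keyword for their hands-on walkthrough. I suggest picking a different word: if you are new to scraping, try it yourself and you will find that your ability to analyze pages really does improve!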
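
To make note (2) concrete: garbled Chinese text usually means the bytes were decoded with the wrong charset. The sketch below reproduces that offline with a hard-coded string (an assumption for illustration; no request is made). ISO-8859-1 stands in for the wrong charset because requests falls back to it for text responses whose headers declare no charset.

```python
# Sample text standing in for a page title (hypothetical, not fetched).
title = "新车上市"

raw = title.encode("utf-8")          # bytes as a UTF-8 server would send them

garbled = raw.decode("iso-8859-1")   # wrong charset: produces mojibake
fixed = raw.decode("utf-8")          # right charset: original text

print(garbled == title)  # False
print(fixed == title)    # True

# The damage is reversible as long as no bytes were lost along the way.
recovered = garbled.encode("iso-8859-1").decode("utf-8")
print(recovered == title)  # True
```

In the scraper itself, this corresponds to setting res.encoding = 'utf-8' before reading res.text.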
