A crawler is the key that opens the door to data collection: writing your first web crawler

mac · 2025-11-08

A crawler is the key that opens the door to data collection.

Step 1: Fetch the page

# coding: utf-8
"""Step 1: Fetch the page"""
import requests

link = "http://www.santostang.com/"
headers = {"User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6"}
r = requests.get(link, headers=headers)
print(r.text)

For reasons of space, here is only part of the returned HTML:

<!DOCTYPE html>
<html lang="zh-CN">
<head>
<meta charset="UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1">
<title>Santos Tang</title>
<meta name="description" content="Python网络爬虫:从入门到实践 官方网站及个人博客" />
<meta name="keywords" content="Python网络爬虫, Python爬虫, Python, 爬虫, 数据科学, 数据挖掘, 数据分析, santostang, Santos Tang, 唐松, Song Tang" />
<link rel="apple-touch-icon" href="http://www.santostang.com/wp-content/themes/SongStyle-Two/images/icon_32.png">
<link rel="apple-touch-icon" sizes="152x152" href="http://www.santostang.com/wp-content/themes/SongStyle-Two/images/icon_152.png">
<link rel="apple-touch-icon" sizes="167x167" href="http://www.santostang.com/wp-content/themes/SongStyle-Two/images/icon_167.png">
<link rel="apple-touch-icon" sizes="180x180" href="http://www.santostang.com/wp-content/themes/SongStyle-Two/images/icon_180.png">
<link rel="icon" href="http://www.santostang.com/wp-content/themes/SongStyle-Two/images/icon_32.png" type="image/x-icon">
<link rel="stylesheet" href="http://www.santostang.com/wp-content/themes/SongStyle-Two/css/bootstrap.min.css">
<link rel="stylesheet" href="http://www.santostang.com/wp-content/themes/SongStyle-Two/css/fontawesome.min.css">
<link rel="stylesheet" href="http://www.santostang.com/wp-content/themes/SongStyle-Two/style.css">
<link rel="pingback" href="http://www.santostang.com/xmlrpc.php" />
<style type="text/css">
a{color:#1e73be}
a:hover{color:#2980b9!important}
#header{background-color:#1e73be}
.widget .widget-title::after{background-color:#1e73be}
.uptop{border-left-color:#1e73be}
#titleBar .toggle:before{background:#1e73be}
</style>
</head>
<body>
<header id="header">
<div class="avatar"><a href="http://www.santostang.com" title="Santos Tang"><img src="http://www.santostang.com/wp-content/uploads/2019/06/me.jpg" alt="Santos Tang" class="img-circle" width="50%"></a></div>
<h1 id="name">Santos Tang</h1>
<div class="sns">
<a href="https://weibo.com/santostang" target="_blank" rel="nofollow" data-toggle="tooltip" data-placement="top" title="Weibo"><i class="fab fa-weibo"></i></a>
<a href="https://www.linkedin.com/in/santostang" target="_blank" rel="nofollow" data-toggle="tooltip" data-placement="top" title="Linkedin"><i class="fab fa-linkedin"></i></a>
<a href="https://www.zhihu.com/people/santostang" target="_blank" rel="nofollow" data-toggle="tooltip" data-placement="top" title="Zhihu"><i class="fab fa-zhihu"></i></a>
<a href="https://github.com/santostang" target="_blank" rel="nofollow" data-toggle="tooltip" data-placement="top" title="GitHub"><i class="fab fa-github-alt"></i></a>
</div>
<div class="nav">
<ul>
<li><a href="http://www.santostang.com/">首页</a></li>
<li><a href="http://www.santostang.com/aboutme/">关于我</a></li>
<li><a href="http://www.santostang.com/python%e7%bd%91%e7%bb%9c%e7%88%ac%e8%99%ab%e4%bb%a3%e7%a0%81/">爬虫书代码</a></li>
<li><a href="http://www.santostang.com/%e5%8a%a0%e6%88%91%e5%be%ae%e4%bf%a1/">加我微信</a></li>
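One caveat about the fetching step: requests.get() does not raise an error on a 404 page, and without a timeout it can hang indefinitely if the site stalls. Below is a minimal, more defensive sketch of the same request; the fetch() helper name is my own, and the URL and headers are the ones from the code above.

```python
import requests

def fetch(link, timeout=10):
    """Fetch a page and return its text, or None on any network/HTTP error."""
    headers = {"User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; "
                             "rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6"}
    try:
        r = requests.get(link, headers=headers, timeout=timeout)
        r.raise_for_status()  # turn 4xx/5xx status codes into exceptions
        return r.text
    except requests.RequestException:
        # Covers timeouts, DNS failures, refused connections, and bad statuses
        return None
```

With this helper, fetch("http://www.santostang.com/") returns None instead of crashing when the site is unreachable, so a long crawl can skip bad pages and keep going.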

Step 2: Extract the data you need

# coding: utf-8
"""Step 2: Extract the data you need"""
import requests
from bs4 import BeautifulSoup  # import BeautifulSoup from the bs4 package

link = "http://www.santostang.com/"
headers = {"User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6"}
r = requests.get(link, headers=headers)

soup = BeautifulSoup(r.text, "lxml")  # parse the page with BeautifulSoup
title = soup.find("h1", class_="post-title").a.text.strip()
print(title)

Output:

第四章 – 4.3 通过selenium 模拟浏览器抓取
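Note that find() returns only the first match, which is why the script prints a single title; to collect every headline on the page you would use find_all() instead. Here is a sketch on a hand-written HTML fragment: the h1.post-title markup below imitates the blog's structure but is my own approximation, and html.parser is the built-in parser you can fall back on if lxml is not installed.

```python
from bs4 import BeautifulSoup

# Hand-written fragment imitating the blog's post list (structure assumed)
html = """
<article><h1 class="post-title"><a href="/p1">第四章 - 4.3 通过selenium 模拟浏览器抓取</a></h1></article>
<article><h1 class="post-title"><a href="/p2">第四章 - 4.2 解析真实地址抓取</a></h1></article>
"""

soup = BeautifulSoup(html, "html.parser")  # stdlib parser, no lxml required
# find_all() returns every matching tag, not just the first one
titles = [h1.a.text.strip() for h1 in soup.find_all("h1", class_="post-title")]
print(titles)
```

The same one-line change (find → find_all plus a loop) turns the single-title script above into one that scrapes the whole front page.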

Step 3: Store the data to a file. Still remember Python file operations? No worries if you don't.

# coding: utf-8
"""Step 1: Fetch the page"""
import requests
from bs4 import BeautifulSoup  # import BeautifulSoup from the bs4 package

link = "http://www.santostang.com/"
headers = {"User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6"}
r = requests.get(link, headers=headers)

"""Step 2: Extract the data you need"""
soup = BeautifulSoup(r.text, "lxml")  # parse the page with BeautifulSoup
title = soup.find("h1", class_="post-title").a.text.strip()
print(title)

"""Step 3: Store the data"""
with open("title.txt", "a+", encoding="utf-8") as f:
    f.write(title)  # title.txt is created in the same folder as your Python script
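A caveat on the "a+" mode used above: it appends, so running the script twice writes the title twice, and without a newline the copies run together on one line. A small sketch that writes one title per line and reads them back; the save_title/load_titles helper names are my own.

```python
def save_title(title, path="title.txt"):
    # Append mode keeps earlier results; the newline keeps runs separated
    with open(path, "a", encoding="utf-8") as f:
        f.write(title + "\n")

def load_titles(path="title.txt"):
    # Read the stored titles back, skipping blank lines
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]
```

Calling save_title(title) on each crawled page leaves a clean one-title-per-line file, and load_titles() returns the collected titles as a list for the next stage of analysis.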

Output: open title.txt and you will see that the title has been saved. There it is! Getting started with web crawling isn't that hard after all, is it? In later posts I'll take you from getting started, to diving in deep, to mastery. Follow me, and give thinking a thumbs-up!
