Scraping CSDN Links with a Web Crawler

mac2025-12-08 11

Scraping with regular expressions and the urllib library

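I am a beginner: this is my first blog post and also my first web scraper. It collects some of the URLs from the home page (feel free to ignore…). This is just a short record of the process.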

The re library

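The re library (regular expressions) is a handy Python 3 module for matching text; the reference linked here covers its syntax. One important point is to tell apart the roles of ( ), [ ], and { } in a regular expression. Link: https://www.cnblogs.com/langren1992/p/9782191.html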
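To make the difference between the three kinds of brackets concrete, here is a minimal sketch. The sample strings are made up for illustration and do not come from the scraped page.

```python
import re

# Hypothetical sample text, used only for this demonstration.
sample = "posted 2019-11-02 by user_42"

# [ ] is a character class: it matches ONE character out of the listed set.
digits = re.findall(r"[0-9]", "a1b22")

# { } is a quantifier: it says how many times the preceding item must repeat.
four_digit_runs = re.findall(r"[0-9]{4}", sample)

# ( ) is a capturing group: with groups, findall returns the captured parts
# (as tuples when there is more than one group) instead of the whole match.
year_month = re.findall(r"([0-9]{4})-([0-9]{2})", sample)

print(digits)           # ['1', '2', '2']
print(four_digit_runs)  # ['2019']
print(year_month)       # [('2019', '11')]
```

In short: [ ] chooses which characters are allowed, { } chooses how many, and ( ) chooses what to extract.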

Sourced from the internet; will be removed on request.

Using urllib.request

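urllib.request gives Python a fairly complete set of tools for fetching web pages. urllib.request.Request(url, data) builds a request for the given URL; data is the request body (bytes, which turns the request into a POST), while HTTP headers are passed through the separate headers argument. It accepts other parameters as well, but I have not used them, so I will not cover them here. urllib.request.urlopen(request) sends the request and returns an http.client.HTTPResponse object, so we call its read() method to get the raw bytes. Those bytes then have to be decoded. Most pages use "utf-8", but not all; in those cases the Python chardet library can be used to detect the encoding.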

    import urllib.request

    request = urllib.request.Request(targetUrl)
    response = urllib.request.urlopen(request)  # http.client.HTTPResponse
    content = response.read()                   # raw bytes
    content = content.decode("utf-8")           # bytes -> str
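The request and decoding steps can be sketched a little more defensively. This is only an illustration under assumptions: the helper names build_request, decode_body, and fetch are hypothetical, the "Mozilla/5.0" User-Agent value and the "gbk" fallback are guesses rather than anything the site is known to require, and the fallback list stands in for chardet so the sketch needs only the standard library.

```python
import urllib.request

def build_request(url, user_agent="Mozilla/5.0"):
    # headers= carries HTTP headers; data= (not used here) would be the
    # request body as bytes and would turn the request into a POST.
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

def decode_body(raw, declared=None, fallbacks=("utf-8", "gbk")):
    # Try the charset the server declared first, then each fallback in turn.
    candidates = ([declared] if declared else []) + list(fallbacks)
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding, try the next one
    # Last resort: keep going, replacing undecodable bytes.
    return raw.decode("utf-8", errors="replace")

def fetch(url):
    # Send the request, read the raw bytes, and decode them to str.
    with urllib.request.urlopen(build_request(url)) as response:
        raw = response.read()
        declared = response.headers.get_content_charset()
    return decode_body(raw, declared)
```

With this in place, a call such as fetch("https://blog.csdn.net/nav/python") would return the page as a string, provided the server accepts the request.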

Final part

To save the scraped content to text files, I used the open function. The setup:

    import re
    import urllib.request

    url = "https://blog.csdn.net/nav/python"
    href = open("href", "w+")  # output file for the extracted links
    html = open("html", "r")   # saved page source, read as input

Next, I looked for the URLs we actually want and found that most of them look like this:

https://blog.csdn.net/qq_37338761/article/details/102824008

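These are the URLs we want, so the regular expression can be written as 'https?://blog\.csdn\.net/\w+?/article/details/[0-9]+'. The dots are escaped so that they match a literal "." rather than any character.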

    def get_href(html):
        # Read the saved page and write every matching article URL to href, one per line.
        content = html.read()
        link = re.compile(r'https?://blog\.csdn\.net/\w+?/article/details/[0-9]+').findall(content)
        for i in link:
            href.write(i)
            href.write("\n")
        return href

re.compile() defines the pattern to match, so that the URLs we want can be picked out of the scraped HTML; findall() scans the whole HTML and returns every non-overlapping match.
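A quick way to check the pattern before running it on a real page is to try it on a small hand-written snippet. The HTML below is a made-up sample, and some_user is a hypothetical account name.

```python
import re

# Same pattern as in the text, with the dots escaped.
ARTICLE_URL = re.compile(r"https?://blog\.csdn\.net/\w+?/article/details/[0-9]+")

# Hypothetical HTML fragment, not taken from the real page.
sample_html = """
<a href="https://blog.csdn.net/qq_37338761/article/details/102824008">post</a>
<a href="https://blog.csdn.net/nav/python">not an article</a>
<a href="http://blog.csdn.net/some_user/article/details/1">old post</a>
"""

# findall returns the full matches because the pattern has no capturing group.
links = ARTICLE_URL.findall(sample_html)
for link in links:
    print(link)
```

The navigation link is skipped because it has no /article/details/ segment, while both the http and https article links are kept.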

Final code

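    import re
    import urllib.request

    url = "https://blog.csdn.net/nav/python"

    # Download the page and save it to "html" so it can be read back below.
    response = urllib.request.urlopen(urllib.request.Request(url))
    with open("html", "w", encoding="utf-8") as f:
        f.write(response.read().decode("utf-8"))

    href = open("href", "w+", encoding="utf-8")
    html = open("html", "r", encoding="utf-8")

    def get_href(html):
        # Write every article URL found in the page to href, one per line.
        content = html.read()
        link = re.compile(r'https?://blog\.csdn\.net/\w+?/article/details/[0-9]+').findall(content)
        for i in link:
            href.write(i)
            href.write("\n")
        return href

    get_href(html)
    html.close()
    href.close()  # flush the links to disk before reading them back

    def clear_href():
        # Rewrite "href" without duplicate URLs, keeping first-seen order.
        with open("href", "r", encoding="utf-8") as f:
            content = f.read().split("\n")
        unique = []  # renamed from `list`, which shadowed the built-in
        for i in content:
            if i and i not in unique:
                unique.append(i)
        with open("href", "w", encoding="utf-8") as f:
            for i in unique:
                f.write(i + "\n")

    clear_href()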
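The de-duplication step can also be written more compactly. This is an alternative sketch, not the approach used above: it relies on dictionaries preserving insertion order (Python 3.7 and later), and unique_in_order is a hypothetical helper name.

```python
def unique_in_order(lines):
    # dict.fromkeys keeps only the first occurrence of each key and preserves
    # insertion order; empty strings (e.g. from a trailing newline) are dropped.
    return list(dict.fromkeys(line for line in lines if line))

# Hypothetical input, shaped like the result of splitting the "href" file on "\n".
raw_lines = [
    "https://blog.csdn.net/qq_37338761/article/details/102824008",
    "https://blog.csdn.net/qq_37338761/article/details/102824008",
    "",
]
print(unique_in_order(raw_lines))
```

Compared with the "if i not in list" loop, this avoids rescanning the list for every line, which only matters once the number of links gets large.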

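That is the complete code. It did not take long to write, so I am recording it briefly here so that I can recall it later, and I hope others get something out of it too.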
