Data Parsing

mac2022-06-30 156

Data parsing workflow

1. Specify the URL
2. Send the request
3. Get the page data
4. Parse the data
5. Persist the results to storage
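The five steps above can be sketched as one small function. This is only an illustration: `run_pipeline`, `fake_fetch`, the `<h1>` pattern and the example.com URL are all made up here, and `fake_fetch` stands in for a real call such as `requests.get(url).text` so that the sketch runs without network access.

```python
import re
import tempfile
from pathlib import Path


def run_pipeline(url, fetch, out_path):
    """Run steps 2-5 for the URL chosen in step 1.

    `fetch` is any callable that takes a URL and returns the page text.
    """
    page_text = fetch(url)                                    # steps 2-3: request, get page data
    titles = re.findall(r'<h1>(.*?)</h1>', page_text, re.S)   # step 4: parse
    Path(out_path).write_text('\n'.join(titles), encoding='utf-8')  # step 5: persist
    return titles


def fake_fetch(url):
    # Hypothetical stand-in for requests.get(url).text
    return '<html><h1>hello world</h1><h1>second</h1></html>'


out_file = Path(tempfile.mkdtemp()) / 'titles.txt'
titles = run_pipeline('https://example.com/', fake_fetch, out_file)
print(titles)
```

Passing the fetch function in as an argument keeps the parsing and storage steps testable on their own.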

Three ways to parse data: regular expressions, xpath, bs4

Regular expressions

import re

# Extract 'python'
key = 'javapython-php'
re.findall('python', key)
re.findall('python', key)[0]

# Extract 'hello world'
key = '<html><h1>hello world</h1></html>'
re.findall('<h1>(.*?)</h1>', key)[0]

# Extract 170
string = '我喜欢身高170的女生'
re.findall(r'\d+', string)[0]

# Extract 'http://' and 'https://'
key = 'http://www.baidu.com and https://bobo.com'
# Option 1: ? matches the preceding character zero or one time
re.findall('https?://', key)
# Option 2
re.findall('https{0,1}://', key)

# Extract 'hit.'
key = 'bobo@hit.edu.com'
re.findall(r'h.*\.', key)   # ['hit.edu.']: greedy mode matches as much as possible
re.findall(r'h.*?\.', key)  # ['hit.']: adding a ? switches to non-greedy mode

# Match 'sas' or 'saas' (a{1,2} allows one or two a's, so 'saaas' is not matched)
key = 'saas and sas saaas'
re.findall('sa{1,2}s', key)

# Match the lines that start with 'i'
# re.S: '.' also matches newlines, so the text is treated as a single line
# re.M: '^' and '$' match at the start and end of every line
key = '''fall in love with you
i love you very much
i love you
i love you
'''
re.findall('^i.*', key, re.M)

# Match across all the lines
key = '''<div>静夜思
窗前明月光
疑是地上霜
举头望明月
低头思故乡
</div>'''
re.findall('<div>.*</div>', key, re.S)
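The two flags re.S and re.M are easy to mix up, so here is a small side-by-side sketch. The three-line `text` string is made up for illustration; the same two patterns are run with and without each flag.

```python
import re

text = "first line\ni am here\nlast line"

# Without flags, '^' only matches at the very start of the string.
no_flags = re.findall(r'^i.*', text)

# With re.M, '^' also matches at the start of each line.
multi_line = re.findall(r'^i.*', text, re.M)

# Without flags, '.' stops at a newline, so the pattern cannot span lines.
plain = re.findall(r'first.*last', text)

# With re.S, '.' matches newlines too, so the pattern can span lines.
dot_all = re.findall(r'first.*last', text, re.S)

print(no_flags)    # []
print(multi_line)  # ['i am here']
print(plain)       # []
print(dot_all)     # ['first line\ni am here\nlast']
```

In short, re.M changes what `^` and `$` mean, while re.S changes what `.` means.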

Exercise

import requests
import re
import os

# Specify the URL
url = 'https://www.qiushibaike.com/pic/'
# Custom request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
}
# Send the request
response = requests.get(url=url, headers=headers)
# Get the page data
page_text = response.text
# Parse the data: capture the src of the <img> inside each <div class="thumb">
img_list = re.findall('<div class="thumb">.*?<img src="(.*?)".*?>.*?</div>', page_text, re.S)

# Create a folder to store the images
if not os.path.exists('./imgs'):
    os.makedirs('./imgs')

for src in img_list:
    # Prepend the scheme to build the full image URL
    img_url = 'https:' + src
    # .content returns the response body as bytes
    img_data = requests.get(url=img_url, headers=headers).content
    # Use the last path segment as the file name
    imgName = src.split('/')[-1]
    imgPath = 'imgs/' + imgName
    with open(imgPath, 'wb') as fp:
        fp.write(img_data)
    print(imgName + ' written successfully')
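The parsing step of the exercise can be tried offline before sending any requests. The sketch below runs the same pattern against a hand-written snippet. The markup and the pic.example.com host are assumptions made for illustration, not the real page.

```python
import re

# Hypothetical markup that imitates the structure the pattern expects
sample_html = '''
<div class="thumb">
<a href="/article/1" target="_blank">
<img src="//pic.example.com/system/pictures/1/medium/app1.jpg" alt="sample">
</a>
</div>
<div class="thumb">
<a href="/article/2" target="_blank">
<img src="//pic.example.com/system/pictures/2/medium/app2.jpg" alt="sample">
</a>
</div>
'''

pattern = r'<div class="thumb">.*?<img src="(.*?)".*?>.*?</div>'

# re.S lets '.' cross the line breaks between the <div> and the <img>
srcs = re.findall(pattern, sample_html, re.S)

# Same post-processing as in the exercise: full URL and file name
img_urls = ['https:' + src for src in srcs]
img_names = [src.split('/')[-1] for src in srcs]

print(img_urls)
print(img_names)
```

Because the pattern has exactly one capture group, `re.findall` returns just the captured `src` values, not the whole matched `<div>` blocks.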

Reposted from: https://www.cnblogs.com/yuliangkaiyue/p/10001544.html
