Documentation to read:
Requests: http://cn.python-requests.org/zh_CN/latest/user/quickstart.html
BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/
Notes:
Crawler entry URL: http://www.cninfo.com.cn/cninfo-new/information/companylist
Scrape targets: company code, company name, and company announcement URL.
The complete Python 3.5 code is as follows:
import codecs
import csv
import requests
from bs4 import BeautifulSoup

def getHTML(url):
    # Fetch the page and return its HTML text.
    r = requests.get(url)
    return r.text

def parseHTML(html):
    # Extract [company info, announcement URL] pairs from the company list page.
    soup = BeautifulSoup(html, 'html.parser')
    body = soup.body
    company_middle = body.find('div', attrs={'class': 'middle'})
    company_list_ct = company_middle.find('div', attrs={'class': 'list-ct'})
    company_list = []
    for company_ul in company_list_ct.find_all('ul', attrs={'class': 'company-list'}):
        for company_li in company_ul.find_all('li'):
            company_url = company_li.a['href']
            company_info = company_li.get_text()
            company_list.append([company_info, company_url])
    return company_list

def writeCSV(file_name, data_list):
    # Write each [info, url] pair as one CSV row.
    with codecs.open(file_name, 'w') as f:
        writer = csv.writer(f)
        for data in data_list:
            writer.writerow(data)

URL = 'http://www.cninfo.com.cn/cninfo-new/information/companylist'
html = getHTML(URL)
data_list = parseHTML(html)
writeCSV('test.csv', data_list)
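The parsing logic can be exercised offline against a small HTML snippet that mimics the structure the code assumes (a div.middle containing a div.list-ct containing ul.company-list elements). The snippet and its href values below are fabricated for illustration, not copied from the real site:

```python
from bs4 import BeautifulSoup

# Fabricated snippet mirroring the structure parseHTML expects.
SAMPLE = '''
<html><body>
<div class="middle">
  <div class="list-ct">
    <ul class="company-list">
      <li><a href="/information/companyinfo.html#000001">000001 Ping An Bank</a></li>
      <li><a href="/information/companyinfo.html#000002">000002 Vanke A</a></li>
    </ul>
  </div>
</div>
</body></html>
'''

def parseHTML(html):
    soup = BeautifulSoup(html, 'html.parser')
    company_middle = soup.body.find('div', attrs={'class': 'middle'})
    company_list_ct = company_middle.find('div', attrs={'class': 'list-ct'})
    company_list = []
    for company_ul in company_list_ct.find_all('ul', attrs={'class': 'company-list'}):
        for company_li in company_ul.find_all('li'):
            company_list.append([company_li.get_text(), company_li.a['href']])
    return company_list

rows = parseHTML(SAMPLE)
print(rows[0])  # ['000001 Ping An Bank', '/information/companyinfo.html#000001']
```

Checking the extraction this way avoids hitting the network while debugging the selectors.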
Running the script generates a test.csv file in the folder containing the code file; the original post included a partial screenshot of its contents here.
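To confirm what ended up in the file, the CSV can be read back with the same csv module. A minimal stdlib-only sketch with sample rows made up for illustration (note that on Python 3, csv prefers files opened with newline=''):

```python
import csv
import os
import tempfile

# Sample rows standing in for the scraped data.
rows = [['000001 Ping An Bank', 'http://example.com/a'],
        ['000002 Vanke A', 'http://example.com/b']]

path = os.path.join(tempfile.gettempdir(), 'test_readback.csv')

# Write, then read back and compare.
with open(path, 'w', newline='', encoding='utf-8') as f:
    csv.writer(f).writerows(rows)
with open(path, 'r', newline='', encoding='utf-8') as f:
    readback = list(csv.reader(f))

print(readback == rows)  # True
```

Opening with an explicit encoding also avoids mojibake for Chinese company names on platforms whose default encoding is not UTF-8.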
Original article: https://zhuanlan.zhihu.com/p/21452812
Reposted from: https://www.cnblogs.com/fanren224/p/8457236.html