http://www.pss-system.gov.cn/sipopublicsearch/portal/uilogin-forwardLogin.shtml
The key to logging into a site like this is recognizing its CAPTCHA. So how do we do that? Let's start with the page source. The CAPTCHA is rendered by downloading an image whose source is src=/sipopublicsearch/portal/login-showPic.shtml. A Fiddler capture confirms this: the browser requests that image URL, receives a JPEG, and displays it.

In Scrapy, the first step is to download the image locally so we can recognize it:

```python
import requests

def parse(self, response):
    # Extract the CAPTCHA image's src attribute and resolve it to a full URL
    ret = response.xpath('//*[@id="codePic"]/@src').extract()
    image_source = ret[0]
    image_url = response.urljoin(image_source)
    # Download the image and save it to disk
    r = requests.get(image_url)
    with open('E://scrapy_project/image2.JPEG', 'wb') as code:
        code.write(r.content)
```

We extract the value of src, then use requests to download and save the image. Opening the saved file shows the CAPTCHA.

The next step is recognizing the CAPTCHA in the image, which requires pytesseract and the PIL library.

First install Tesseract-OCR (download it from the web and run the installer). The default install path is C:\Program Files\Tesseract-OCR; add that path to the PATH variable in the system properties. Then install pytesseract and PIL via pip. Here is how to use them:

```python
from PIL import Image
from pytesseract import image_to_string

im = Image.open('E:\\scrapy_project\\image2.JPEG')
# convert() returns a new grayscale image; it must be reassigned
im = im.convert('L')
ret = image_to_string(im, config='-psm 7')
print(ret)
```

The result: the CAPTCHA in the image has been recognized.

image_to_string should be configured with -psm N. The options for N are explained below; we generally choose mode 7:

```
-psm N
Set Tesseract to only run a subset of layout analysis and assume a
certain form of image. The options for N are:
 0 = Orientation and script detection (OSD) only.
 1 = Automatic page segmentation with OSD.
 2 = Automatic page segmentation, but no OSD, or OCR.
 3 = Fully automatic page segmentation, but no OSD. (Default)
 4 = Assume a single column of text of variable sizes.
 5 = Assume a single uniform block of vertically aligned text.
 6 = Assume a single uniform block of text.
 7 = Treat the image as a single text line.
 8 = Treat the image as a single word.
 9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
```

Output:

```
E:\python2.7.11\python.exe E:/py_prj/test3.py
8227
```
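The grayscale conversion (`convert('L')`) is usually followed by binarization, which maps each pixel to pure black or white so Tesseract sees crisp glyphs instead of anti-aliased noise. In practice you would do this with Pillow's `Image.point`; the pure-Python sketch below (the `binarize` function and its threshold default are illustrative, not from the original post) just shows the thresholding idea on raw pixel values:

```python
def binarize(pixels, threshold=128):
    """Map 8-bit grayscale values to pure black (0) or white (255).

    pixels: a 2D list of ints in 0..255, e.g. the values from
    list(im.getdata()) reshaped by row. Values at or above the
    threshold become white; the rest become black.
    """
    return [[255 if p >= threshold else 0 for p in row] for row in pixels]

# A tiny 2x3 "image": bright background pixels around dark glyph pixels.
noisy = [[200, 40, 180],
         [30, 220, 50]]
print(binarize(noisy))
# → [[255, 0, 255], [0, 255, 0]]
```

With Pillow the equivalent one-liner is `im.point(lambda p: 255 if p >= 128 else 0)` applied after `convert('L')`; a good threshold depends on the CAPTCHA's colors, so 128 is only a starting point.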
Reposted from: https://www.cnblogs.com/zhanghongfeng/p/8325340.html