Running into Baidu Yunjiasu (Baidu Cloud Acceleration): a quick fix when page content cannot be crawled

Updated: 2019-02-01 Views: 4.6k Tags: crawler, sharing

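While crawling a site, I found that it sits behind Baidu Yunjiasu: every visit to the home page asks for a CAPTCHA before the page will open. I did not use any of the automatic CAPTCHA-image recognition approaches found online. The New Year holiday was close, I did not feel like pip-installing anything extra, and I wanted a quick fix so I could go home.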

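After analyzing the site, I found that once you have a currently valid set of cookies, you can keep crawling data without triggering the Baidu verification prompt.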

The code is below. (Note: the URL and cookies in the code are fake. To use the code, substitute your own URL and cookies.)

import json
from datetime import datetime

import requests
from scrapy.selector import Selector

# One session, so every request sends the same cookie and user-agent headers.
s = requests.Session()
s.headers.update({
    # Paste the Cookie header copied from your own browser session here.
    'cookie': '__cfduid=134343474e8d3f723cae541fb7d7f6b01f1546501720; _ga=GA1.2.573376275.1546501778; _gid=GA1.2.543022193.1549014020; cf_clearance=b19851c48ae560c62485879ac37a257a3f12df1e-1549086155-1800-250; ',
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/536.34 (Khtml, like Gecko) Chrome/71.0.3578.98 Safari/537.34',
})

list_url = 'https://www.samle.com/news/page/2/'
res = s.get(list_url)
hxs = Selector(text=res.text)
# print(res.text)
datePub = hxs.xpath('//main[@class="content"]//time/text()').extract()
# print(datePub)
links = hxs.xpath('//main[@class="content"]//h2/a')
for index, link in enumerate(links):
    # Date text as it appears on the list page. To normalize it to
    # '%Y-%m-%d', parse it with datetime.strptime using the site's own format.
    item_pubDateStr = datePub[index].strip() if index < len(datePub) else ''

    url = ''.join(link.xpath('./@href').extract())
    item_res = s.get(url)
    item_hxs = Selector(text=item_res.text)
    item_title = item_hxs.xpath("//h2/text()").extract()
    item_content = item_hxs.xpath("//main//div[@class='econtent']/p//text()").extract()

    # Each value is wrapped in a one-element list; only the first text node
    # of the content is kept.
    result = {
        'linkAddress': [url],
        'title': [item_title[0] if item_title else ''],
        'datePublish': [item_pubDateStr],
        'content': [item_content[0] if item_content else ''],
    }

    # One file per article, named after the current timestamp.
    filename = datetime.now().strftime('%Y%m%d%H%M%S%f') + ".txt"
    with open(filename, 'w', encoding='utf-8') as f:
        # json.dumps escapes quotes and newlines inside the scraped text.
        f.write(json.dumps(result, ensure_ascii=False))
    print(item_title)
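The approach depends on the cookie still being accepted, so it can help to notice when a response is the verification page rather than real content. Below is a minimal sketch of such a check. The function name `looks_like_challenge`, the status codes, and the marker string are all assumptions made for illustration and do not come from the site or from Baidu Yunjiasu documentation. Look at what your own target returns when the cookie is rejected and adjust them.

```python
from typing import Iterable

# Assumption: placeholder values. Replace them with what your target site
# actually returns when it shows the verification page.
CHALLENGE_STATUS_CODES = (403, 503)
CHALLENGE_MARKERS = ('captcha',)


def looks_like_challenge(status_code: int, body: str,
                         markers: Iterable[str] = CHALLENGE_MARKERS) -> bool:
    """Return True when a response looks like a verification page."""
    if status_code in CHALLENGE_STATUS_CODES:
        return True
    lowered = body.lower()
    # Case-insensitive search for any marker text in the page body.
    return any(marker.lower() in lowered for marker in markers)
```

In the crawl loop you could call `looks_like_challenge(item_res.status_code, item_res.text)` after each request and stop to copy a fresh cookie from the browser when it returns True.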

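How to get a currently valid set of cookies:

Open Chrome and open Developer Tools (press F12).

Visit the site, entering the CAPTCHA if it asks for one.

Then look in the "Network" tab of Developer Tools and find the cookies sent with the request.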
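The value you copy from the Network tab is one long `name=value; name=value` string. The script above sends it as a raw `cookie` header, which works. If you would rather pass it through the `cookies=` argument of `requests`, a small parser like the sketch below turns the string into a dict. `parse_cookie_header` is a hypothetical helper written for this example, not part of `requests`.

```python
def parse_cookie_header(raw: str) -> dict:
    """Split a copied Cookie header value into a {name: value} dict."""
    cookies = {}
    for part in raw.split(';'):
        part = part.strip()
        # Skip empty pieces (e.g. after a trailing ';') and malformed ones.
        if not part or '=' not in part:
            continue
        # Split on the first '=' only, because values may contain '='.
        name, value = part.split('=', 1)
        cookies[name.strip()] = value.strip()
    return cookies
```

For example, `requests.get(url, cookies=parse_cookie_header(copied_string))` sends the same cookies, where `copied_string` stands for whatever you pasted from the browser.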


Link: https://fly63.com/article/detial/1950
