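I am trying to build a tool that scrapes a Google Developer Console account. When I run the spider, it appears to log in successfully and the log looks fine. But when I then request another page and write response.body to a file, I get the following (response.html):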
<!DOCTYPE html><html><head><title>Redirecting...</title><script type="text/javascript" language="javascript">var url = 'https:\/\/accounts.google.com\/ServiceLogin?service\x3dandroiddeveloper\x26passive\x3d1209600\x26continue\x3dhttps:\/\/play.google.com\/apps\/publish\/?dev_acc%3D14813004207305910035%23__HASH__\x26followup\x3dhttps:\/\/play.google.com\/apps\/publish\/?dev_acc%3D14813004207305910035'; var fragment = ''; if (self.document.location.hash) {fragment = self.document.location.hash.replace(/^#/,'');}url = url.replace(new RegExp("__HASH__", 'g'), encodeURIComponent(fragment));window.location.assign(url);</script><noscript><meta http-equiv="refresh" content="0; url='https://accounts.google.com/ServiceLogin?service=androiddeveloper&passive=1209600&continue=https://play.google.com/apps/publish/?dev_acc%3D14813004207305910035&followup=https://play.google.com/apps/publish/?dev_acc%3D14813004207305910035'"></meta></noscript></head><body></body></html>
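So, as I understand it, this is a bare HTML page with no body and the title "Redirecting...".
I assumed the spider was crawling the page before it had finished loading. After some research I tried adding meta={'handle_httpstatus_list': [302], 'dont_redirect': True} to the Request, but it made no difference.
Here is my spider: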
from scrapy.http import FormRequest, Request
import logging
import scrapy


class LoginSpider(scrapy.Spider):
    name = 'super'
    start_urls = ['https://accounts.google.com/ServiceLogin?service=androiddeveloper&passive=1209600&continue=https://play.google.com/apps/publish/%23&followup=https://play.google.com/apps/publish/#identifier']

    def parse(self, response):
        # Fill in and submit the login form found on the start page.
        return [FormRequest.from_response(response,
                                          formdata={'Email': 'devaccnt@gmail.com', 'Passwd': 'devpwd'},
                                          callback=self.after_login)]

    def after_login(self, response):
        if "wrong" in str(response.body):
            self.log("Login failed", level=logging.ERROR)
            return
        # We've successfully authenticated, let's have some fun!
        print("Login Successful!!")
        return Request(url="https://play.google.com/apps/publish/?dev_acc=14592564207369815#AppListPlace",
                       meta={'handle_httpstatus_list': [302],
                             'dont_redirect': True},
                       callback=self.parse_tastypage)

    def parse_tastypage(self, response):
        # Dump the raw response body to a file for inspection.
        print("---------------------")
        filename = 'response.html'
        print(filename)
        with open(filename, 'wb') as f:
            f.write(response.body)
        print("---------------------")

Note: the indentation is correct in my original script.
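Answer 0 (score: 1)
I think what is happening is exactly the opposite: Scrapy is not following the redirect. Here is a sample scrapy shell session in which you can see that the HTTP response code is 200, not 302: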
$ scrapy shell 'https://play.google.com/apps/publish/?dev_acc=14592564207369815#AppListPlace'
2017-02-07 10:30:45 [scrapy.utils.log] INFO: Scrapy 1.3.0 started (bot: scrapybot)
(...)
2017-02-07 10:30:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://play.google.com/apps/publish/?dev_acc=14592564207369815#AppListPlace> (referer: None)
>>> print(response.text)
<!DOCTYPE html><html><head><title>Redirecting...</title><script type="text/javascript" language="javascript">var url = 'https:\/\/accounts.google.com\/ServiceLogin?service\x3dandroiddeveloper\x26passive\x3d1209600\x26continue\x3dhttps:\/\/play.google.com\/apps\/publish\/?dev_acc%3D14592564207369815%23__HASH__\x26followup\x3dhttps:\/\/play.google.com\/apps\/publish\/?dev_acc%3D14592564207369815'; var fragment = ''; if (self.document.location.hash) {fragment = self.document.location.hash.replace(/^#/,'');}url = url.replace(new RegExp("__HASH__", 'g'), encodeURIComponent(fragment));window.location.assign(url);</script><noscript><meta http-equiv="refresh" content="0; url='https://accounts.google.com/ServiceLogin?service=androiddeveloper&passive=1209600&continue=https://play.google.com/apps/publish/?dev_acc%3D14592564207369815&followup=https://play.google.com/apps/publish/?dev_acc%3D14592564207369815'"></meta></noscript></head><body></body></html>
Scrapy does not interpret JavaScript, but it should be able to understand this:
<noscript>
<meta http-equiv="refresh" content="0; url='https://accounts.google.com/ServiceLogin?service=androiddeveloper&passive=1209600&continue=https://play.google.com/apps/publish/?dev_acc%3D14813004207305910035&followup=https://play.google.com/apps/publish/?dev_acc%3D14813004207305910035'">
</meta>
</noscript>
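But it does not.
The part of the framework responsible for this kind of meta-refresh redirect is scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware. It is currently implemented to look only for meta refresh information that is not inside <script> or <noscript> (see scrapy.utils.response.get_meta_refresh).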
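To illustrate what "not inside <script> or <noscript>" means in practice, here is a minimal standard-library sketch of the idea: strip the ignored tags together with their content, then search what is left for a meta refresh. This is not Scrapy's or w3lib's actual code. The names strip_tags_with_content and find_meta_refresh, the simplified regular expression, and the example.com URLs are all assumptions made for illustration.

```python
import re
from urllib.parse import urljoin

# Simplified pattern for <meta http-equiv="refresh" content="N; url=...">.
# Assumption: http-equiv appears before content, as in the page above.
META_REFRESH_RE = re.compile(
    r"""<meta\b[^>]*?http-equiv\s*=\s*["']?refresh["']?"""
    r"""[^>]*?content\s*=\s*(?P<quote>["'])(?P<interval>[\d.]+)\s*;\s*"""
    r"""url=\s*(?P<url>.*?)(?P=quote)""",
    re.IGNORECASE | re.DOTALL,
)


def strip_tags_with_content(html, tags):
    """Remove each listed element, including everything between its tags."""
    for tag in tags:
        pattern = r"<%s\b.*?</%s>" % (re.escape(tag), re.escape(tag))
        html = re.sub(pattern, "", html, flags=re.IGNORECASE | re.DOTALL)
    return html


def find_meta_refresh(html, base_url, ignore_tags=("script", "noscript")):
    """Return (interval, absolute_url), or (None, None) if nothing is found."""
    remaining = strip_tags_with_content(html, ignore_tags)
    match = META_REFRESH_RE.search(remaining)
    if match is None:
        return None, None
    interval = float(match.group("interval"))
    url = match.group("url").strip(" \"'")  # drop the inner quotes around the URL
    return interval, urljoin(base_url, url)


# A cut-down stand-in for the "Redirecting..." page (hypothetical URLs).
PAGE = (
    "<html><head><title>Redirecting...</title>"
    "<script>window.location.assign('https://example.com/js');</script>"
    "<noscript><meta http-equiv=\"refresh\" "
    "content=\"0; url='https://example.com/login?next=1'\"></meta></noscript>"
    "</head><body></body></html>"
)

# Ignoring both tags hides the only meta refresh on the page.
print(find_meta_refresh(PAGE, "https://example.com/"))
# (None, None)

# Ignoring only <script> exposes the one inside <noscript>.
print(find_meta_refresh(PAGE, "https://example.com/", ignore_tags=("script",)))
# (0.0, 'https://example.com/login?next=1')
```

With both tags ignored, the refresh in <noscript> is never seen, which matches the behaviour described above: the page comes back as a plain 200 response and no redirect is followed.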
You can change this with a custom MetaRefreshMiddleware that also looks for meta refresh information inside <noscript> elements. For example, telling w3lib to ignore only <script> finds the redirect URL:
>>> w3lib.html.get_meta_refresh(response.text, response.url, response.encoding, ignore_tags=('script',))
(0.0, 'https://accounts.google.com/ServiceLogin?service=androiddeveloper&passive=1209600&continue=https://play.google.com/apps/publish/?dev_acc%3D14592564207369815&followup=https://play.google.com/apps/publish/?dev_acc%3D14592564207369815')
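As a possible alternative to subclassing the middleware, recent Scrapy releases document a METAREFRESH_IGNORE_TAGS setting that controls which enclosing tags MetaRefreshMiddleware skips. The fragment below is only a sketch and assumes your installed version supports that setting; as far as I know it was added after the Scrapy 1.3.0 shown in the session above, so check the release notes first. The CUSTOM_SETTINGS name is a placeholder for illustration.

```python
# settings.py fragment (sketch): ignore meta refresh tags only when they sit
# inside <script>, so the one inside <noscript> gets followed.
# Assumption: the installed Scrapy version supports METAREFRESH_IGNORE_TAGS.
METAREFRESH_ENABLED = True
METAREFRESH_IGNORE_TAGS = ["script"]

# The same values in dictionary form, suitable for assigning to a spider's
# `custom_settings` class attribute instead of editing settings.py.
CUSTOM_SETTINGS = {
    "METAREFRESH_ENABLED": METAREFRESH_ENABLED,
    "METAREFRESH_IGNORE_TAGS": METAREFRESH_IGNORE_TAGS,
}
```

Note that the request in the question passes 'dont_redirect': True in meta. MetaRefreshMiddleware leaves such requests alone, so that key would have to be removed for the meta refresh to be followed at all.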