Question

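I am trying to build a tool that scrapes a Google Developer Console account. When I run the spider, it appears to log in successfully and the log looks fine. But when I then request another page and write response.body to a file, I get the following (response.html):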

<!DOCTYPE html><html><head><title>Redirecting...</title><script type="text/javascript" language="javascript">var url = 'https:\/\/accounts.google.com\/ServiceLogin?service\x3dandroiddeveloper\x26passive\x3d1209600\x26continue\x3dhttps:\/\/play.google.com\/apps\/publish\/?dev_acc%3D14813004207305910035%23__HASH__\x26followup\x3dhttps:\/\/play.google.com\/apps\/publish\/?dev_acc%3D14813004207305910035'; var fragment = ''; if (self.document.location.hash) {fragment = self.document.location.hash.replace(/^#/,'');}url = url.replace(new RegExp("__HASH__", 'g'), encodeURIComponent(fragment));window.location.assign(url);</script><noscript><meta http-equiv="refresh" content="0; url='https://accounts.google.com/ServiceLogin?service=androiddeveloper&passive=1209600&continue=https://play.google.com/apps/publish/?dev_acc%3D14813004207305910035&followup=https://play.google.com/apps/publish/?dev_acc%3D14813004207305910035'"></meta></noscript></head><body></body></html>

So, as far as I understand, this is a bare HTML page with no body and the title -> Redirecting...

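I assume the spider crawled the page before it had finished loading. I did some research and tried adding meta={'handle_httpstatus_list': [302],'dont_redirect': True} to the Request, but it made no difference.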

Here is my spider:

import logging

import scrapy
from scrapy.http import FormRequest, Request


class LoginSpider(scrapy.Spider):
    name = 'super'
    start_urls = ['https://accounts.google.com/ServiceLogin?service=androiddeveloper&passive=1209600&continue=https://play.google.com/apps/publish/%23&followup=https://play.google.com/apps/publish/#identifier']

    def parse(self, response):
        # Fill in and submit the login form found on the sign-in page.
        return [FormRequest.from_response(response,
                    formdata={'Email': 'devaccnt@gmail.com', 'Passwd': 'devpwd'},
                    callback=self.after_login)]

    def after_login(self, response):
        # response.body is bytes, so compare against a bytes literal.
        if b"wrong" in response.body:
            self.log("Login failed", level=logging.ERROR)
            return
        # We've successfully authenticated, let's have some fun!
        print("Login Successful!!")
        return Request(url="https://play.google.com/apps/publish/?dev_acc=14592564207369815#AppListPlace",
                       meta={'handle_httpstatus_list': [302],
                             'dont_redirect': True},
                       callback=self.parse_tastypage)

    def parse_tastypage(self, response):
        # Dump the raw response body to disk for inspection.
        print("---------------------")
        filename = 'response.html'
        print(filename)
        with open(filename, 'wb') as f:
            f.write(response.body)
        print("---------------------")

Answer 1

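I think the opposite is happening: Scrapy is not following the redirect. Here is a sample scrapy shell session, where you can see that the HTTP response code is 200, not 302: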

$ scrapy shell 'https://play.google.com/apps/publish/?dev_acc=14592564207369815#AppListPlace'
2017-02-07 10:30:45 [scrapy.utils.log] INFO: Scrapy 1.3.0 started (bot: scrapybot)
(...)
2017-02-07 10:30:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://play.google.com/apps/publish/?dev_acc=14592564207369815#AppListPlace> (referer: None)
>>> print(response.text)
<!DOCTYPE html><html><head><title>Redirecting...</title><script type="text/javascript" language="javascript">var url = 'https:\/\/accounts.google.com\/ServiceLogin?service\x3dandroiddeveloper\x26passive\x3d1209600\x26continue\x3dhttps:\/\/play.google.com\/apps\/publish\/?dev_acc%3D14592564207369815%23__HASH__\x26followup\x3dhttps:\/\/play.google.com\/apps\/publish\/?dev_acc%3D14592564207369815'; var fragment = ''; if (self.document.location.hash) {fragment = self.document.location.hash.replace(/^#/,'');}url = url.replace(new RegExp("__HASH__", 'g'), encodeURIComponent(fragment));window.location.assign(url);</script><noscript><meta http-equiv="refresh" content="0; url='https://accounts.google.com/ServiceLogin?service&#61;androiddeveloper&amp;passive&#61;1209600&amp;continue&#61;https://play.google.com/apps/publish/?dev_acc%3D14592564207369815&amp;followup&#61;https://play.google.com/apps/publish/?dev_acc%3D14592564207369815'"></meta></noscript></head><body></body></html>
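
The script in the body above builds its redirect target from an escaped JavaScript string literal. As a rough illustration only, the sketch below reads that literal out of the script text and decodes the `\/` and `\xNN` escapes using nothing but the standard library. The `SCRIPT` sample and the `extract_js_redirect` helper are hypothetical names made up for this example (the host is a placeholder), and the regular expression assumes the `var url = '...'` shape seen in this particular page.

```python
import re

# Shortened, hypothetical stand-in for the <script> content in the response body.
SCRIPT = (
    r"var url = 'https:\/\/accounts.example.com\/ServiceLogin"
    r"?service\x3ddemo\x26passive\x3d1'; var fragment = '';"
)


def extract_js_redirect(script_text):
    """Return the decoded URL assigned to `var url`, or None if absent."""
    match = re.search(r"var url = '([^']*)'", script_text)
    if match is None:
        return None
    literal = match.group(1)
    # Turn \xNN escapes (e.g. \x3d for "=") back into characters.
    decoded = re.sub(
        r"\\x([0-9a-fA-F]{2})",
        lambda m: chr(int(m.group(1), 16)),
        literal,
    )
    # JavaScript allows "\/" as an escaped forward slash.
    return decoded.replace("\\/", "/")


if __name__ == "__main__":
    print(extract_js_redirect(SCRIPT))
```

Decoding the literal this way only shows where a browser would be sent; it does not execute the script or substitute the `__HASH__` placeholder.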

Scrapy does not interpret JavaScript, but it should be able to understand this:

<noscript>
<meta http-equiv="refresh" content="0; url='https://accounts.google.com/ServiceLogin?service&#61;androiddeveloper&amp;passive&#61;1209600&amp;continue&#61;https://play.google.com/apps/publish/?dev_acc%3D14813004207305910035&amp;followup&#61;https://play.google.com/apps/publish/?dev_acc%3D14813004207305910035'">
</meta>
</noscript>

But it does not.

The part of the framework responsible for this kind of meta refresh redirect is scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware.

It is currently implemented to look for meta refresh information that is not inside <script> or <noscript> (see scrapy.utils.response.get_meta_refresh).
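
To make the effect of ignoring `<noscript>` concrete, here is a minimal standard-library sketch of the general idea: strip the ignored elements first, then search what is left for a meta refresh tag. It is a simplified illustration, not the actual Scrapy or w3lib implementation; `SAMPLE`, `strip_elements`, and `find_meta_refresh` are hypothetical names, and the sample page is a shortened stand-in with placeholder URLs.

```python
import html
import re

# Shortened, hypothetical stand-in for the "Redirecting..." page.
SAMPLE = (
    "<html><head><title>Redirecting...</title>"
    "<script>window.location.assign('https://example.com/js');</script>"
    "<noscript><meta http-equiv=\"refresh\" "
    "content=\"0; url='https://example.com/login?service&#61;demo&amp;passive&#61;1'\">"
    "</meta></noscript></head><body></body></html>"
)

# Matches <meta http-equiv="refresh" content="DELAY; url=TARGET">.
META_REFRESH_RE = re.compile(
    r"<meta\b[^>]*http-equiv\s*=\s*[\"']?refresh[\"']?[^>]*"
    r"content\s*=\s*\"\s*(?P<delay>[\d.]+)\s*;\s*url\s*=\s*(?P<url>[^\"]*)\"",
    re.IGNORECASE | re.DOTALL,
)


def strip_elements(text, tags):
    """Remove each listed element together with its content."""
    for tag in tags:
        text = re.sub(
            rf"<{tag}\b.*?</{tag}\s*>", "", text,
            flags=re.IGNORECASE | re.DOTALL,
        )
    return text


def find_meta_refresh(text, ignore_tags=("script", "noscript")):
    """Return (delay, url) for the first meta refresh outside ignore_tags."""
    match = META_REFRESH_RE.search(strip_elements(text, ignore_tags))
    if match is None:
        return None
    # Decode HTML entities, then drop the quotes wrapped around the URL.
    url = html.unescape(match.group("url")).strip(" '")
    return float(match.group("delay")), url


if __name__ == "__main__":
    print(find_meta_refresh(SAMPLE))                           # noscript ignored
    print(find_meta_refresh(SAMPLE, ignore_tags=("script",)))  # noscript searched
```

With both tags ignored the search finds nothing, because the only meta refresh sits inside `<noscript>`. Ignoring just `script` exposes it.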

You can change this with a custom MetaRefreshMiddleware that also looks for meta refresh inside <noscript> elements:

>>> w3lib.html.get_meta_refresh(response.text, response.url, response.encoding, ignore_tags=('script',))
(0.0, 'https://accounts.google.com/ServiceLogin?service=androiddeveloper&passive=1209600&continue=https://play.google.com/apps/publish/?dev_acc%3D14592564207369815&followup=https://play.google.com/apps/publish/?dev_acc%3D14592564207369815')
