Python Scrapy: response body only shows a redirect

Time: 2017-02-07 07:55:25

Tags: python html scrapy scrapy-spider

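I am trying to build a tool that scrapes my Google Developer Console account. When I run the spider, it appears to log in successfully and the log looks fine. But when I request another page and write response.body to a file, I get the following (response.html):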

  

<!DOCTYPE html><html><head><title>Redirecting...</title><script type="text/javascript" language="javascript">var url = 'https:\/\/accounts.google.com\/ServiceLogin?service\x3dandroiddeveloper\x26passive\x3d1209600\x26continue\x3dhttps:\/\/play.google.com\/apps\/publish\/?dev_acc%3D14813004207305910035%23__HASH__\x26followup\x3dhttps:\/\/play.google.com\/apps\/publish\/?dev_acc%3D14813004207305910035'; var fragment = ''; if (self.document.location.hash) {fragment = self.document.location.hash.replace(/^#/,'');}url = url.replace(new RegExp("__HASH__", 'g'), encodeURIComponent(fragment));window.location.assign(url);</script><noscript><meta http-equiv="refresh" content="0; url='https://accounts.google.com/ServiceLogin?service&#61;androiddeveloper&amp;passive&#61;1209600&amp;continue&#61;https://play.google.com/apps/publish/?dev_acc%3D14813004207305910035&amp;followup&#61;https://play.google.com/apps/publish/?dev_acc%3D14813004207305910035'"></meta></noscript></head><body></body></html>

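So basically, as I understand it, this is a simple HTML page with no body and the title -> Redirecting...

I assume the spider crawled the page before it had finished loading. I did some research and tried adding meta={'handle_httpstatus_list': [302],'dont_redirect': True} to the Request, but it made no difference.

Here is my spider: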

import logging

import scrapy
from scrapy.http import FormRequest, Request


class LoginSpider(scrapy.Spider):
    name = 'super'
    start_urls = ['https://accounts.google.com/ServiceLogin?service=androiddeveloper&passive=1209600&continue=https://play.google.com/apps/publish/%23&followup=https://play.google.com/apps/publish/#identifier']

    def parse(self, response):
        # Fill in and submit the login form found on the sign-in page.
        return [FormRequest.from_response(
            response,
            formdata={'Email': 'devaccnt@gmail.com', 'Passwd': 'devpwd'},
            callback=self.after_login)]

    def after_login(self, response):
        if "wrong" in str(response.body):
            self.log("Login failed", level=logging.ERROR)
            return
        # We've successfully authenticated, let's have some fun!
        print("Login Successful!!")
        return Request(
            url="https://play.google.com/apps/publish/?dev_acc=14592564207369815#AppListPlace",
            meta={'handle_httpstatus_list': [302], 'dont_redirect': True},
            callback=self.parse_tastypage)

    def parse_tastypage(self, response):
        # Dump the raw response body to disk for inspection.
        print("---------------------")
        filename = 'response.html'
        print(filename)
        with open(filename, 'wb') as f:
            f.write(response.body)
        print("---------------------")

**Never mind the indentation; it is fine in the original script.

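1 answer:

Answer 0: (score: 1)

I think what is happening is precisely the opposite: Scrapy is not following the redirect. Here is a sample scrapy shell session in which you can see that the HTTP response code is 200, not 302: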

$ scrapy shell 'https://play.google.com/apps/publish/?dev_acc=14592564207369815#AppListPlace'
2017-02-07 10:30:45 [scrapy.utils.log] INFO: Scrapy 1.3.0 started (bot: scrapybot)
(...)
2017-02-07 10:30:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://play.google.com/apps/publish/?dev_acc=14592564207369815#AppListPlace> (referer: None)
>>> print(response.text)
<!DOCTYPE html><html><head><title>Redirecting...</title><script type="text/javascript" language="javascript">var url = 'https:\/\/accounts.google.com\/ServiceLogin?service\x3dandroiddeveloper\x26passive\x3d1209600\x26continue\x3dhttps:\/\/play.google.com\/apps\/publish\/?dev_acc%3D14592564207369815%23__HASH__\x26followup\x3dhttps:\/\/play.google.com\/apps\/publish\/?dev_acc%3D14592564207369815'; var fragment = ''; if (self.document.location.hash) {fragment = self.document.location.hash.replace(/^#/,'');}url = url.replace(new RegExp("__HASH__", 'g'), encodeURIComponent(fragment));window.location.assign(url);</script><noscript><meta http-equiv="refresh" content="0; url='https://accounts.google.com/ServiceLogin?service&#61;androiddeveloper&amp;passive&#61;1209600&amp;continue&#61;https://play.google.com/apps/publish/?dev_acc%3D14592564207369815&amp;followup&#61;https://play.google.com/apps/publish/?dev_acc%3D14592564207369815'"></meta></noscript></head><body></body></html>

Scrapy does not interpret JavaScript, but it should be able to understand this:

<noscript>
<meta http-equiv="refresh" content="0; url='https://accounts.google.com/ServiceLogin?service&#61;androiddeveloper&amp;passive&#61;1209600&amp;continue&#61;https://play.google.com/apps/publish/?dev_acc%3D14813004207305910035&amp;followup&#61;https://play.google.com/apps/publish/?dev_acc%3D14813004207305910035'">
</meta>
</noscript>

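But it does not.

The part of the framework responsible for this kind of meta refresh redirect is scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware.

It is currently implemented to look for meta refresh information that is not inside <script> or <noscript> (see scrapy.utils.response.get_meta_refresh).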
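The effect of that exclusion list can be reproduced without Scrapy. The following is a minimal standard-library sketch, not Scrapy's or w3lib's actual implementation; RefreshFinder, find_meta_refresh, and the example.com page are hypothetical names made up for illustration. It reports a meta refresh only when the tag is not nested inside one of the ignored elements.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Matches content values such as "0; url='https://example.com/'".
CONTENT_RE = re.compile(
    r"\s*(\d*\.?\d+)\s*;\s*url\s*=\s*['\"]?(.*?)['\"]?\s*$", re.I | re.S)


class RefreshFinder(HTMLParser):
    """Hypothetical helper: records the first meta refresh outside ignored tags."""

    def __init__(self, ignore_tags):
        super().__init__()
        self.ignore_tags = set(ignore_tags)
        self.depth = 0  # number of ignored elements we are currently inside
        self.result = (None, None)

    def handle_starttag(self, tag, attrs):
        if tag in self.ignore_tags:
            self.depth += 1
            return
        attrs = dict(attrs)  # HTMLParser has already unescaped the values
        if (tag == "meta" and self.depth == 0 and self.result[1] is None
                and (attrs.get("http-equiv") or "").lower() == "refresh"):
            match = CONTENT_RE.match(attrs.get("content") or "")
            if match:
                self.result = (float(match.group(1)), match.group(2))

    def handle_endtag(self, tag):
        if tag in self.ignore_tags and self.depth:
            self.depth -= 1


def find_meta_refresh(html, base_url="", ignore_tags=("script", "noscript")):
    """Return (delay, absolute_url), or (None, None) if nothing is found."""
    finder = RefreshFinder(ignore_tags)
    finder.feed(html)
    finder.close()
    delay, url = finder.result
    return (delay, urljoin(base_url, url)) if url else (None, None)


# Made-up page shaped like the one in the question.
page = ("<html><head><noscript><meta http-equiv=\"refresh\" "
        "content=\"0; url='https://example.com/login?a&#61;1&amp;b&#61;2'\">"
        "</meta></noscript></head><body></body></html>")

print(find_meta_refresh(page))
# (None, None)
print(find_meta_refresh(page, ignore_tags=("script",)))
# (0.0, 'https://example.com/login?a=1&b=2')
```

With the default exclusion list the tag inside <noscript> is skipped; dropping 'noscript' from the list is what exposes the redirect target.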

You can change this with a custom MetaRefreshMiddleware that also looks for meta refresh inside <noscript> elements:

>>> import w3lib.html
>>> w3lib.html.get_meta_refresh(response.text, response.url, response.encoding, ignore_tags=('script',))
(0.0, 'https://accounts.google.com/ServiceLogin?service=androiddeveloper&passive=1209600&continue=https://play.google.com/apps/publish/?dev_acc%3D14592564207369815&followup=https://play.google.com/apps/publish/?dev_acc%3D14592564207369815')
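A custom middleware along these lines could be sketched as follows. This is an illustration under stated assumptions, not the stock MetaRefreshMiddleware: the names NoscriptMetaRefreshMiddleware, noscript_meta_refresh, and MAX_DELAY are hypothetical, the extraction is a simplified regex that assumes double-quoted attributes (as in the page above) rather than w3lib's parser, and there is no redirect-loop protection. It relies only on the documented downloader middleware contract, under which process_response may return either the response or a new request, and on Request.replace.

```python
import html
import re
from urllib.parse import urljoin

SCRIPT_RE = re.compile(r"<script\b.*?</script\s*>", re.I | re.S)
# Simplification: assumes http-equiv comes before content, both double-quoted.
META_RE = re.compile(
    r'<meta\b[^>]*?http-equiv="refresh"[^>]*?content="([^"]*)"', re.I | re.S)
CONTENT_RE = re.compile(
    r"\s*(\d*\.?\d+)\s*;\s*url\s*=\s*['\"]?(.*?)['\"]?\s*$", re.I | re.S)

MAX_DELAY = 100  # seconds; hypothetical cutoff above which a refresh is ignored


def noscript_meta_refresh(text, base_url):
    """Return (delay, url) for the first meta refresh outside <script>."""
    text = SCRIPT_RE.sub("", text)  # drop scripts, keep <noscript> content
    meta = META_RE.search(text)
    if not meta:
        return None, None
    content = CONTENT_RE.match(html.unescape(meta.group(1)))
    if not content:
        return None, None
    return float(content.group(1)), urljoin(base_url, content.group(2))


class NoscriptMetaRefreshMiddleware:
    """Hypothetical downloader middleware: follows meta refresh in <noscript>."""

    def process_response(self, request, response, spider):
        # Non-text responses expose no usable .text, so leave them alone.
        text = getattr(response, "text", None)
        if request.meta.get("dont_redirect") or request.method == "HEAD" or not text:
            return response
        delay, url = noscript_meta_refresh(text, response.url)
        if url and delay < MAX_DELAY:
            # Returning a request makes Scrapy reschedule it for download
            # instead of passing this response on to the spider callback.
            return request.replace(url=url, method="GET", body="")
        return response
```

Such a class would be enabled through the DOWNLOADER_MIDDLEWARES setting, for example at priority 580, the slot the built-in MetaRefreshMiddleware occupies in DOWNLOADER_MIDDLEWARES_BASE. Note that the request in the question sets 'dont_redirect': True, which this sketch, like the stock middleware, treats as an opt-out, so that meta key would have to be removed for the redirect to be followed.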
