Python Scrapy: crawling from a local file: Content-Type undefined

Date: 2017-05-27 16:54:58

Tags: python-3.x file scrapy

I want Scrapy to crawl local HTML files, but I am stuck because the headers are missing the Content-Type field. I followed the tutorial here: Use Scrapy to crawl local XML file - Start URL local file address. So basically, I am pointing Scrapy at a local URL such as file:///Users/felix/myfile.html
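
For reference, a minimal spider built around such a file:// start URL might look like the sketch below (the spider name and file path are illustrative, not taken from the actual project):

import scrapy


class LocalFileSpider(scrapy.Spider):
    # Hypothetical minimal spider; the real project uses news-please's own crawlers.
    name = 'local_file_spider'
    start_urls = ['file:///Users/felix/myfile.html']

    def parse(self, response):
        # Note: for file:// responses Scrapy does not set a Content-Type
        # header, which is exactly what causes the crash described below.
        yield {'title': response.css('title::text').extract_first()}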

However, Scrapy crashes because, apparently (on macOS), the generated response object does not contain the required field Content-Type:

/Library/Frameworks/Python.framework/Versions/3.6/bin/python3.6 /Users/felix/IdeaProjects/news-please/newsplease/__init__.py
[scrapy.core.scraper:158|ERROR] Spider error processing <GET file:///Users/felix/IdeaProjects/news-please/newsplease/0a2199bdcef84d2bb2f920cf042c5919> (referer: None)
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
    for x in result:
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/Users/felix/IdeaProjects/news-please/newsplease/crawler/spiders/download_crawler.py", line 33, in parse
    if not self.helper.parse_crawler.content_type(response):
  File "/Users/felix/IdeaProjects/news-please/newsplease/helper_classes/parse_crawler.py", line 116, in content_type
    if not re.match('text/html', response.headers.get('Content-Type').decode('utf-8')):
AttributeError: 'NoneType' object has no attribute 'decode'
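
The failing line boils down to this: for a file:// response, Scrapy populates the headers without a Content-Type entry, so the lookup returns None and the decode call fails. A minimal reproduction of the failing expression, assuming an empty Headers object stands in for a file:// response's headers:

from scrapy.http import Headers

headers = Headers()  # a file:// response's headers have no Content-Type entry
content_type = headers.get('Content-Type')  # -> None
content_type.decode('utf-8')  # AttributeError: 'NoneType' object has no attribute 'decode'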

It has been suggested to run a simple HTTP server instead, see Python Scrapy on offline (local) data, but that is not an option here, mainly because of the overhead caused by running another server.
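
For completeness, that workaround amounts to serving the directory with Python's built-in HTTP server (port and path illustrative) and pointing Scrapy at http://localhost:8000/myfile.html instead:

cd /Users/felix && python3 -m http.server 8000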

I need to use Scrapy in the first place because we have a larger framework that uses Scrapy, and we plan to add the ability to crawl from local files to that framework. However, since there are several questions about how to crawl from local files (see the links above), I assume this problem is of general interest.

1 Answer:

Answer 0 (score: 2)

You can actually fork news-please, or change Scrapy, so that the function def content_type(self, response) in newsplease/helper_classes/parse_crawler.py returns True when the response comes from local storage.

The new file would look like this:

def content_type(self, response):
    """
    Ensures the response is of type text/html.

    :param obj response: The scrapy response
    :return bool: Determines whether the response is of the correct type
    """
    if response.url.startswith('file:///'):
        return True
    if not re.match('text/html', response.headers.get('Content-Type').decode('utf-8')):
        self.log.warn("Dropped: %s's content is not of type "
                      "text/html but %s", response.url,
                      response.headers.get('Content-Type'))
        return False
    else:
        return True
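
An alternative to the URL check above, if you would rather accept any response that lacks the header, is to test for the missing Content-Type directly. This is a sketch along the same lines, not part of the original answer:

def content_type(self, response):
    """
    Ensures the response is of type text/html.

    :param obj response: The scrapy response
    :return bool: Determines whether the response is of the correct type
    """
    content_type = response.headers.get('Content-Type')
    if content_type is None:
        # Local file:// responses carry no Content-Type header;
        # accept them instead of crashing on None.decode().
        return True
    if not re.match('text/html', content_type.decode('utf-8')):
        self.log.warn("Dropped: %s's content is not of type "
                      "text/html but %s", response.url, content_type)
        return False
    return True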