Error when downloading a PDF file

Date: 2018-03-26 17:51:14

Tags: scrapy

I have the following (simplified) code:

import os
import scrapy

class TestSpider(scrapy.Spider):
    name = 'test_spider'
    start_urls = ['http://www.pdf995.com/samples/pdf.pdf', ]

    def parse(self, response):
        save_path = 'test'
        file_name = 'test.pdf'
        self.save_page(response, save_path, file_name)

    def save_page(self, response, save_dir, file_name):
        os.makedirs(save_dir, exist_ok=True)
        with open(os.path.join(save_dir, file_name), 'wb') as afile:
            afile.write(response.body)

When I run it, I get this error:

[scrapy.core.scraper] ERROR: Error downloading <GET http://www.pdf995.com/samples/pdf.pdf>
Traceback (most recent call last):
File "C:\Python36\lib\site-packages\twisted\internet\defer.py", line 1301, in _inlineCallbacks
    result = g.send(result)
File "C:\Python36\lib\site-packages\scrapy\core\downloader\middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
File "C:\Python36\lib\site-packages\twisted\internet\defer.py", line 1278, in returnValue
    raise _DefGen_Return(val)
twisted.internet.defer._DefGen_Return: <200 http://www.pdf995.com/samples/pdf.pdf>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):   
File "C:\Python36\lib\site-packages\twisted\internet\defer.py", line 1301, in _inlineCallbacks
    result = g.send(result)   
File "C:\Python36\lib\site-packages\scrapy\core\downloader\middleware.py", line 53, in process_response
    spider=spider)   
File "C:\Python36\lib\site-packages\scrapy_beautifulsoup\middleware.py", line 16, in process_response
    return response.replace(body=str(BeautifulSoup(response.body, self.parser)))   
File "C:\Python36\lib\site-packages\scrapy\http\response\__init__.py", line 79, in replace
    return cls(*args, **kwargs)   
File "C:\Python36\lib\site-packages\scrapy\http\response\__init__.py", line 20, in __init__
    self._set_body(body)   
File "C:\Python36\lib\site-packages\scrapy\http\response\__init__.py", line 55, in _set_body
    "Response body must be bytes. " 
TypeError: Response body must be bytes. If you want to pass unicode body use TextResponse or HtmlResponse.

Do I need to add a middleware or something else to handle this? It seems like this should work, at least according to other examples.

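Note: I am not using a pipeline at the moment, because my real spider runs many checks: whether the related item has already been deleted, whether this PDF belongs to that item, and whether a PDF with the custom name has already been downloaded. As mentioned above, many samples do what I am doing here, so I assumed it would be simpler and would work.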

1 answer:

Answer 0 (score: 1)

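The problem is caused by your scrapy_beautifulsoup\middleware.py, which tries to replace the response body with a str via return response.replace(body=str(BeautifulSoup(response.body, self.parser))). As the traceback shows, the body of a plain Response (which is what the PDF download produces) must be bytes, hence the TypeError.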

You need to correct that, and doing so should resolve the problem.
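
One way to correct it, as a rough sketch rather than the package's actual code: only run BeautifulSoup on HTML responses and return everything else (such as the PDF) unchanged. The class name GuardedBeautifulSoupMiddleware, the helper is_html, the HTML_TYPES tuple and the "html.parser" default are all assumptions made for illustration; the sketch also assumes that Scrapy builds an HtmlResponse (with .text and .encoding) whenever the Content-Type is HTML.

```python
# Media types treated as HTML in this sketch (an assumption, adjust as needed).
HTML_TYPES = (b"text/html", b"application/xhtml+xml")


def is_html(content_type):
    """Return True if a raw Content-Type header value names an HTML document."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(b";", 1)[0].strip().lower()
    return media_type in HTML_TYPES


class GuardedBeautifulSoupMiddleware:
    """Hypothetical downloader middleware: reparse HTML only, pass the rest through."""

    parser = "html.parser"  # assumed parser name for this sketch

    def process_response(self, request, response, spider):
        # Scrapy header values are bytes; a missing header yields None.
        content_type = response.headers.get("Content-Type") or b""
        if not is_html(content_type):
            # Binary bodies (PDF, images, ...) are returned untouched.
            return response

        # Imported here so that only HTML responses require bs4.
        from bs4 import BeautifulSoup

        # Assumes an HtmlResponse here, which provides .text and .encoding.
        soup = BeautifulSoup(response.text, self.parser)
        # Response bodies must be bytes, so encode the rebuilt markup explicitly.
        return response.replace(body=str(soup).encode(response.encoding))
```

To try something like this, the class would be registered in the DOWNLOADER_MIDDLEWARES setting in place of the scrapy_beautifulsoup entry. With the guard in place, response.body in parse() is still the raw PDF bytes, so the save_page method from the question can write it to disk as is.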