Downloading a PDF with requests and Scrapy

Date: 2015-12-03 22:56:49

Tags: python pdf scrapy python-requests

I have been trying to download a PDF using the requests library inside a Scrapy item pipeline. However, the downloaded PDF is always 0 bytes, and I get an "Error processing" message whose traceback points to the line where I make the original request. I can't find the mistake based on the questions and documentation I have read so far. The URL is passed in through the item's 'url' field. Can anyone tell me what I'm doing wrong?

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import requests

class SherlockPipeline(object):

    def process_item(self, item, spider):
        pdf_url = item['url']
        local_filename = item['url'].split('/')[-1]
        request = requests.request('GET', pdf_url)
        with open(local_filename, 'wb') as f:
            for buff in request.iter_content(128):
                f.write(buff)  # write each chunk to the file
        return item

The log output is as follows:

2015-12-03 17:42:42 [requests.packages.urllib3.connectionpool] INFO: Starting new HTTP connection (1): www.privacybydesign.ca
2015-12-03 17:42:42 [requests.packages.urllib3.connectionpool] DEBUG: "GET /content/uploads/2014/09/pbd-de-identifcation-essential.pdf HTTP/1.1" 302 275
2015-12-03 17:42:42 [requests.packages.urllib3.connectionpool] INFO: Starting new HTTPS connection (1): www.privacybydesign.ca
2015-12-03 17:42:43 [scrapy] ERROR: Error processing {'url': u'http://www.privacybydesign.ca/content/uploads/2014/09/pbd-de-identifcation-essential.pdf'}
Traceback (most recent call last):
  File "C:\Users\pinky\Anaconda2\lib\site-packages\twisted\internet\defer.py", line 588, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "C:\Users\pinky\Documents\work\sherlock\sherlock\pipelines.py", line 16, in process_item
    request = requests.request('GET', pdf_url)
  File "C:\Users\pinky\Anaconda2\lib\site-packages\requests\api.py", line 50, in request
    response = session.request(method=method, url=url, **kwargs)
  File "C:\Users\pinky\Anaconda2\lib\site-packages\requests\sessions.py", line 468, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Users\pinky\Anaconda2\lib\site-packages\requests\sessions.py", line 597, in send
    history = [resp for resp in gen] if allow_redirects else []
  File "C:\Users\pinky\Anaconda2\lib\site-packages\requests\sessions.py", line 195, in resolve_redirects
    **adapter_kwargs
  File "C:\Users\pinky\Anaconda2\lib\site-packages\requests\sessions.py", line 576, in send
    r = adapter.send(request, **kwargs)
  File "C:\Users\pinky\Anaconda2\lib\site-packages\requests\adapters.py", line 433, in send
    raise SSLError(e, request=request)

3 Answers:

Answer 0 (score: 1)

Scrapy already has built-in pipelines to download files and images, and they work well.
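For reference, enabling the built-in FilesPipeline is mostly a matter of settings plus a `file_urls` field on the item. A minimal sketch, with the caveat that the `FILES_STORE` directory name here is an assumption (any writable path works):

```python
# settings.py -- enable Scrapy's built-in files pipeline
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
FILES_STORE = 'downloaded_pdfs'  # assumed directory; FilesPipeline saves files under this path

# In the spider, yield items whose 'file_urls' field lists the PDF URLs;
# the pipeline downloads each URL and records the results under 'files':
#     yield {'file_urls': [response.urljoin(pdf_href)]}
```

This also handles redirects and retries through Scrapy's own downloader, so no second HTTP client is needed in the pipeline.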

Answer 1 (score: 0)

Try passing the requests `stream` keyword argument as `True`:

request = requests.get(pdf_url, stream=True)

Also, there is no need to spell the request out like this, since `requests.get` is a shortcut for it:

request = requests.request('GET', pdf_url, stream=True)

For details on `stream`, see:

http://docs.python-requests.org/en/latest/user/advanced/#body-content-workflow
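Putting the pieces above together, here is a minimal sketch of the download step with streaming enabled. The `filename_from_url` helper, the `raise_for_status()` check, and the 8192-byte chunk size are additions for illustration, not part of the original answer:

```python
import os

import requests


def filename_from_url(url):
    """Derive a local file name from the last path segment of a URL."""
    return url.rstrip('/').split('/')[-1]


def download_pdf(url, dest_dir='.'):
    """Stream a PDF to disk so the body is not buffered in memory all at once."""
    local_filename = os.path.join(dest_dir, filename_from_url(url))
    response = requests.get(url, stream=True)
    response.raise_for_status()  # surface HTTP errors instead of writing an empty file
    with open(local_filename, 'wb') as f:
        for chunk in response.iter_content(8192):
            f.write(chunk)
    return local_filename
```

Calling `raise_for_status()` before writing is what turns a silent 0-byte file into a visible exception when the server responds with an error.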

Answer 2 (score: 0)

In the end I could not fix the problem by changing anything in the pipeline, so I changed the spider to yield a dictionary instead of an item. That worked, although I am not sure why.