So I've been trying to use the requests library inside a pipeline to download PDFs. However, the downloaded PDF is always 0 bytes, and I get an error: "Error processing", with a traceback pointing back to the line where I make the original request. I can't seem to find the mistake based on the questions and documentation I've read so far. The URL is passed in through the item's 'url' field. Can anyone tell me what I'm doing wrong?
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import requests


class SherlockPipeline(object):
    def process_item(self, item, spider):
        #pdf_url = []
        #pdf_url.append(item['url'])
        pdf_url = item['url']
        local_filename = item['url'].split('/')[-1]
        request = requests.request('GET', pdf_url)
        with open(local_filename, 'wb') as f:
            for buff in request.iter_content(128):
                f.write(buff)  # write it to file
        return item
The log output is as follows:
2015-12-03 17:42:42 [requests.packages.urllib3.connectionpool] INFO: Starting new HTTP connection (1): www.privacybydesign.ca
2015-12-03 17:42:42 [requests.packages.urllib3.connectionpool] DEBUG: "GET /content/uploads/2014/09/pbd-de-identifcation-essential.pdf HTTP/1.1" 302 275
2015-12-03 17:42:42 [requests.packages.urllib3.connectionpool] INFO: Starting new HTTPS connection (1): www.privacybydesign.ca
2015-12-03 17:42:43 [scrapy] ERROR: Error processing {'url': u'http://www.privacybydesign.ca/content/uploads/2014/09/pbd-de-identifcation-essential.pdf'}
Traceback (most recent call last):
  File "C:\Users\pinky\Anaconda2\lib\site-packages\twisted\internet\defer.py", line 588, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "C:\Users\pinky\Documents\work\sherlock\sherlock\pipelines.py", line 16, in process_item
    request = requests.request('GET', pdf_url)
  File "C:\Users\pinky\Anaconda2\lib\site-packages\requests\api.py", line 50, in request
    response = session.request(method=method, url=url, **kwargs)
  File "C:\Users\pinky\Anaconda2\lib\site-packages\requests\sessions.py", line 468, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Users\pinky\Anaconda2\lib\site-packages\requests\sessions.py", line 597, in send
    history = [resp for resp in gen] if allow_redirects else []
  File "C:\Users\pinky\Anaconda2\lib\site-packages\requests\sessions.py", line 195, in resolve_redirects
    **adapter_kwargs
  File "C:\Users\pinky\Anaconda2\lib\site-packages\requests\sessions.py", line 576, in send
    r = adapter.send(request, **kwargs)
  File "C:\Users\pinky\Anaconda2\lib\site-packages\requests\adapters.py", line 433, in send
    raise SSLError(e, request=request)
Answer 0 (score: 1)
Scrapy already has built-in pipelines to download files and images, and they work well.
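For reference, enabling the built-in FilesPipeline looks roughly like this (a minimal sketch assuming Scrapy 1.0+ module paths; the selector and the 'downloads' directory name are illustrative):

# settings.py -- turn on the built-in files pipeline
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = 'downloads'  # local directory where downloaded files are stored

# In the spider, put the PDF links in a 'file_urls' field; the pipeline
# downloads each URL and records the results under a 'files' field.
def parse(self, response):
    for href in response.css('a::attr(href)').extract():
        if href.endswith('.pdf'):
            yield {'file_urls': [response.urljoin(href)]}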
答案 1 :(得分:0)
Try making the request with the stream keyword argument set to True:
request = requests.get(pdf_url, stream=True)
Also, as you can see, there is no need to make the request like this:
request = requests.request('GET', pdf_url, stream=True)
For more details on stream, see:
http://docs.python-requests.org/en/latest/user/advanced/#body-content-workflow
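Applied to the pipeline in the question, the fix would look roughly like this (a sketch; the variable is renamed to response for clarity, and raise_for_status is added here to surface HTTP errors instead of silently writing an empty file):

import requests


class SherlockPipeline(object):
    def process_item(self, item, spider):
        pdf_url = item['url']
        local_filename = pdf_url.split('/')[-1]
        response = requests.get(pdf_url, stream=True)  # stream the body instead of loading it all at once
        response.raise_for_status()  # fail loudly on 4xx/5xx responses
        with open(local_filename, 'wb') as f:
            for chunk in response.iter_content(128):  # write in 128-byte chunks
                f.write(chunk)
        return item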
Answer 2 (score: 0)
In the end, I couldn't fix the problem by changing anything in the pipeline, so I changed the spider to yield dictionaries instead of items. That worked, though I'm not sure why.
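For illustration, the change in the spider amounts to something like this (the SherlockItem class name here is hypothetical; plain dicts are accepted from spiders since Scrapy 1.0):

# before: yielding a scrapy.Item subclass
# yield SherlockItem(url=pdf_url)

# after: yielding a plain dict with the same field
yield {'url': pdf_url}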