Question

I have a process (external to Scrapy) that generates a list of URLs to PDF documents, along with a list of file paths where I want to save each PDF.

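The following explains how to pass a list of URLs to Scrapy as a command-line argument. However, is there a way to also pass the file paths and make sure each PDF is saved to the path supplied for it?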

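I suspect I need to modify the following, based on the tutorial provided in the documentation, but as I understand it the parse method decides how to handle a single response and does not deal with lists.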

def parse(self, response):
    filename = response.url.split("/")[-2] + '.html'
    with open(filename, 'wb') as f:
        f.write(response.body)

Any suggestions?

Answer 1

It turns out this was a Python question rather than a Scrapy one. The following turned out to be the solution I was after.

# To run:
# > scrapy runspider pdfGetter.py -a urlList=/path/to/file.txt -a pathList=/path/to/another/file.txt

import scrapy
class pdfGetter(scrapy.Spider):
    name = "pdfGetter"

    def __init__(self, urlList='', pathList='', *args, **kwargs):
        super().__init__(*args, **kwargs)
        # One URL per line; line i of pathList is where URL i gets saved.
        with open(urlList) as urlFile:
            self.start_urls = [url.strip() for url in urlFile]

        with open(pathList) as pathFile:
            self.save_urls = [path.strip() for path in pathFile]

    def parse(self, response):
        idx = self.start_urls.index(response.url)
        with open(self.save_urls[idx], 'wb') as f:
            f.write(response.body)
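
The lookup in parse relies on the two files staying aligned line for line, and on response.url matching the URL that was requested (a redirect would break that). A minimal sketch of a stricter loader, assuming one entry per line in each file; the helper name load_url_path_pairs is hypothetical and not part of Scrapy:

```python
from pathlib import Path


def load_url_path_pairs(url_file, path_file):
    """Pair entry i of url_file with entry i of path_file, ignoring blank lines."""
    urls = [ln.strip() for ln in Path(url_file).read_text().splitlines() if ln.strip()]
    paths = [ln.strip() for ln in Path(path_file).read_text().splitlines() if ln.strip()]
    # Fail early instead of silently saving a pdf under the wrong path.
    if len(urls) != len(paths):
        raise ValueError(f"{len(urls)} URLs but {len(paths)} save paths")
    return list(zip(urls, paths))
```

Each (url, path) pair could then travel with its own request, for example through the request's meta dict, so that parse would not need to search start_urls at all.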

Answer 2

If I'm right, you can't "scrape" a PDF with Scrapy, but if you just want to save the PDF you don't need to scrape it; you only need its URL. For example:

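from urllib.request import urlretrieve  # Python 3; in Python 2 this was urllib.urlretrieve

from scrapy import Spider


class MySpider(Spider):
    name = "myspider"
    start_urls = ['http://website-that-contains-pdf-urls']

    def parse(self, response):
        # Collect the pdf links from the page, then download each one directly.
        urls = response.xpath('//xpath/to/url/@href').extract()
        for url in urls:
            # response.urljoin resolves relative links against the page URL.
            urlretrieve(response.urljoin(url), filename="name-of-my-file.pdf")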
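
Note that the snippet above saves every URL under the same filename, so each download overwrites the previous one. A small sketch of one way around that, assuming the last path segment of each URL is an acceptable filename; filename_from_url and its default are hypothetical names used only for illustration:

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url, default="download.pdf"):
    """Return the last path segment of url, or default if there is none."""
    # urlsplit drops the query string and fragment from the path component.
    path = urlsplit(url).path
    name = posixpath.basename(unquote(path))
    return name or default
```

The result could be passed as the filename argument in the loop. Two different URLs can still end in the same segment, so this only reduces collisions rather than ruling them out.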

How do I pass a list of output file paths to Scrapy?
