Question

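I need to use Scrapy to download a set of CSV files from an FTP server. But first I have to scrape a website (https://www.douglas.co.us/assessor/data-downloads/) to get the URLs of the CSV files on the FTP server. I have read how to download files in the documentation (Downloading and processing files and images).

Settings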

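import os  # needed for FILES_STORE below

custom_settings = {
    # Enable the built-in FilesPipeline.
    'ITEM_PIPELINES': {
        'scrapy.pipelines.files.FilesPipeline': 1,
    },
    # Store downloaded files in the directory of this module.
    'FILES_STORE': os.path.dirname(os.path.abspath(__file__)),
}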

Parse

def parse(self, response):
    self.logger.info("In parse method!!!")
    # Property Ownership
    property_ownership = response.xpath("//a[contains(., 'Property Ownership')]/@href").extract_first()

    # Property Location
    property_location = response.xpath("//a[contains(., 'Property Location')]/@href").extract_first()

    # Property Improvements
    property_improvements = response.xpath("//a[contains(., 'Property Improvements')]/@href").extract_first()

    # Property Value
    property_value = response.xpath("//a[contains(., 'Property Value')]/@href").extract_first()

    item = FiledownloadItem()
    self.insert_keyvalue(item, "file_urls", [property_ownership, property_location, property_improvements, property_value])

    yield item

But I get the following error:

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 67, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/pipelines/media.py", line 79, in process_item
    requests = arg_to_iter(self.get_media_requests(item, info))
  File "/usr/local/lib/python2.7/dist-packages/scrapy/pipelines/files.py", line 382, in get_media_requests
    return [Request(x) for x in item.get(self.files_urls_field, [])]
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 25, in __init__
    self._set_url(url)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 58, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: [

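The best explanation of my problem that I found is the answer to the question "scrapy error :exceptions.ValueError: Missing scheme in request url:", which explains that the problem is that the URLs to download are missing the "http://" prefix.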

What should I do in my case? Can I use FilesPipeline? Or do I need to do something different?

Thanks in advance.

Answer 1

ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: [

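According to the traceback, Scrapy thinks your file URL is '['.
My best guess is that you have a bug in your insert_keyvalue() method. Also, why have this method at all? A simple assignment should work.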
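
To illustrate how a URL of '[' could come about, here is a minimal sketch in plain Python 3 (no Scrapy needed). It assumes, since insert_keyvalue() is not shown in the question, that the list of URLs ends up stored as a string rather than as a list; that is only one possible cause. The URLs and the helper urls_seen_by_pipeline() are hypothetical and exist only to mimic the loop shown in the traceback.

```python
from urllib.parse import urlparse

# Hypothetical URLs standing in for the hrefs scraped in parse().
urls = [
    "ftp://example.com/data/ownership.csv",
    "ftp://example.com/data/location.csv",
]


def urls_seen_by_pipeline(item):
    """Mimic the loop from the traceback: one request per element of file_urls."""
    return [x for x in item.get("file_urls", [])]


# Plain assignment: file_urls stays a list, so each element is a full URL.
good_item = {"file_urls": urls}

# Assumed failure mode: the list was converted to a string somewhere,
# so iterating over it yields single characters, starting with "[".
bad_item = {"file_urls": str(urls)}

first_good = urls_seen_by_pipeline(good_item)[0]
first_bad = urls_seen_by_pipeline(bad_item)[0]

print(first_good, urlparse(first_good).scheme)
print(first_bad, repr(urlparse(first_bad).scheme))
```

In the spider itself, the equivalent of the good case would be a direct assignment such as item["file_urls"] = [property_ownership, property_location, property_improvements, property_value], assuming FiledownloadItem declares a file_urls field. Note also that extract_first() returns None when the XPath matches nothing, so it may be worth filtering out None values before assigning.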
