How to zip and clean up the downloaded files after scraping

Date: 2018-12-02 11:16:25

Tags: python scrapy

I have successfully created a Scrapy crawler that downloads its results to a CSV file and pulls the images into the images/full folder.

Now I would like to clean up after the crawl by pulling those files into a zip archive and then deleting the "full" folder and the CSV file.

Here is my approach:

parser_attributes.py:

# -*- coding: utf-8 -*-

# interpret attributes
def gender(i):
    # map a scraped gender label to its numeric attribute code;
    # the combined labels come first, since dicts keep insertion order
    # (Python 3.7+) and matching below is done by substring
    switcher = {
        'damen & herren': 1,
        'herren, unisex': 1,
        'unisex': 1,
        'damen': 2,
        'herren': 6
    }
    # case-insensitive substring match against the known labels
    for k, v in switcher.items():
        if k.lower() in i.lower():
            return v
    return "Invalid: " + i

test.py:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import logging
# from urllib.parse import urlparse
import os
import zipfile
import shutil # zip archive generator
import datetime

# images
from scrapy.pipelines.images import ImagesPipeline

from bid.items import myitem

# import translater for attributes
# would rather use import parser_attributes but could not get this working with e.g. import parser_attributes or import bid.parser_attributes
exec(open("/Volumes/zero/Users/user/test_crawl/bid/bid/spiders/parser_attributes.py").read())


class GetbidSpider(CrawlSpider):
    # create a spider
    # some more code here ...
    rules = (
        Rule(
            LinkExtractor(allow=['rule']), 
            callback='parse_item'
        ),
    )
    def parse_item(self, response):

        ### do something

        return myitem

def cleanup(name):
    # create zip archive with all images inside
    filename = '/Users/user/test_crawl/bid/zip/test_' + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    imagefolder = '/Users/user/test_crawl/bid/images/full'
    shutil.make_archive(filename, 'zip', imagefolder)
    # delete images
    shutil.rmtree(imagefolder)

    # add csv file to the zip archive
    filename_zip = filename + '.zip'
    zip_archive = zipfile.ZipFile(filename_zip, 'a')
    path_to_file = '/Users/user/test_crawl/bid/csv/181201_test.csv'
    zip_archive.write(path_to_file, os.path.basename(path_to_file))  # store the file at the archive root, without folder names
    zip_archive.close()

cleanup('test')

Traceback:

 scrapy crawl test -o csv/181201_test.csv -t csv 
Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 11, in <module>
    sys.exit(execute())
  File "/usr/local/lib/python3.7/site-packages/scrapy/cmdline.py", line 149, in execute
    cmd.crawler_process = CrawlerProcess(settings)
  File "/usr/local/lib/python3.7/site-packages/scrapy/crawler.py", line 249, in __init__
    super(CrawlerProcess, self).__init__(settings)
  File "/usr/local/lib/python3.7/site-packages/scrapy/crawler.py", line 137, in __init__
    self.spider_loader = _get_spider_loader(settings)
  File "/usr/local/lib/python3.7/site-packages/scrapy/crawler.py", line 336, in _get_spider_loader
    return loader_cls.from_settings(settings.frozencopy())
  File "/usr/local/lib/python3.7/site-packages/scrapy/spiderloader.py", line 61, in from_settings
    return cls(settings)
  File "/usr/local/lib/python3.7/site-packages/scrapy/spiderloader.py", line 25, in __init__
    self._load_all_spiders()
  File "/usr/local/lib/python3.7/site-packages/scrapy/spiderloader.py", line 47, in _load_all_spiders
    for module in walk_modules(name):
  File "/usr/local/lib/python3.7/site-packages/scrapy/utils/misc.py", line 71, in walk_modules
    submod = import_module(fullpath)
  File "/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/Users/user/test_crawl/bid/bid/spiders/test.py", line 96, in <module>
    cleanup('test')
  File "/Users/user/test_crawl/bid/bid/spiders/test.py", line 84, in cleanup
    shutil.make_archive(filename, 'zip', imagefolder) 
  File "/usr/local/Cellar/python/3.7.1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/shutil.py", line 792, in make_archive
    os.chdir(root_dir)
FileNotFoundError: [Errno 2] No such file or directory: '/Users/user/test_crawl/bid/images/full'

1 Answer:

Answer 0: (score: 0)

There are two main ways to execute a Scrapy spider:

  • through the scrapy crawl command
  • by running it like a regular Python script (e.g. with CrawlerProcess)

Your code tries to mix the two, which won't work: because cleanup() is called at module level, it runs as soon as Scrapy imports the spider module, before anything has been crawled, which is exactly the FileNotFoundError in your traceback.

I can think of two ways to do what you want:

  • Create a custom pipeline and put your zipping/deleting logic in its close_spider method
  • Create a custom storage/exporter that stores the information in a zip file

The former is probably simpler (see the sketches after this paragraph), but the latter avoids having to zip and delete files after the scraping process has finished.
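
A minimal sketch of the pipeline approach. The class name, the priority value and the hard-coded paths are assumptions taken from your question, so adjust them to your project:

# bid/pipelines.py -- a sketch, assuming the paths from the question
import datetime
import os
import shutil
import zipfile


class ZipCleanupPipeline(object):

    def close_spider(self, spider):
        # runs once, after the spider has finished crawling
        stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
        filename = '/Users/user/test_crawl/bid/zip/test_' + stamp
        imagefolder = '/Users/user/test_crawl/bid/images/full'

        # zip the downloaded images, then remove the source folder
        shutil.make_archive(filename, 'zip', imagefolder)
        shutil.rmtree(imagefolder)

        # append the CSV to the archive, stored without its folder path
        path_to_file = '/Users/user/test_crawl/bid/csv/181201_test.csv'
        with zipfile.ZipFile(filename + '.zip', 'a') as archive:
            archive.write(path_to_file, os.path.basename(path_to_file))

Activate it in settings.py:

ITEM_PIPELINES = {
    'bid.pipelines.ZipCleanupPipeline': 900,
}

One caveat: the feed exporter finalizes the CSV on the spider_closed signal, which may fire after the pipeline's close_spider, so the CSV might not be fully written yet at that point; if you run into that, do the CSV step from a spider_closed handler instead, or use the storage approach below.

For the second option, here is a rough sketch built on scrapy.extensions.feedexport.BlockingFeedStorage; the zip URI scheme and the class name are made up for this example:

# bid/storages.py -- a sketch of a feed storage that writes into a zip
import zipfile
from urllib.parse import urlparse

from scrapy.extensions.feedexport import BlockingFeedStorage


class ZipFeedStorage(BlockingFeedStorage):

    def __init__(self, uri):
        # uri is the FEED_URI, e.g. 'zip:///Users/user/test_crawl/bid/zip/test.zip'
        self.path = urlparse(uri).path

    def _store_in_thread(self, file):
        # 'file' is the temporary file Scrapy wrote the exported feed into
        file.seek(0)
        with zipfile.ZipFile(self.path, 'a') as archive:
            archive.writestr('feed.csv', file.read())

Register the scheme in settings.py and point the feed at it:

FEED_STORAGES = {'zip': 'bid.storages.ZipFeedStorage'}

scrapy crawl test -o zip:///Users/user/test_crawl/bid/zip/test.zip -t csv

Note that this only covers the CSV feed; the images would still have to be added to the archive separately, e.g. from the pipeline above.

Regarding this part of your code: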

# import translater for attributes
# would rather use import parser_attributes but could not get this working with e.g. import parser_attributes or import bid.parser_attributes
exec(open("/Volumes/zero/Users/user/test_crawl/bid/bid/spiders/parser_attributes.py").read())

You just need to use the correct absolute import path.
Something like this should accomplish the same thing, but in a far less scary way:

from bid.spiders.parser_attributes import gender
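
For that import to work, bid has to be importable as a package, which is already the case in the standard scrapy startproject layout (a sketch of the inner project package; parser_attributes.py is your addition):

bid/
├── __init__.py
├── items.py
├── pipelines.py
├── settings.py
└── spiders/
    ├── __init__.py
    ├── parser_attributes.py
    └── test.py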