CSV files run into problems when the file is large

Time: 2018-12-27 17:39:24

Tags: python file csv sorting

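I have a large tsv file (~2.5 GB). I iterate over each line, where every line has 6 fields. I take the first field of each line and append the line to a csv file that is named after that first field. The goal is to end up with the master tsv sorted into separate csv files, one per value of the first field.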

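This works on a small file, but when I run it on the large file the IPython console never finishes. The file I am saving to looks as if it is being filled, but when I open it nothing is shown.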


2 answers:

Answer 0: (score: 2)

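Your code is very inefficient in the sense that it opens an output file and appends data to it for every line/row of the input file it processes. With an input file this huge, that happens an enormous number of times, and the OS calls needed to do it are relatively slow.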

Also, I noticed at least one bug in your code, namely the line:

save_path += cik + ".csv"

which just keeps making save_path longer and longer... and that is not what's needed.
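To see why that line is a problem, here is a minimal sketch. The directory name and the two keys are made-up values for illustration; only the `+=` pattern comes from the line quoted above.

```python
import os

base_dir = "data-sorted"             # hypothetical output directory
ciks = ["0000000001", "0000000002"]  # hypothetical keys

# Buggy pattern: the same variable is extended on every iteration,
# so each path contains all the earlier file names as well.
save_path = base_dir
buggy_paths = []
for cik in ciks:
    save_path += cik + ".csv"
    buggy_paths.append(save_path)

# Fixed pattern: build each path from the unchanged base directory.
fixed_paths = [os.path.join(base_dir, cik + ".csv") for cik in ciks]
```

With the buggy pattern the second path already contains the first file name, so every row ends up in a different, ever longer file name instead of one file per key.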

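Anyway, here's something that should run faster, although processing a file that large will likely still take a fairly long time. It speeds the process up by caching intermediate results. It does this by opening the different output csv files and creating their corresponding csv.writer objects as infrequently as possible: the first time they are needed, and again only if they were closed because the cache reached its maximum length.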

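Note that the cache may consume a lot of memory itself, depending on how many unique csv output files there are and how many of them can be open at the same time, but using a lot of memory is what makes it run faster. You'll need to play around and manually adjust the value of MAX_OPEN to find the best trade-off between speed and memory usage, all while staying below your OS's limit on how many files may be open at one time.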
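One way to check that OS limit, as a rough sketch for Unix-like systems only: the standard library `resource` module (not available on Windows) reports the per-process limit on open file descriptors. The names `HEADROOM` and `max_open_ceiling` and the margin of 32 are assumptions made up for this example.

```python
import resource  # Unix-only standard library module

# Soft limit: what this process may have open right now.
# Hard limit: the ceiling the soft limit could be raised to.
soft_limit, hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)

# Leave room for stdin/stdout/stderr, the input file, and whatever else
# the interpreter keeps open. The margin is an arbitrary guess.
HEADROOM = 32

if soft_limit == resource.RLIM_INFINITY:
    max_open_ceiling = None  # no fixed ceiling reported
else:
    max_open_ceiling = max(1, soft_limit - HEADROOM)

print("soft limit:", soft_limit, "-> keep MAX_OPEN at or below:", max_open_ceiling)
```

This only gives an upper bound; the best value for speed versus memory still has to be found by experiment.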

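Also note that it might be possible to make it work even more efficiently by choosing more intelligently which existing file entry to close, rather than just picking an (open) one at random. However, whether doing that would really help depends on whether there's any sort of advantageous grouping or other order to the data in the input file.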

import csv
import os
import random

class CSVWriterCache(dict):
    """ Dict subclass to cache pairs of csv files and associated
        csv.writers. When a specified maximum number of them already
        exist, a random one is closed, but an entry for it is retained
        and marked "closed" so it can be re-opened in append mode
        later if it's ever referenced again. This limits the number of
        files open at any given time.
    """
    _CLOSED = None  # Marker to indicate that file has been seen before.

    def __init__(self, max_open, **kwargs):
        self.max_open = max_open
        self.cur_open = 0  # Number of currently opened csv files.
        self.csv_kwargs = kwargs  # keyword args for csv.writer.

    # Adding the next two non-dict special methods makes the class a
    # context manager which allows it to be used in "with" statements
    # to do automatic clean-up.
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.close()

    def __getitem__(self, k):
        if k not in self:
            return self.__missing__(k)
        entry = self.get(k)  # dict.get() bypasses this __getitem__.
        if entry is self._CLOSED:  # Needs to be re-opened in append mode.
            entry = self._open(k, 'a')
        return entry

    def __missing__(self, csv_file_path):
        """ Create a csv.writer corresponding to the file path and add it
            and the file to the cache.
        """
        return self._open(csv_file_path, 'w')

    def _open(self, csv_file_path, mode):
        """ Open the file in the given mode, first closing a random open
            one if the limit has been reached, and cache the pair.
        """
        if self.cur_open >= self.max_open:  # Limit?
            # Randomly choose a cached entry with a previously seen
            # file path that is still open (not _CLOSED). The associated
            # file is then closed, but the entry for the file path is
            # left in the dictionary so it can be recognized as having
            # been seen before and be re-opened in append mode.
            open_paths = [path for path, entry in self.items()
                          if entry is not self._CLOSED]
            rand_entry = random.choice(open_paths)
            self.get(rand_entry)[1].close()
            self.cur_open -= 1
            self[rand_entry] = self._CLOSED  # Mark as previously seen but closed.

        csv_file = open(csv_file_path, mode, newline='')
        csv_writer = csv.writer(csv_file, **self.csv_kwargs)
        self.cur_open += 1

        # Add pair to cache.
        super().__setitem__(csv_file_path, (csv_writer, csv_file))
        return csv_writer, csv_file

    # Added, non-standard dict method.
    def close(self):
        """ Close all the opened files in the cache and clear it out. """
        for key, entry in self.items():
            if entry is not self._CLOSED:
                entry[1].close()
                self[key] = self._CLOSED  # Not strictly necessary.
                self.cur_open -= 1  # For sanity check at end.
        self.clear()
        assert(self.cur_open == 0)  # Sanity check.

if __name__ == '__main__':
    file_path = "./master.tsv"
    save_path = "./data-sorted"
    MAX_OPEN  = 1000  # Number of opened files allowed (max is OS-dependent).
#    MAX_OPEN  = 2  # Use small value for testing.

    # Create output directory if it does not exist.
    if os.path.exists(save_path):
        if not os.path.isdir(save_path):
            raise RuntimeError("Path {!r} exists, but isn't a directory".format(save_path))
    else:
        print('Creating directory: {!r}'.format(save_path))
        os.makedirs(save_path)

    # Process the input file using a cache of csv.writers.
    with open(file_path, 'r') as masterfile, \
         CSVWriterCache(MAX_OPEN, quoting=csv.QUOTE_ALL) as csv_writer_cache:
        for line in masterfile:
            line_split = line.rstrip().split("|")
            cik = line_split[0].zfill(10)

            save_file_path = os.path.join(save_path, cik + ".csv")
            writer = csv_writer_cache[save_file_path][0]
            writer.writerow(line_split)

    print('{!r} file processing completed'.format(os.path.basename(file_path)))
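The remark above about choosing more intelligently which file to close could be sketched as a least-recently-used (LRU) policy. This is only an illustration under assumptions: the class `LRUWriterCache` and its method `writer_for` are hypothetical names, not part of the code above, and LRU only pays off if lines with the same key tend to appear near each other in the input.

```python
import csv
from collections import OrderedDict


class LRUWriterCache:
    """Keep at most max_open csv files open; when the limit is reached,
    close the least recently used one. Hypothetical helper."""

    def __init__(self, max_open, **csv_kwargs):
        self.max_open = max_open
        self.csv_kwargs = csv_kwargs      # keyword args for csv.writer
        self._open = OrderedDict()        # path -> (writer, file), oldest first
        self._seen = set()                # paths already created in this run

    def writer_for(self, path):
        if path in self._open:
            self._open.move_to_end(path)  # mark as most recently used
            return self._open[path][0]
        if len(self._open) >= self.max_open:
            # Evict the entry that has gone unused the longest.
            _, (_, old_file) = self._open.popitem(last=False)
            old_file.close()
        # Truncate on first use, append when re-opening after eviction.
        mode = 'a' if path in self._seen else 'w'
        csv_file = open(path, mode, newline='')
        writer = csv.writer(csv_file, **self.csv_kwargs)
        self._open[path] = (writer, csv_file)
        self._seen.add(path)
        return writer

    def close(self):
        for _, csv_file in self._open.values():
            csv_file.close()
        self._open.clear()
```

In the main loop this would be used as `cache.writer_for(save_file_path).writerow(line_split)`, followed by `cache.close()` once the input has been read.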

Answer 1: (score: 0)

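Assuming you have enough RAM, you may be better off sorting the file in memory, e.g. into a dictionary, and writing it to disk in one go. If I/O is indeed your bottleneck, you should gain a lot from opening each output file only once.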

import csv
from collections import defaultdict
from os.path import join

file_path = ".../master.tsv"

# Group the lines in memory by their zero-padded first field.
data = defaultdict(list)
with open(file_path, 'r') as masterfile:
    for line in masterfile:
        line = line.rstrip("\n")  # keep the newline out of the last field
        cik = line.split("|", 1)[0].zfill(10)
        data[cik].append(line)

# Each output file is opened exactly once.
for cik, lines in data.items():
    save_path = join(".../data-sorted", cik + ".csv")

    with open(save_path, 'w', newline='') as savefile:
        wr = csv.writer(savefile, quoting=csv.QUOTE_ALL)
        for line in lines:
            wr.writerow(line.split("|"))

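There may not be enough memory to load the entire file. In that case, you can dump it in chunks, which, if they are big enough, should still end up saving you a significant amount of I/O. The chunking method below is very quick and dirty.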

import csv
from collections import defaultdict
from itertools import groupby
from os.path import join

chunk_size = 10000  # units of lines

file_path = ".../master.tsv"

with open(file_path, 'r') as masterfile:
    # Lines whose index shares the same quotient form one chunk.
    for _, chunk in groupby(enumerate(masterfile),
                            key=lambda item: item[0] // chunk_size):
        data = defaultdict(list)
        for _, line in chunk:  # each item is an (index, line) pair
            line = line.rstrip("\n")  # keep the newline out of the last field
            cik = line.split("|", 1)[0].zfill(10)
            data[cik].append(line)
        for cik, lines in data.items():
            save_path = join(".../data-sorted", cik + ".csv")

            # Append, since a key can reappear in a later chunk.
            with open(save_path, 'a', newline='') as savefile:
                wr = csv.writer(savefile, quoting=csv.QUOTE_ALL)
                for line in lines:
                    wr.writerow(line.split("|"))
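As a possible alternative way to form the chunks, here is a sketch that uses `itertools.islice` to read up to `chunk_size` lines at a time, so each item is a plain line rather than an (index, line) pair. The function name `split_in_chunks` and its parameters are hypothetical; the output directory is assumed to exist already.

```python
import csv
import os
from collections import defaultdict
from itertools import islice


def split_in_chunks(file_path, out_dir, chunk_size=10000):
    """Group lines by their first '|'-separated field, one chunk at a
    time, appending each group to <out_dir>/<key>.csv."""
    with open(file_path, 'r') as masterfile:
        while True:
            # Pull at most chunk_size lines from the file iterator.
            chunk = list(islice(masterfile, chunk_size))
            if not chunk:
                break  # end of input
            data = defaultdict(list)
            for line in chunk:
                fields = line.rstrip("\n").split("|")
                data[fields[0].zfill(10)].append(fields)
            for cik, rows in data.items():
                save_path = os.path.join(out_dir, cik + ".csv")
                # Append mode: a key may show up again in a later chunk.
                with open(save_path, 'a', newline='') as savefile:
                    wr = csv.writer(savefile, quoting=csv.QUOTE_ALL)
                    wr.writerows(rows)
```

Because the files are opened in append mode, output left over from an earlier run would be extended rather than replaced, so the output directory should start out empty.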