
Question

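So I currently have a directory, call it /mydir, that contains 36 CSV files, each 2.1 GB and all with the same dimensions. I want to read them into pandas, concatenate them side by side (so the number of rows stays the same), and then write the resulting DataFrame out as one large CSV. My code works for combining some of the files but hits a memory error after a certain point. I am wondering whether there is a more efficient way to do this.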

import os
import pandas as pd

df = pd.DataFrame()
for file in os.listdir('/mydir'):
    df = pd.concat([df, pd.read_csv(os.path.join('/mydir', file), dtype='float')], axis=1)
df.to_csv('/mydir/file.csv')

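It was suggested that I break the job into smaller pieces: combine the files in groups of 6, then combine those groups in turn. I don't know whether that is a valid solution that would avoid the memory error.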

Edit: view of the directory:

-rw-rw---- 1 m2762 2.1G Jul 11 10:35 2010.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:32 2001.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:28 1983.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:21 2009.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:21 1991.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:07 2000.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:06 1982.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:01 1990.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:01 2008.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:55 1999.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:54 1981.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:42 2007.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:42 1998.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:42 1989.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:42 1980.csv

Answer 1

Chunk Them All!

from glob import glob
import os

import pandas as pd

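# grab files (sorted, because glob does not guarantee an order,
# and the file order decides the column order of the output)
files = sorted(glob('./[0-9][0-9][0-9][0-9].csv'))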

# simplify the file reading
# notice this will create a generator
# that goes through chunks of the file
# at a time
def read_csv(f, n=100):
    return pd.read_csv(f, index_col=0, chunksize=n)

# simplify the concatenation
def concat(lot):
    return pd.concat(lot, axis=1)

# simplify the writing
# make sure mode is append and header is off
# if file already exists
def to_csv(f, df):
    if os.path.exists(f):
        mode = 'a'
        header = False
    else:
        mode = 'w'
        header = True
    df.to_csv(f, mode=mode, header=header)

# Fun stuff! zip will take the next element of the generator
# for each generator created for each file
# concat one chunk at a time and write
for lot in zip(*[read_csv(f, n=10) for f in files]):
    to_csv('out.csv', concat(lot))
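
To check the chunked approach on something small before running it on the real files, here is a minimal sketch under stated assumptions. The function name combine_side_by_side and the toy files are hypothetical, not part of the answer above. It also assumes, as the answer does, that every file has the same number of rows and an index in its first column. Unlike the os.path.exists check above, it always writes the first block in 'w' mode, so a stale output file from an earlier run gets overwritten rather than appended to.

```python
import os
import tempfile

import pandas as pd


def combine_side_by_side(paths, out_path, chunksize=1000):
    """Concatenate CSVs column-wise, holding only `chunksize` rows of each in memory."""
    readers = [pd.read_csv(p, index_col=0, chunksize=chunksize) for p in paths]
    try:
        # zip pulls the next chunk from every reader in lockstep
        for i, lot in enumerate(zip(*readers)):
            block = pd.concat(lot, axis=1)
            # header only with the first block; later blocks are appended
            block.to_csv(out_path, mode="w" if i == 0 else "a", header=(i == 0))
    finally:
        for reader in readers:
            reader.close()


# Hypothetical toy data standing in for the real 2.1 GB files.
tmpdir = tempfile.mkdtemp()
paths = []
for year in (1980, 1981, 1982):
    toy = pd.DataFrame({
        f"a_{year}": [float(year + i) for i in range(10)],
        f"b_{year}": [float(year - i) for i in range(10)],
    })
    path = os.path.join(tmpdir, f"{year}.csv")
    toy.to_csv(path)
    paths.append(path)

out_path = os.path.join(tmpdir, "out.csv")
combine_side_by_side(paths, out_path, chunksize=3)

# Compare against the all-in-memory result, which is fine for toy data.
combined = pd.read_csv(out_path, index_col=0)
expected = pd.concat([pd.read_csv(p, index_col=0) for p in paths], axis=1)
```

Peak memory then scales with chunksize times the total number of columns, not with the full file size, so the chunk size can be tuned to whatever the machine tolerates.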

Answer 2

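Assuming, as MaxU's answer does, that all files have the same number of rows, and assuming further that minor CSV details such as quoting are handled the same way in all files, you don't need pandas for this. Plain file readline calls give you strings that you can concatenate and write out. This also assumes you can supply the number of rows. Something like this code: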

    numrows = 999 # whatever.  Probably pass as argument to function or on cmdline
    outfile = open('myout.csv', 'w')
    infile_names = ['file01.csv',
                    'file02.csv',
                    # .. (remaining file names elided)
                    'file36.csv']

    # open all the input files
    infiles = []
    for fname in infile_names:
        infiles.append(open(fname))

    for i in range(numrows):
        # read a line from each input file, strip its line ending,
        # and join the pieces with commas into one output row
        # (str is immutable, so the last comma cannot be replaced in place)
        out_csv = ','.join(infile2read.readline().strip()
                           for infile2read in infiles) + '\n'

        # write this row's data out to the output file
        outfile.write(out_csv)

    #close the files
    for f in infiles:
        f.close()
    outfile.close()
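
A variation on the same idea, offered as a sketch rather than as the answer's own method: iterating over the open files with zip removes the need to know numrows in advance, and contextlib.ExitStack closes every file even if an error occurs partway through. The function name paste_csv_lines and the toy files are hypothetical. It keeps the answer's assumptions that all files have the same number of rows and consistent quoting.

```python
import os
import tempfile
from contextlib import ExitStack


def paste_csv_lines(in_paths, out_path):
    """Join line k of every input file with commas into line k of the output."""
    with ExitStack() as stack:
        infiles = [stack.enter_context(open(p, newline="")) for p in in_paths]
        outfile = stack.enter_context(open(out_path, "w", newline=""))
        # zip stops at the shortest file, so equal row counts are assumed
        for lines in zip(*infiles):
            row = ",".join(line.rstrip("\r\n") for line in lines)
            outfile.write(row + "\n")


# Hypothetical toy files standing in for file01.csv, file02.csv, and so on.
tmpdir = tempfile.mkdtemp()
contents = {
    "file01.csv": "a1,b1\n1,2\n3,4\n",
    "file02.csv": "a2,b2\n5,6\n7,8\n",
}
in_paths = []
for name, text in contents.items():
    path = os.path.join(tmpdir, name)
    with open(path, "w", newline="") as fh:
        fh.write(text)
    in_paths.append(path)

out_path = os.path.join(tmpdir, "myout.csv")
paste_csv_lines(in_paths, out_path)

with open(out_path, newline="") as fh:
    result_text = fh.read()
```

Only one line per input file is held in memory at a time, so memory use stays small regardless of file size.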

Pandas: combine multiple CSVs and output as one large file
