所以我目前有一个目录,我们称之为 / mydir ,它包含36个CSV文件,每个2.1 GB并具有相同的尺寸。它们都是相同的大小,我想将它们读入熊猫,并排将它们连接在一起(因此行数保持不变),然后将结果数据帧输出为一个大的csv。我的代码用于组合其中的一些但在某个点之后达到内存错误。我想知道是否有一种更有效的方法来做到这一点。
df = pd.DataFrame()
for file in os.listdir('/mydir'):
df.concat([df, pd.read_csv('/mydir' + file, dtype = 'float)], axis = 1)
df.to_csv('mydir/file.csv')
建议我把它分成小块,将文件组合成6组,然后依次将这些文件组合在一起,但我不知道这是否是一个有效的解决方案,可以避免内存错误问题
编辑:目录视图:
-rw-rw---- 1 m2762 2.1G Jul 11 10:35 2010.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:32 2001.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:28 1983.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:21 2009.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:21 1991.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:07 2000.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:06 1982.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:01 1990.csv
-rw-rw---- 1 m2762 2.1G Jul 11 10:01 2008.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:55 1999.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:54 1981.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:42 2007.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:42 1998.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:42 1989.csv
-rw-rw---- 1 m2762 2.1G Jul 11 09:42 1980.csv
答案 0 :(得分:1)
from glob import glob
import os
# grab files
files = glob('./[0-9][0-9][0-9][0-9].csv')
# simplify the file reading
# notice this will create a generator
# that goes through chunks of the file
# at a time
def read_csv(f, n=100):
return pd.read_csv(f, index_col=0, chunksize=n)
# simplify the concatenation
def concat(lot):
return pd.concat(lot, axis=1)
# simplify the writing
# make sure mode is append and header is off
# if file already exists
def to_csv(f, df):
if os.path.exists(f):
mode = 'a'
header = False
else:
mode = 'w'
header = True
df.to_csv(f, mode=mode, header=header)
# Fun stuff! zip will take the next element of the generator
# for each generator created for each file
# concat one chunk at a time and write
for lot in zip(*[read_csv(f, n=10) for f in files]):
to_csv('out.csv', concat(lot))
答案 1 :(得分:0)
假设MaxU的答案是所有文件都具有相同的行数,并且假设在所有文件中进一步假设像引用这样的微小CSV差异,则不需要使用Pandas执行此操作。常规文件readlines
将为您提供可以连接和写出的字符串。进一步假设您可以提供行数。类似这样的代码:
numrows = 999 # whatever. Probably pass as argument to function or on cmdline
out_file = open('myout.csv','w')
infile_names = [ 'file01.csv',
'file02.csv',
..
'file36.csv' ]
# open all the input files
infiles = []
for fname in infile_names:
infiles.append(open(fname))
for i in range(numrows):
# read a line from each input file and add it to the output string
out_csv=''
for infile2read in infiles:
out_csv += infile2read.readline().strip() + ','
out_csv[-1] = '\n' # replace final comma with newline
# write this rows data out to the output file
outfile.write(out_csv)
#close the files
for f in infiles:
f.close()
outfile.close()