I am trying to read 349 CSV files, all with the same columns, totalling about 15 GB, and combine them into a single dataframe. However, I keep getting a MemoryError, so I tried sleeping for 10-20 seconds after every 10 files. My code below manages to read them into the dfs list, although it still crashes sometimes.
import glob
import os
import time
import pandas as pd

path = r"C:\path\*\certificates.csv"

files = []
for filename in glob.iglob(path, recursive=True):
    files.append(filename)
    #print(filename)

dfs = []
sleep_for = 20
counter = 0

for file in files:
    counter += 1
    if counter % 10 == 0:
        time.sleep(sleep_for)
        print("\nSleeping for " + str(sleep_for) + " seconds.\nProceeding to append df " + str(counter))
        df = pd.read_csv(file)
        df = df[keep_cols]  # A list of cols to keep - same in every file
        dfs.append(df)
    else:
        df = pd.read_csv(file)
        df = df[domestic_keep_cols]
        dfs.append(df)
        print('Appending df ' + str(counter))

df_combined = pd.concat(dfs)
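For reference, a minimal sketch of the same reading loop that passes the column list straight to read_csv via its usecols parameter (assuming keep_cols is the list of column names defined elsewhere; the names below are placeholders), so the unused columns are skipped while parsing instead of being dropped afterwards:

import glob
import pandas as pd

path = r"C:\path\*\certificates.csv"
keep_cols = ["col_a", "col_b"]  # hypothetical placeholders for the real column names

dfs = []
for filename in glob.iglob(path, recursive=True):
    # Only the listed columns are parsed, so each frame is smaller from the start
    df = pd.read_csv(filename, usecols=keep_cols)
    dfs.append(df)

df_combined = pd.concat(dfs, ignore_index=True)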
However, when I try pd.concat on the dfs list, I get a MemoryError. I tried to work around this by appending 10 dfs at a time:
lower_limit = 0
upper_limit = 10
counter = 0

while counter < len(dfs):
    counter += 1
    target_dfs = dfs[lower_limit:upper_limit]
    if counter % 10 == 0:
        lower_limit += 10
        upper_limit += 10
        target_dfs = dfs[lower_limit:upper_limit]
        for each_df in target_dfs:
            df_combined = df_combined.append(each_df)
    else:
        for each_df in target_dfs:
            df_combined = df_combined.append(each_df)
However, this also throws a MemoryError. Is there a more efficient way to do this, or am I doing something wrong that causes the MemoryError? Maybe pandas is the wrong tool for this job?
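For completeness, here is a minimal sketch of the 10-at-a-time idea written with pd.concat over slices of dfs instead of repeated append calls (the chunk size of 10 just mirrors the loop above); I am not sure whether this actually avoids the MemoryError:

chunk_size = 10
merged_parts = []
for start in range(0, len(dfs), chunk_size):
    # Merge each batch of 10 frames into one intermediate frame
    batch = dfs[start:start + chunk_size]
    merged_parts.append(pd.concat(batch, ignore_index=True))

df_combined = pd.concat(merged_parts, ignore_index=True)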