Handling MemoryError in pandas

Time: 2019-02-19 12:16:23

Tags: python python-3.x pandas

I am trying to read 349 CSV files, all with the same columns and roughly 15 GB in total, and merge them into a single DataFrame. However, I keep getting a MemoryError, so I tried sleeping for 10-20 seconds after every 10 files. My code below manages to read the files into a list of DataFrames, dfs, although it sometimes crashes.

import glob
import time
import pandas as pd

path = r"C:\path\*\certificates.csv"
files = list(glob.iglob(path, recursive=True))

dfs = []
sleep_for = 20
for counter, file in enumerate(files, start=1):
    if counter % 10 == 0:
        # Pause every 10 files in the hope of easing memory pressure.
        print("\nSleeping for " + str(sleep_for) + " seconds.\nProceeding to append df " + str(counter))
        time.sleep(sleep_for)
    df = pd.read_csv(file)
    df = df[keep_cols]  # keep_cols: a list of columns to keep - the same in every file
    dfs.append(df)
    print('Appending df ' + str(counter))
df_combined = pd.concat(dfs)
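
As an aside, peak memory per file can be reduced by having read_csv parse only the needed columns instead of loading everything and slicing afterwards. A minimal sketch, assuming the same keep_cols list as above (the column names below are hypothetical placeholders):

import glob
import pandas as pd

keep_cols = ["col_a", "col_b"]  # hypothetical placeholders for the real list
dfs = []
for filename in glob.iglob(r"C:\path\*\certificates.csv", recursive=True):
    # usecols tells the parser to materialize only these columns, so the
    # unwanted ones never occupy memory.
    dfs.append(pd.read_csv(filename, usecols=keep_cols))
df_combined = pd.concat(dfs, ignore_index=True)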

However, when I try pd.concat on the dfs list, I get a MemoryError. I tried to work around this by appending 10 DataFrames at a time:

chunk_size = 10
df_combined = pd.DataFrame()
for start in range(0, len(dfs), chunk_size):
    # Fold the next batch of up to 10 frames into the running result so
    # that no single concat has to handle all 349 frames at once.
    batch = dfs[start:start + chunk_size]
    df_combined = pd.concat([df_combined] + batch)

However, this also throws a MemoryError. Is there a more efficient way to do this, or am I doing something wrong that causes the MemoryError? Perhaps pandas is the wrong tool for this job?
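
For reference, one way to avoid ever holding all 349 frames in RAM is to stream each file's rows straight into a single output file on disk, so only one file is in memory at a time. A minimal sketch, not the original code; the output path and column names are hypothetical placeholders:

import glob
import pandas as pd

keep_cols = ["col_a", "col_b"]  # hypothetical placeholders for the real list
out_path = r"C:\path\combined.csv"  # hypothetical output location
first = True
for filename in glob.iglob(r"C:\path\*\certificates.csv", recursive=True):
    df = pd.read_csv(filename, usecols=keep_cols)
    # Write the first frame with a header, then append the rest without
    # one; only a single file's rows are ever held in memory.
    df.to_csv(out_path, mode="w" if first else "a", header=first, index=False)
    first = False

The combined CSV can then be re-read in one pass, or in pieces via the chunksize argument of read_csv, without building 349 intermediate DataFrames.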
