Question

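I am using pandas to do an outer merge on a set of about 1000-2000 CSV files. Each CSV file has an identifier column id that is shared between all the CSV files, but each file has its own unique set of 3-5 columns. There are roughly 20,000 unique id rows in each file. All I want to do is merge these together, bringing all the new columns together and using the id column as the merge index.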

I do it using a simple merge call:

merged_df = first_df # first csv file dataframe
for next_filename in filenames:
   # load up the next df
   # ...
   merged_df = merged_df.merge(next_df, on=["id"], how="outer")

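The problem is that with nearly 2000 CSV files, pandas throws a MemoryError in the merge operation. I'm not sure whether this is a limitation caused by a problem in the merge operation itself.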

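The final dataframe would have 20,000 rows and roughly (2000 x 3) = 6000 columns. This is large, but not large enough to consume all the memory on the computer I am using, which has over 20 GB of RAM. Is this size too much for pandas to handle? Should I be using something like sqlite instead? Is there something I can change in the merge operation to make it work at this scale?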

Thanks.

Answer 1

I think you'll get better performance using a concat (which acts like an outer join):

dfs = (pd.read_csv(filename).set_index('id') for filename in filenames)
merged_df = pd.concat(dfs, axis=1)

This means you perform a single merge operation in total, rather than one per file.
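To make the outer-join behaviour concrete, here is a minimal sketch on three small in-memory frames. The frames and their column names (a, b, c) are made up for illustration and stand in for the CSV files; it also assumes each id appears at most once per frame, as described in the question.

```python
import pandas as pd

# Hypothetical stand-ins for three of the CSV files, already indexed by id
frames = [
    pd.DataFrame({"id": [1, 2, 3], "a": [10, 20, 30]}).set_index("id"),
    pd.DataFrame({"id": [2, 3, 4], "b": [0.2, 0.3, 0.4]}).set_index("id"),
    pd.DataFrame({"id": [1, 4], "c": ["x", "y"]}).set_index("id"),
]

# A single call aligns every frame on the union of their indexes
# (join="outer" is the default, written out here for clarity)
merged = pd.concat(frames, axis=1, join="outer")

# Sort by id so the row order does not depend on the input order
merged = merged.sort_index()

print(merged)
```

Ids that are missing from a given frame end up as NaN in that frame's columns, which is what an outer merge on id would produce as well.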

Answer 2

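I ran into the same error in 32-bit Python when using read_csv with a 1 GB file. Try the 64-bit version; hopefully that will solve the MemoryError problem.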
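If you are not sure which build you are running, a quick check along these lines should tell you. It uses only the standard library; the variable names are just for illustration.

```python
import struct
import sys

# Size of a C pointer in bits: 32 on a 32-bit build, 64 on a 64-bit build
pointer_bits = struct.calcsize("P") * 8

# sys.maxsize is 2**31 - 1 on 32-bit builds and 2**63 - 1 on 64-bit builds
is_64bit = sys.maxsize > 2**32

print(f"{pointer_bits}-bit Python")
```

A 32-bit process can only address a few GB regardless of how much RAM the machine has, so the 20 GB mentioned in the question would not help on a 32-bit build.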

Answer 3

pd.concat also seems to run out of memory for large dataframes. One option is to convert the dfs to matrices and concatenate those.

import numpy as np
import pandas as pd

def concat_df_by_np(df1, df2):
    """
    Accepts two dataframes, converts each to a NumPy array, concatenates them
    horizontally and uses the index of the first dataframe. This is not a concat
    by index but simply by position, so both dataframes should share the same index.
    """
    # DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() replaces it.
    # np.concatenate already returns a new array, so no deepcopy is needed.
    dfout = pd.DataFrame(
        np.concatenate((df1.to_numpy(), df2.to_numpy()), axis=1),
        index=df1.index,
        columns=np.concatenate([df1.columns, df2.columns]),
    )
    # Index.equals compares labels and order without raising on length mismatch
    if not df1.index.equals(df2.index):
        # logging.warning('Indices in concat_df_by_np are not the same')
        print('Indices in concat_df_by_np are not the same')

    return dfout

However, be aware that this function is not a join but a horizontal append that ignores the indices.
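A small sketch of what that means in practice, using two made-up frames whose indices hold the same labels in a different order. Appending by position pairs up the wrong rows; reordering the second frame with reindex first (my suggestion, not part of the function above) makes the positional append line up by label.

```python
import numpy as np
import pandas as pd

# Hypothetical frames: same labels, different row order
left = pd.DataFrame({"a": [1, 2, 3]}, index=["x", "y", "z"])
right = pd.DataFrame({"b": [30, 10, 20]}, index=["z", "x", "y"])

# Positional append: row i of left is paired with row i of right, labels ignored
by_position = pd.DataFrame(
    np.concatenate((left.to_numpy(), right.to_numpy()), axis=1),
    index=left.index,
    columns=list(left.columns) + list(right.columns),
)

# Reorder right to match left's index, then append by position
aligned = right.reindex(left.index)
by_label = pd.DataFrame(
    np.concatenate((left.to_numpy(), aligned.to_numpy()), axis=1),
    index=left.index,
    columns=list(left.columns) + list(aligned.columns),
)

print(by_position)  # row "x" is paired with the value that belongs to "z"
print(by_label)     # every row is paired with its own label's value
```

Note that reindex fills labels missing from the second frame with NaN, and that concatenating arrays of different dtypes upcasts everything to a common dtype, so this approach fits best when the columns are all numeric.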

MemoryError on large merges with pandas in Python
