Question

I get MemoryError: cannot allocate memory for array when checking for duplicates in a data frame with df.duplicated() in Python 3.6.4.

df has about 150,000 rows and 208 columns, and loading the data into it works without any problem (using the chunked read below).

import pandas as pd

# Read the CSV in chunks of 20,000 rows and collect the pieces
myList = []
for chunk in pd.read_csv(filename, header=0, low_memory=False, chunksize=20000):
    myList.append(chunk)

dfMain = pd.concat(myList, axis=0)
dfMain.index.name = 'Index'

print(dfMain.shape)
Out: (151982, 208)

So far, so good.

# Mark each row True/False depending on whether it is duplicated, and put the result into a new df
dfDup1 = pd.DataFrame(dfMain.duplicated(keep=False))  # keep=False flags every copy of a duplicate

This is where the error occurs: MemoryError: cannot allocate memory for array, and the script stops.

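Unfortunately, reducing the number of columns is not an option; I need to check for duplicates across all variables (although I did drop 150 variables as a test, and the problem persisted). I also need to export the duplicated rows to a df / csv, so I cannot use drop_duplicates() at this stage.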

The computer has plenty of RAM (64 GB), but Python / pandas uses only a small fraction of it.

Any help would be greatly appreciated.

Answer 1

The problem here was that I was running 32-bit Python instead of 64-bit. Thanks to abrnert for helping to solve this.
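
For anyone hitting the same error, one way to confirm which build is running is to look at the interpreter's pointer size. This is only a minimal sketch: the helper name interpreter_bits is made up for illustration, and it uses nothing beyond the standard library. A 32-bit process has at most 4 GB of address space no matter how much RAM is installed, which would explain why only a small fraction of the 64 GB was ever used.

```python
import struct
import sys


def interpreter_bits():
    """Return the pointer width, in bits, of the running Python interpreter."""
    # struct.calcsize("P") is the size in bytes of a C pointer (void *)
    return struct.calcsize("P") * 8


bits = interpreter_bits()
print(f"Python {sys.version_info.major}.{sys.version_info.minor} is a {bits}-bit build")

if bits == 32:
    # Hypothetical hint for the reader, not a pandas message
    print("32-bit interpreter: large allocations may fail even with plenty of free RAM")
```

If this reports 32, installing a 64-bit Python (and reinstalling pandas under it) is probably the simplest thing to try before changing the duplicate check itself.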
