
KeyError: MemoryError() from dropna() on a pandas DataFrame loaded from CSV

Time: 2019-03-20 13:04:03

Tags: python pandas dataframe

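I have a DataFrame loaded with pd.read_csv(): 500 MB, 4 columns, 50xxx rows. I need to remove every row that has 0 or - (gap) in column 3 (Allele1 - AB) or column 4 (Allele2 - AB).

My code. Reading the CSV: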

data_skipped = pd.read_csv(cwd + file_list[i], sep='\t', skiprows = row_skipped_value, header = 0, index_col = False, dtype=object, low_memory = True)

Removing the gaps:

fixed_data = fixed_data.loc[fixed_data['Allele1 - AB' or 'Allele2 - AB'] != gap].dropna()

The line that removes the gap rows fails with:

KeyError: MemoryError()

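If I remove that line, everything works and the following steps run fine (but the resulting file still contains the gaps). 14 GB of RAM are free.

Any suggestions or solutions?

2 answers: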

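Answer 0 (score: 1)

Your code does not do what you want it to do.

Pandas does not use and, or, etc. as boolean operators on Series. See the Pandas documentation:

Another common operation is the use of boolean vectors to filter the data. The operators are: | for or, & for and, and ~ for not. These must be grouped by using parentheses

So you should filter your data this way. Instead of: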

fixed_data = fixed_data.loc[fixed_data['Allele1 - AB' or 'Allele2 - AB'] != gap].dropna()

do:

fixed_data.loc[(fixed_data['Allele1 - AB'] != gap) | (fixed_data['Allele2 - AB'] != gap)].dropna()

This needs no extra imports from other packages.
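As a small illustration of the element-wise operators, here is a sketch on hypothetical toy data, assuming `gap` is the string "-" (the question does not show its value). It compares what `|` and `&` keep, since the two give different results: `|` keeps a row when at least one column is not a gap, while `&` keeps a row only when neither column is a gap, which looks closer to what the question asks for.

```python
import pandas as pd

# Hypothetical toy data; `gap` is assumed to be the string "-".
gap = "-"
df = pd.DataFrame({
    "Allele1 - AB": ["A", "-", "B", "-"],
    "Allele2 - AB": ["B", "A", "-", "-"],
})

# Each comparison yields a boolean Series; parentheses are needed
# when the comparisons are combined inline with | or &.
not_gap_1 = df["Allele1 - AB"] != gap
not_gap_2 = df["Allele2 - AB"] != gap

# `|` keeps a row when at least one column is not a gap.
kept_with_or = df.loc[not_gap_1 | not_gap_2]

# `&` keeps a row only when neither column is a gap.
kept_with_and = df.loc[not_gap_1 & not_gap_2]

print(kept_with_or.index.tolist())
print(kept_with_and.index.tolist())
```

In this toy frame the `|` mask drops only the row where both columns are gaps, whereas the `&` mask drops every row that has a gap in either column.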

Answer 1 (score: -1)

You should try using

fixed_data.loc[fixed_data['Allele1 - AB' or 'Allele2 - AB'] != gap].dropna(inplace=True)

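without reassigning the result (the call returns None). In that case no copy of the array is created; see here.

Update: I don't think your code made much sense in the first place. 'Allele1 - AB' or 'Allele2 - AB' always evaluates to 'Allele1 - AB'. My guess is that you want to drop all rows containing NaN and keep only the rows where column Allele1 - AB is not equal to gap and Allele2 - AB is not equal to gap.

In that case, use: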
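To check the return value mentioned above, here is a minimal sketch on hypothetical toy data (again assuming `gap` is "-"). It only shows what the call returns and what happens to the original frame in this small case; it is not a measurement of memory use.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; `gap` is assumed to be the string "-".
gap = "-"
df = pd.DataFrame({
    "Allele1 - AB": ["A", np.nan, "-"],
    "Allele2 - AB": ["B", "A", "A"],
})

# dropna(inplace=True) returns None, so there is nothing useful to assign.
# Depending on the pandas version, this line may emit a SettingWithCopyWarning.
result = df.loc[df["Allele1 - AB"] != gap].dropna(inplace=True)

print(result)
print(len(df))
```

In this example the in-place call acts on the temporary object produced by `.loc[...]` with a boolean mask, so `df` itself still has all of its rows afterwards.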

import numpy as np
# Keep rows where both columns differ from gap, then drop rows containing NaN.
fixed_data = fixed_data[np.logical_and(fixed_data["Allele1 - AB"] != gap, fixed_data["Allele2 - AB"] != gap)].dropna()
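The point about `or` between two strings can be checked directly. The sketch below uses hypothetical toy data and assumes `gap` is "-"; it shows that Python resolves the `or` before pandas ever sees the key, so the original filter only tests the first column.

```python
import pandas as pd

# Plain Python evaluates `or` on the two strings first; a non-empty
# string is truthy, so the first operand is returned.
key = "Allele1 - AB" or "Allele2 - AB"
print(key)  # Allele1 - AB

# Hypothetical toy data; `gap` is assumed to be the string "-".
gap = "-"
df = pd.DataFrame({
    "Allele1 - AB": ["A", "B", "-"],
    "Allele2 - AB": ["B", "-", "A"],
})

# Equivalent to the filter in the question: only the first column is checked,
# so a row with a gap in the second column is still kept.
first_column_only = df.loc[df[key] != gap]
print(first_column_only.index.tolist())
```

Here the row with a gap only in "Allele2 - AB" survives the filter, which matches the symptom of gaps remaining in the output.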