Question

I am trying to apply a filter to remove columns with too many NAs from a dask dataframe:

df.dropna(axis=1, how='all', thresh=round(len(df) * .8))

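Unfortunately, it seems that Dask's dropna API differs slightly from the pandas one and accepts neither an axis nor a threshold. One way to partially work around this is to iterate column by column and drop the ones that are constant (whether or not they are filled with NAs, since I don't mind getting rid of constant columns):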

    for col in df.columns:
        if len(df[col].unique()) == 1:
            new_df = df.drop(col, axis = 1)

However, this doesn't let me apply a threshold. I could compute the threshold manually by adding:

elif sum(df[col].isnull().compute()) / len(df[col]) > 0.8:
    new_df = df.drop(col, axis = 1)

But I'm not sure whether calling compute and len at this point is optimal, and I'd like to know whether there is a better way to go about this.

Answer 1

You are right, there is no way to do this with df.dropna().

I would suggest using this expression: df.loc[:,df.isnull().sum()<THRESHOLD]

Answer 2

We ran into a similar issue and used the following code:

for col in df.columns:
    # compute() returns a plain boolean: True when every value in the column is NA
    if df[col].isnull().all().compute():
        df = df.drop(col, axis=1)

Answer 3

df.loc[:,df.isnull().sum()<THRESHOLD]

raises a KeyError, because the indexer needs to be computed first:

KeyError: "None of [Index([ True,  True,  True,  True,  True,  True,  True,  True,  True,  True,\n        True,  True,  True, False, False, False, False, False,  True],\n      dtype='object')] are in the [columns]"

Combining the answers from skibee and Starukhin Yaroslav, I use:

df.loc[:, ~df.isna().all().compute()]

If you want to use a threshold, you can use:

df.loc[:, ~(df.isna().sum().compute() > THRESHOLD)]

Dask: Drop NA on columns?
