Question

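I have a pandas DataFrame with a large number of columns, and I need to find which columns are binary (containing only the values 0 or 1) without inspecting the data by eye. Which function should I use?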

Answer 1

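As far as I know, there is no built-in function that tests for this directly. Instead, you need to build something based on how the data is encoded (e.g. 1/0, T/F, True/False, etc.). Also, if a column has missing values, the whole column is stored as float rather than int.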

In the example below, I test whether every non-null value in a column is either 1 or 0. It returns a list of all such columns.

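import pandas as pd

df = pd.DataFrame({'bool': [1, 0, 1, None], 
                   'floats': [1.2, 3.1, 4.4, 5.5], 
                   'ints': [1, 2, 3, 4], 
                   'str': ['a', 'b', 'c', 'd']})

# Keep a column if every non-null value is 0 or 1.
# (Select a Series with df[col]; a one-column DataFrame has no .unique().)
bool_cols = [col for col in df 
             if df[col].dropna().isin([0, 1]).all()]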

>>> bool_cols
['bool']

>>> df[bool_cols]
   bool
0   1.0
1   0.0
2   1.0
3   NaN
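
Since the right test depends on how the data is encoded, here is a minimal sketch that accepts several encodings at once. The names `BINARY_ENCODINGS` and `find_binary_columns` are hypothetical and only for illustration, and the list of encodings is an assumption you would adjust to your own data.

```python
import pandas as pd

# Hypothetical list of encodings to treat as binary; adjust to your data.
# True/False compare equal to 1/0 in Python, so {0, 1} also covers booleans.
BINARY_ENCODINGS = [{0, 1}, {"T", "F"}, {"True", "False"}]

def find_binary_columns(df, encodings=BINARY_ENCODINGS):
    """Return names of columns whose non-null values all fall within one encoding."""
    result = []
    for col in df.columns:
        values = set(df[col].dropna().unique())
        # Skip columns that have no non-null values at all.
        if values and any(values <= enc for enc in encodings):
            result.append(col)
    return result

df = pd.DataFrame({
    "flag": [1, 0, 1, None],      # stored as float because of the missing value
    "tf": ["T", "F", "T", "F"],
    "ints": [1, 2, 3, 4],
    "str": ["a", "b", "c", "d"],
})

binary_cols = find_binary_columns(df)
print(binary_cols)
```

Note that this sketch uses a subset test, so a constant column containing only 1 (or only 0) is also reported as binary.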

Answer 2

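def is_binary(series, allow_na=False):
    """Return True if the Series contains exactly the values 0 and 1."""
    if allow_na:
        # Drop NaN on a copy rather than mutating the caller's Series in place.
        series = series.dropna()
    return sorted(series.unique()) == [0, 1]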

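This is the most efficient solution I have found. It is faster than the answer above, and the difference in run time becomes significant on large datasets.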
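
As a rough usage sketch, the function can be applied to every column of a DataFrame. The helper is restated here so the snippet runs on its own (using a non-mutating `dropna()`), and the sample data is made up for illustration.

```python
import pandas as pd

def is_binary(series, allow_na=False):
    # Restated from the answer so this snippet is self-contained.
    if allow_na:
        series = series.dropna()
    return sorted(series.unique()) == [0, 1]

df = pd.DataFrame({
    "flag": [1, 0, 1, 0],
    "flag_na": [1, 0, 1, None],
    "all_ones": [1, 1, 1, 1],
    "ints": [1, 2, 3, 4],
})

# Strict: any NaN disqualifies the column.
strict = [col for col in df if is_binary(df[col])]
# Lenient: NaN values are ignored before testing.
lenient = [col for col in df if is_binary(df[col], allow_na=True)]

print(strict)
print(lenient)
```

Because the comparison is against the exact list `[0, 1]`, a column must contain both values to qualify, so a constant column such as `all_ones` is rejected. Calling `sorted` on a column with mixed types (for example strings and numbers) may raise a `TypeError`, so this approach assumes numeric columns.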

Answer 3

To extend the answer above, using value_counts().index instead of unique() should do the trick:

bool_cols = [col for col in df if df[col].value_counts().index.isin([0, 1]).all()]

Answer 4

Improving on @Aiden's answer to avoid returning empty columns:

[col for col in df if len(df[col].value_counts()) > 0 and df[col].value_counts().index.isin([0, 1]).all()]
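
To illustrate why the length check matters, here is a small sketch with a made-up DataFrame containing an all-NaN column. `value_counts()` excludes NaN by default, so that column has an empty index, and `.all()` on an empty array is vacuously true.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "flag": [1, 0, 1, np.nan],
    "empty": [np.nan, np.nan, np.nan, np.nan],
    "ints": [1, 2, 3, 4],
})

# Without the length check, the all-NaN column slips through.
without_check = [col for col in df
                 if df[col].value_counts().index.isin([0, 1]).all()]

# With the length check, columns that have no non-null values are skipped.
with_check = []
for col in df:
    counts = df[col].value_counts()  # NaN is excluded by default
    if len(counts) > 0 and counts.index.isin([0, 1]).all():
        with_check.append(col)

print(without_check)
print(with_check)
```

Storing `value_counts()` in a variable also avoids computing it twice per column.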

Answer 5

Using Alexander's answer, with Python version 3.6.6:

[col for col in df if np.isin(df[col].unique(), [0, 1]).all()]

Answer 6

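You can simply use pandas' unique() function on each column of your dataset.

For example: df["colname"].unique()

This returns an array of all the unique values in the specified column.

You can also use a for loop to iterate over all the columns in the dataset.

For example: [df[cols].unique() for cols in df]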
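
Listing the unique values does not by itself say which columns are binary, so one possible follow-up step is to compare each column's unique values against the set {0, 1}. This is a sketch with made-up data, and the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "flag": [1, 0, 1, 0],
    "ints": [1, 2, 3, 4],
    "str": ["a", "b", "c", "d"],
})

# Collect the unique values of every column.
unique_by_col = {col: df[col].unique() for col in df}

# Keep the columns whose unique values are a subset of {0, 1}.
binary_cols = [col for col, values in unique_by_col.items()
               if set(values) <= {0, 1}]

print(binary_cols)
```

This version does not drop missing values first, so a column containing NaN is not reported as binary.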

Which columns in a Pandas DataFrame are binary?

6 answers: