Question

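I have a pandas DataFrame with a number of columns, each representing the frequency of a word in a corpus. Each row corresponds to a document, and every column is of type float64.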

For example:

word1 word2 word3
0.0   0.3   1.0
0.1   0.0   0.5
etc

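I want to binarize this so that, instead of frequencies, I end up with a boolean DataFrame (0s and 1s) indicating whether each word is present.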

So the example above would be converted to:

word1 word2 word3
0      1     1
1      0     1
etc

I looked at get_dummies(), but the output was not what I expected.

Answer 1

Casting to boolean yields True for every nonzero entry and False for every zero entry. If you then cast to integer, you get 1s and 0s.

import io
import pandas as pd

data = io.StringIO('''\
word1 word2 word3
0.0   0.3   1.0
0.1   0.0   0.5
''')
# sep=r'\s+' splits on runs of whitespace; delim_whitespace=True is deprecated in recent pandas
df = pd.read_csv(data, sep=r'\s+')

res = df.astype(bool).astype(int)
print(res)

Output:

   word1  word2  word3
0      0      1      1
1      1      0      1

Answer 2

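I would have answered the same way as @Alberto Garcia-Raboso, but here is a very fast alternative that relies on the same idea.

Use np.where (this assumes numpy has been imported as np):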

pd.DataFrame(np.where(df, 1, 0), df.index, df.columns)
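
As a self-contained sketch of the one-liner above, the snippet below rebuilds the sample frame from the question (the literal values are copied from it; the variable name res is only illustrative) and applies np.where to it:

```python
import numpy as np
import pandas as pd

# Sample frequencies taken from the question.
df = pd.DataFrame(
    {"word1": [0.0, 0.1], "word2": [0.3, 0.0], "word3": [1.0, 0.5]}
)

# np.where treats nonzero entries as True and picks 1 for them, 0 otherwise.
# It returns a plain NumPy array, so the index and columns are passed back in.
res = pd.DataFrame(np.where(df, 1, 0), index=df.index, columns=df.columns)
print(res)
```

Because np.where returns a bare array, the labels have to be reattached by hand, which is why df.index and df.columns appear in the constructor call.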

Answer 3

Code:

import numpy as np
import pandas as pd

# create some test data
random_data = np.random.random([3, 3])
random_data[0,0] = 0.0
random_data[1,2] = 0.0

df = pd.DataFrame(random_data,
     columns=['A', 'B', 'C'], index=['first', 'second', 'third'])

print(df)

# binarize: True where the value is strictly positive, then cast to int
threshold = lambda x: x > 0
df_ = df.apply(threshold).astype(int)

print(df_)

Output:

               A         B         C
first   0.000000  0.610263  0.301024
second  0.728070  0.229802  0.000000
third   0.243811  0.335131  0.863908
        A  B  C
first   0  1  1
second  1  1  0
third   1  1  1

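Explanation:

get_dummies() looks at every unique value in each column and introduces a new column for each unique value, marking whether that value is present in a given row.
In other words, if column A has 20 unique values, 20 new columns are added, and in each row exactly one of them is true while the others are false.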
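
To illustrate the explanation above, here is a small sketch with a hypothetical single-column frame (the values are made up for the example). Since get_dummies() only encodes object, string, and category columns by default, the float column is named explicitly through the columns argument:

```python
import pandas as pd

# Hypothetical frame: one column holding three distinct frequency values.
df = pd.DataFrame({"word1": [0.0, 0.1, 0.3, 0.1]})

# columns=[...] forces encoding of the numeric column.
dummies = pd.get_dummies(df, columns=["word1"])

# One indicator column per unique value, not one 0/1 column per word.
print(dummies.columns.tolist())
print(dummies.shape)
```

Each row ends up with exactly one indicator set, which is one-hot encoding of the values rather than a presence flag, so it is not what the question asks for.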

Answer 4

I found another way, using pandas indexing.

This can be done with:

df[df>0] = 1

It's that simple!
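
Two details of the masked assignment are worth seeing in a sketch: it modifies the frame in place, and the columns stay float64 (0.0 and 1.0) unless they are cast. The example below reuses the sample values from the question; the copy and the final astype(int) are additions for illustration, not part of the answer itself:

```python
import pandas as pd

# Sample frequencies taken from the question.
df = pd.DataFrame(
    {"word1": [0.0, 0.1], "word2": [0.3, 0.0], "word3": [1.0, 0.5]}
)

# Work on a copy, because the masked assignment modifies the frame in place.
binary = df.copy()
binary[binary > 0] = 1

# The columns are still float64 at this point; cast if integers are wanted.
binary = binary.astype(int)
print(binary)
```

Skipping the copy is fine when the original frequencies are no longer needed.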

Binarize a float64 Pandas DataFrame in Python
