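I need to combine three columns of categorical data into a set of binary columns named after the categories. This is similar to one-hot encoding, except that each source row can have up to three categories instead of one. Also note that there are 100+ categories, and I won't know them in advance.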
id, fruit1, fruit2, fruit3
1, apple, orange,
2, orange, ,
3, banana, apple,
should produce:
id, apple, banana, orange
1, 1, 0, 1
2, 0, 0, 1
3, 1, 1, 0
Answer 0 (score: 1)
You can use pd.melt to combine all the fruit columns into a single column, then use pd.crosstab to build a frequency table:
import numpy as np
import pandas as pd
df = pd.read_csv('data')
# Treat blank cells as missing values
df = df.replace(r' ', np.nan)
#    id  fruit1  fruit2  fruit3
# 0   1   apple  orange     NaN
# 1   2  orange     NaN     NaN
# 2   3  banana   apple     NaN
# Reshape from wide to long: one row per (id, fruit column) pair
melted = pd.melt(df, id_vars=['id'])
# Count how often each fruit appears for each id
result = pd.crosstab(melted['id'], melted['value'])
print(result)
yields
value  apple  banana  orange
id
1          1       0       1
2          0       0       1
3          1       1       0
Explanation: the melted DataFrame looks like this:
In [148]: melted = pd.melt(df, id_vars=['id']); melted
Out[149]:
   id variable   value
0   1   fruit1   apple
1   2   fruit1  orange
2   3   fruit1  banana
3   1   fruit2  orange
4   2   fruit2     NaN
5   3   fruit2   apple
6   1   fruit3     NaN
7   2   fruit3     NaN
8   3   fruit3     NaN
We can ignore the variable column; only the id and value columns matter.
pd.crosstab then builds a frequency table with the melted['id'] values as the index and the melted['value'] values as the columns, which is the result shown above.
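For anyone who wants to try the melt-then-crosstab idea without a CSV file, here is a minimal sketch. It assumes the question's sample rows are typed in by hand and that missing fruits are stored as None. The .clip(upper=1) step is my own addition, not part of the answer above: crosstab returns counts, so a fruit listed twice in one row would otherwise show up as 2 rather than 1.

```python
import pandas as pd

# Hand-typed copy of the question's sample data (an assumption; the
# answer above reads it from a CSV file instead).
df = pd.DataFrame({
    "id": [1, 2, 3],
    "fruit1": ["apple", "orange", "banana"],
    "fruit2": ["orange", None, "apple"],
    "fruit3": [None, None, None],
})

# Wide to long: one row per (id, fruit column), then drop the empty slots.
melted = pd.melt(df, id_vars=["id"]).dropna(subset=["value"])

# Count each fruit per id, then cap the counts at 1 so the table stays
# a 0/1 indicator even if a fruit is repeated within a row.
result = pd.crosstab(melted["id"], melted["value"]).clip(upper=1)
print(result)
```

Note that an id with no fruits at all has no rows left after dropna, so it will not appear in the crosstab output.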
Answer 1 (score: 0)
You can apply value_counts to each row:
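import pandas as pd
import numpy as np
df = pd.DataFrame({
'fruit1': ['Apple', 'Banana', np.nan],
'fruit2': ['Banana', np.nan, 'Apple'],
'fruit3': ['Grape', np.nan, np.nan],
})
# Count the fruits in each row; fruits absent from a row come back as NaN,
# so fill those with 0 and cast the counts to int.
# (astype(int) replaces applymap(int), which newer pandas versions deprecate.)
df = df.apply(lambda row: row.value_counts(), axis=1).fillna(0).astype(int)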
Before:
fruit1 fruit2 fruit3
0 Apple Banana Grape
1 Banana NaN NaN
2 NaN Apple NaN
After:
Apple Banana Grape
0 1 1 1
1 0 1 0
2 1 0 0
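If the row-wise apply turns out to be slow on a large frame, one possible alternative is sketched below. It swaps in pd.get_dummies in place of the per-row value_counts, so treat it as a different technique and not as this answer's method. It assumes the same three-column sample frame, with None for the missing entries.

```python
import pandas as pd

# Same sample frame as above, with None standing in for missing fruits.
df = pd.DataFrame({
    "fruit1": ["Apple", "Banana", None],
    "fruit2": ["Banana", None, "Apple"],
    "fruit3": ["Grape", None, None],
})

# Wide to long while keeping the original row labels as the index.
melted = df.melt(ignore_index=False)

# One 0/1 column per fruit for every long row (missing values become
# all-zero rows), then collapse back to one row per original row label.
indicators = pd.get_dummies(melted["value"], dtype=int).groupby(level=0).max()
print(indicators)
```

Taking the max per row label keeps the result binary even when a fruit is repeated, and a row with no fruits at all is kept as a row of zeros.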