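How do I build a "multi-hot" encoding in Python / Pandas?

Time: 2016-05-06 17:48:40

Tags: python pandas

I need to combine three columns of categorical data into a single set of binary columns named after the categories. This is similar to "one-hot" encoding, but each source row has up to three categories instead of one. Also note that there are more than 100 categories, and I will not know them in advance.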

id, fruit1, fruit2, fruit3
1, apple, orange,
2, orange, , 
3, banana, apple,

should produce...

id, apple, banana, orange
1, 1, 0, 1
2, 0, 0, 1
3, 1, 1, 0

2 answers:

Answer 0: (score: 1)

You can use pd.melt to combine all the fruit columns into a single column, and pd.crosstab to build a frequency table:

import numpy as np
import pandas as pd

df = pd.read_csv('data')
# treat cells that hold only a single space as missing
df = df.replace(r' ', np.nan)
#    id   fruit1   fruit2   fruit3
# 0   1    apple   orange      NaN
# 1   2   orange      NaN      NaN
# 2   3   banana    apple      NaN

melted = pd.melt(df, id_vars=['id'])
result = pd.crosstab(melted['id'], melted['value'])
print(result)

yields

value   apple   banana   orange
id                             
1           1        0        1
2           0        0        1
3           1        1        0

Explanation: the melted DataFrame looks like this:

In [148]: melted = pd.melt(df, id_vars=['id']); melted
Out[149]: 
   id variable   value
0   1   fruit1   apple
1   2   fruit1  orange
2   3   fruit1  banana
3   1   fruit2  orange
4   2   fruit2     NaN
5   3   fruit2   apple
6   1   fruit3     NaN
7   2   fruit3     NaN
8   3   fruit3     NaN

We can ignore the variable column; only id and value matter. pd.crosstab can then be used to create a frequency table with the melted['id'] values in the index and the melted['value'] values as columns, which is the pd.crosstab(melted['id'], melted['value']) call in the code above.
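
A minimal, self-contained sketch of the melt/crosstab approach, assuming the table is built inline instead of being read from a file. The row with id 4 is a made-up addition (it is not in the question) in which apple appears twice. It illustrates that pd.crosstab returns counts rather than flags, so clipping at 1 is one way to keep the result strictly binary. The names counts and multi_hot are illustrative only.

```python
import pandas as pd

# Inline copy of the question's table; None marks an empty cell.
# Row id=4 is hypothetical and repeats "apple" to show the counting behaviour.
df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'fruit1': ['apple', 'orange', 'banana', 'apple'],
    'fruit2': ['orange', None, 'apple', 'apple'],
    'fruit3': [None, None, None, None],
})

# Long format: one row per (id, source column, fruit).
melted = pd.melt(df, id_vars=['id'])

# Frequency table; missing values are left out of the columns.
counts = pd.crosstab(melted['id'], melted['value'])

# Cap every count at 1 so repeated fruits still yield a 0/1 indicator.
multi_hot = counts.clip(upper=1)
print(multi_hot)
```

If a row can never contain the same category twice, the clip step is unnecessary and the crosstab result is already binary.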

Answer 1: (score: 0)

You can apply value_counts to each row:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'fruit1': ['Apple', 'Banana', np.nan],
    'fruit2': ['Banana', np.nan, 'Apple'],
    'fruit3': ['Grape', np.nan, np.nan],
    })

# count the values in each row, fill absent categories with 0, and cast to int
df = df.apply(lambda row: row.value_counts(), axis=1).fillna(0).astype(int)

Before:

   fruit1  fruit2 fruit3
0   Apple  Banana  Grape
1  Banana     NaN    NaN
2     NaN   Apple    NaN

After:

   Apple  Banana  Grape
0      1       1      1
1      0       1      0
2      1       0      0
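
As an alternative sketch that avoids a Python-level function call per row, the same table can probably be obtained by stacking the columns and using pd.get_dummies. This is not from the answer above. It assumes the same example frame, and the names stacked and multi_hot are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'fruit1': ['Apple', 'Banana', np.nan],
    'fruit2': ['Banana', np.nan, 'Apple'],
    'fruit3': ['Grape', np.nan, np.nan],
})

# Stack the three columns into one long Series indexed by (row, column),
# and drop the empty cells.
stacked = df.stack().dropna()

# One indicator column per distinct value, then collapse back to one row per
# original row; max() keeps the result 0/1 even if a value repeats in a row.
multi_hot = pd.get_dummies(stacked).groupby(level=0).max().astype(int)
print(multi_hot)
```

One design caveat: a row whose cells are all empty disappears from the stacked Series, so it would be missing from the result unless it is added back, for example with reindex(df.index, fill_value=0).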