Expanding a Series with OneHotEncoder

Date: 2019-05-01 00:24:45

Tags: python pandas one-hot-encoding

I have a pandas DataFrame with the following structure:

            tag_id
object_id   
    1           77
    2           77
    3           91
    4           91
    5           91
    6           91
    7           77
    8           91
    9           85
    10          88
    10          211
    11          100
    12          81
    12          91
    13          65
    14          73
    15          91
    16          174
    17          91
    18          62
    19          62
    20          91
    ...         ...
    1527        105
    1527        108
    1528        87
    1529        91

    1907 rows × 1 columns

As you can see, some index values do repeat, each time with a different 'tag_id' value. I would like to reorganize this DataFrame with OneHotEncoder into a sparse matrix of binary values, like this:

            1    2    3    ...    77    ...    85    ...    88    ...    91    ...    211
object_id
    1       0    0    0    ...    1     ...    0     ...     0    ...    0     ...     0
    2       0    0    0    ...    1     ...    0     ...     0    ...    0     ...     0
    3       0    0    0    ...    0     ...    0     ...     0    ...    1     ...     0
    4       0    0    0    ...    0     ...    0     ...     0    ...    1     ...     0
    5       0    0    0    ...    0     ...    0     ...     0    ...    1     ...     0
    6       0    0    0    ...    0     ...    0     ...     0    ...    1     ...     0
    7       0    0    0    ...    1     ...    0     ...     0    ...    0     ...     0
    8       0    0    0    ...    0     ...    0     ...     0    ...    1     ...     0
    9       0    0    0    ...    0     ...    1     ...     0    ...    0     ...     0
    10      0    0    0    ...    0     ...    0     ...     1    ...    0     ...     1

And so on.

Using pd.get_dummies(df['tag_id']) gives me almost what I want, but it does not stack rows that share an index, so I still end up with 1907 rows instead of 1907 minus the number of duplicates.
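To make the problem concrete, here is a small made-up frame that mirrors the table above (the values are illustrative, not the real data): get_dummies keeps one output row per input row, so a duplicated object_id stays duplicated.

import pandas as pd

# Toy data mirroring the table above: object_id 10 appears twice with different tags.
df = pd.DataFrame(
    {'tag_id': [77, 91, 88, 211]},
    index=pd.Index([1, 3, 10, 10], name='object_id'),
)

print(pd.get_dummies(df['tag_id']))
# Four rows come back: object_id 10 yields two separate rows,
# one with a 1 in column 88 and one with a 1 in column 211.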

Is there any way around this?

2 answers:

Answer 0 (score: 1):

Just use sum:

pd.get_dummies(df['tag_id']).sum(level=0).ne(0).astype(int)

Or drop the duplicates:

pd.get_dummies(df['tag_id'].groupby(level=0).first())
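For reference, a minimal self-contained sketch of the sum-based variant, written for newer pandas releases where DataFrame.sum(level=...) is no longer available and the grouping has to be spelled out explicitly (the sample data is made up to mirror the question):

import pandas as pd

# Toy data: object_id 10 carries two tags, 88 and 211.
df = pd.DataFrame(
    {'tag_id': [77, 77, 91, 88, 211]},
    index=pd.Index([1, 2, 3, 10, 10], name='object_id'),
)

# One-hot encode, collapse duplicate object_ids, then turn counts into 0/1 flags.
encoded = (
    pd.get_dummies(df['tag_id'])
    .groupby(level=0)
    .sum()
    .gt(0)
    .astype(int)
)
print(encoded)
# One row per object_id; row 10 has a 1 in both the 88 and 211 columns.

Note that the second variant keeps only the first tag of each object_id, so it is only equivalent when duplicated indices should be discarded rather than combined.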

Answer 1 (score: 0):

In addition to the excellent answer above, I found another option:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Categories: df_str is a master list of all possible 'tag_id' values
cat = [int(x) for x in sorted(df_str['id'].unique())]

# Collect every tag_id of each object_id into one list per row
data = df.groupby(df.index).agg(list)
data = data['tag_id'].apply(lambda row: [int(el) for el in row])

# Fit the binarizer on the tag lists and build the multi-hot matrix
mlb = MultiLabelBinarizer(classes=cat).fit(data)
encoded_data = mlb.transform(data)

# Wrap the result in a DataFrame with one 'tag_id_*' column per category
df_tags_encoded = pd.DataFrame(data=encoded_data, index=data.index, columns=["tag_id_" + str(name) for name in cat])
df_tags_encoded.head(10)

        57  58  59  60  61  62  63  64  65  66  ...     203 204 205 206 207 208 209 210 211 212
object_id                                                                                   
    1   0   0   0   0   0   0   0   0   0   0   ...     0   0   0   0   0   0   0   0   0   0
    2   0   0   0   0   0   0   0   0   0   0   ...     0   0   0   0   0   0   0   0   0   0
    3   0   0   0   0   0   0   0   0   0   0   ...     0   0   0   0   0   0   0   0   0   0
    4   0   0   0   0   0   0   0   0   0   0   ...     0   0   0   0   0   0   0   0   0   0
    5   0   0   0   0   0   0   0   0   0   0   ...     0   0   0   0   0   0   0   0   0   0
    6   0   0   0   0   0   0   0   0   0   0   ...     0   0   0   0   0   0   0   0   0   0
    7   0   0   0   0   0   0   0   0   0   0   ...     0   0   0   0   0   0   0   0   0   0
    8   0   0   0   0   0   0   0   0   0   0   ...     0   0   0   0   0   0   0   0   0   0
    9   0   0   0   0   0   0   0   0   0   0   ...     0   0   0   0   0   0   0   0   0   0
    10  0   0   0   0   0   0   0   0   0   0   ...     0   0   0   0   0   0   0   0   1   0

10 rows × 156 columns
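One extra detail, since the question explicitly asks for a sparse matrix (this is not part of the original answer, just a documented scikit-learn option): MultiLabelBinarizer can return a scipy CSR matrix directly via sparse_output=True, and pandas can wrap it without densifying. A sketch, assuming data and cat are built exactly as in the snippet above:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Same fit/transform as above, but the result is a scipy.sparse CSR matrix.
mlb = MultiLabelBinarizer(classes=cat, sparse_output=True).fit(data)
encoded_sparse = mlb.transform(data)

# Wrap the sparse matrix in a DataFrame with sparse columns (pandas >= 0.25).
df_tags_sparse = pd.DataFrame.sparse.from_spmatrix(
    encoded_sparse,
    index=data.index,
    columns=['tag_id_' + str(name) for name in cat],
)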