I have a pandas DataFrame that looks like this:
           tag_id
object_id
1              77
2              77
3              91
4              91
5              91
6              91
7              77
8              91
9              85
10             88
10            211
11            100
12             81
12             91
13             65
14             73
15             91
16            174
17             91
18             62
19             62
20             91
...           ...
1527          105
1527          108
1528           87
1529           91
1907 rows × 1 columns
As you can see, some index values are repeated, each time with a different tag_id value. I would like to reorganize this DataFrame with OneHotEncoder, turning it into a sparse matrix of binary values like this:
           1  2  3  ...  77  ...  85  ...  88  ...  91  ...  211
object_id
1          0  0  0  ...   1  ...   0  ...   0  ...   0  ...    0
2          0  0  0  ...   1  ...   0  ...   0  ...   0  ...    0
3          0  0  0  ...   0  ...   0  ...   0  ...   1  ...    0
4          0  0  0  ...   0  ...   0  ...   0  ...   1  ...    0
5          0  0  0  ...   0  ...   0  ...   0  ...   1  ...    0
6          0  0  0  ...   0  ...   0  ...   0  ...   1  ...    0
7          0  0  0  ...   1  ...   0  ...   0  ...   0  ...    0
8          0  0  0  ...   0  ...   0  ...   0  ...   1  ...    0
9          0  0  0  ...   0  ...   1  ...   0  ...   0  ...    0
10         0  0  0  ...   0  ...   0  ...   1  ...   0  ...    1
and so on.
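Using pd.get_dummies(df['tag_id']) gives me what I want, except that it does not collapse rows that share the same index value, so I still get 1907 rows instead of 1907 minus the number of duplicates.
Is there a way to fix this?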
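The behaviour described in the question can be reproduced with a small sketch. The five-row frame below is a hypothetical sample that only mirrors the shape of the question's data (object_id 10 appears twice); it is not the original dataset.

```python
import pandas as pd

# Hypothetical sample: object_id is the index, and object 10 carries two tags.
df = pd.DataFrame(
    {"tag_id": [77, 77, 91, 88, 211]},
    index=pd.Index([1, 2, 3, 10, 10], name="object_id"),
)

dummies = pd.get_dummies(df["tag_id"])

# get_dummies emits one output row per input row, so the duplicated
# index value 10 still occupies two rows.
print(dummies.shape)            # (5, 4)
print(dummies.index.is_unique)  # False
print(df.index.nunique())       # 4
```

The goal is therefore a frame with 4 rows here, one per distinct object_id.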
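Answer 0 (score: 1)
Just sum over the index level (on pandas older than 2.0 this could also be written as .sum(level=0), which has since been removed):
pd.get_dummies(df['tag_id']).groupby(level=0).sum().ne(0).astype(int)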
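As an illustration of the summing approach, here is a minimal sketch on a hypothetical five-row sample (not the question's real data). It uses groupby(level=0).sum(), which works on current pandas releases where sum(level=...) is no longer available.

```python
import pandas as pd

# Hypothetical sample: object 10 has two tags, 88 and 211.
df = pd.DataFrame(
    {"tag_id": [77, 77, 91, 88, 211]},
    index=pd.Index([1, 2, 3, 10, 10], name="object_id"),
)

encoded = (
    pd.get_dummies(df["tag_id"])
    .groupby(level=0)  # group rows that share an object_id
    .sum()             # add up the indicator columns within each group
    .ne(0)             # any non-zero count means the tag is present
    .astype(int)       # back to 0/1 integers
)

print(encoded.shape)              # (4, 4)
print(encoded.loc[10].tolist())   # [0, 1, 0, 1]
```

The .ne(0) step keeps the result binary even if the same tag were listed twice for one object.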
Or drop the duplicates first, which keeps only the first tag_id for each index value:
pd.get_dummies(df['tag_id'].groupby(level=0).first())
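It may help to see what the first() variant does to an object with several tags. The sketch below again uses a hypothetical five-row sample; with it, the second tag of object 10 never reaches the encoded frame.

```python
import pandas as pd

# Hypothetical sample: object 10 has two tags, 88 and 211.
df = pd.DataFrame(
    {"tag_id": [77, 77, 91, 88, 211]},
    index=pd.Index([1, 2, 3, 10, 10], name="object_id"),
)

# Keep only the first tag of every object, then one-hot encode.
first_only = pd.get_dummies(df["tag_id"].groupby(level=0).first())

print(first_only.shape)            # (4, 3)
print(211 in first_only.columns)   # False
```

So this variant is only equivalent to the summing approach when losing the extra tags is acceptable.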
Answer 1 (score: 0)
In addition to the excellent answer above, I found another option:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Definition of categories (df_str is a master list of all possible 'tag_id' values)
cat = [int(x) for x in sorted(df_str['id'].unique())]
# Definition of data: collect all tags of each object_id into one list
data = df.groupby(df.index).agg(list)
data = data['tag_id'].apply(lambda row: [int(el) for el in row])
# Fit the binarizer on the fixed category list and encode each object's tag list
mlb = MultiLabelBinarizer(classes=cat).fit(data)
encoded_data = mlb.transform(data)
# Wrap the binary matrix in a DataFrame indexed by object_id
df_tags_encoded = pd.DataFrame(data=encoded_data, index=data.index, columns=["tag_id_" + str(name) for name in cat])
df_tags_encoded.head(10)
           57  58  59  60  61  62  63  64  65  66  ...  203  204  205  206  207  208  209  210  211  212
object_id
1           0   0   0   0   0   0   0   0   0   0  ...    0    0    0    0    0    0    0    0    0    0
2           0   0   0   0   0   0   0   0   0   0  ...    0    0    0    0    0    0    0    0    0    0
3           0   0   0   0   0   0   0   0   0   0  ...    0    0    0    0    0    0    0    0    0    0
4           0   0   0   0   0   0   0   0   0   0  ...    0    0    0    0    0    0    0    0    0    0
5           0   0   0   0   0   0   0   0   0   0  ...    0    0    0    0    0    0    0    0    0    0
6           0   0   0   0   0   0   0   0   0   0  ...    0    0    0    0    0    0    0    0    0    0
7           0   0   0   0   0   0   0   0   0   0  ...    0    0    0    0    0    0    0    0    0    0
8           0   0   0   0   0   0   0   0   0   0  ...    0    0    0    0    0    0    0    0    0    0
9           0   0   0   0   0   0   0   0   0   0  ...    0    0    0    0    0    0    0    0    0    0
10          0   0   0   0   0   0   0   0   0   0  ...    0    0    0    0    0    0    0    0    1    0
10 rows × 156 columns
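Since the question asks for a sparse matrix, the MultiLabelBinarizer route can also be sketched with sparse_output=True. This is a minimal, self-contained sketch on a hypothetical five-row sample; the category list is derived from the sample itself rather than from a master list such as df_str, which is an assumption made only to keep the example runnable.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical sample: object 10 has two tags, 88 and 211.
df = pd.DataFrame(
    {"tag_id": [77, 77, 91, 88, 211]},
    index=pd.Index([1, 2, 3, 10, 10], name="object_id"),
)

# Collect every tag of an object into one list per object_id.
tags_per_object = df.groupby(level=0)["tag_id"].agg(list)

# Assumed category list, taken from the sample instead of a master table.
categories = sorted(df["tag_id"].unique().tolist())

# sparse_output=True returns a SciPy sparse matrix instead of a dense array.
mlb = MultiLabelBinarizer(classes=categories, sparse_output=True)
encoded = mlb.fit_transform(tags_per_object)

print(encoded.shape)                    # (4, 4)
print(encoded.toarray()[-1].tolist())   # [0, 1, 0, 1]
```

Rows follow the order of tags_per_object.index and columns follow the order of categories, so both can be kept alongside the matrix as labels.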