Imagine I have data in a DataFrame with an ID and three possible labels, for example:
+-------------------+-------+
| ID | TYPE |
+-------------------+-------+
| Lord of the Rings | Movie |
| Lord of the Rings | Book |
| Lord of the Rings | Game |
| Alien | Movie |
| Alien | Game |
| Fight Club | Book |
| Fight Club | Movie |
| Scar Face | Movie |
| God of War | Game |
| Tomb Raider | Movie |
| Tomb Raider | Game |
| Borderlands | Game |
| Ulysses | Book |
+-------------------+-------+
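For anyone who wants to reproduce this, here is a minimal sketch that builds the table above as a DataFrame. The variable name df is an assumption, chosen to match what the answers use.

```python
import pandas as pd

# Sample data copied from the table above: one row per (ID, TYPE) pair.
# The name `df` is an assumption, picked to match the answers.
df = pd.DataFrame({
    "ID": [
        "Lord of the Rings", "Lord of the Rings", "Lord of the Rings",
        "Alien", "Alien",
        "Fight Club", "Fight Club",
        "Scar Face",
        "God of War",
        "Tomb Raider", "Tomb Raider",
        "Borderlands",
        "Ulysses",
    ],
    "TYPE": [
        "Movie", "Book", "Game",
        "Movie", "Game",
        "Book", "Movie",
        "Movie",
        "Game",
        "Movie", "Game",
        "Game",
        "Book",
    ],
})
```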
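What I want to do is essentially one-hot encode the data, adding three columns, Movie, Book, and Game, that are binary-encoded to show whether each type is true or false for each ID. With this data, however, duplicate IDs are not taken into account. For example, if I use pd.get_dummies, I end up with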
+-------------------+-------+-------+------+------+
| ID | TYPE | Movie | Game | Book |
+-------------------+-------+-------+------+------+
| Lord of the Rings | Movie | 1 | 0 | 0 |
| Lord of the Rings | Book | 0 | 0 | 1 |
| Lord of the Rings | Game | 0 | 1 | 0 |
| Alien | Movie | 1 | 0 | 0 |
| Alien | Game | 0 | 1 | 0 |
| Fight Club | Book | 0 | 0 | 1 |
| Fight Club | Movie | 1 | 0 | 0 |
| Scar Face | Movie | 1 | 0 | 0 |
| God of War | Game | 0 | 1 | 0 |
| Tomb Raider | Movie | 1 | 0 | 0 |
| Tomb Raider | Game | 0 | 1 | 0 |
| Borderlands | Game | 0 | 1 | 0 |
| Ulysses | Book | 0 | 0 | 1 |
+-------------------+-------+-------+------+------+
As expected, this gives one row per record. So my question is: can I get this data into
+-------------------+-------------------+-------+------+------+
| ID | TYPE | Movie | Game | Book |
+-------------------+-------------------+-------+------+------+
| Lord of the Rings | [Movie,Game,Book] | 1 | 1 | 1 |
| Alien | [Movie,Game] | 1 | 1 | 0 |
| Fight Club | [Movie,Book] | 1 | 0 | 1 |
| Scar Face | [Movie] | 1 | 0 | 0 |
| God of War | [Game] | 0 | 1 | 0 |
| Tomb Raider | [Movie,Game] | 1 | 1 | 0 |
| Borderlands | [Game] | 0 | 1 | 0 |
| Ulysses | [Book] | 0 | 0 | 1 |
+-------------------+-------------------+-------+------+------+
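without completely transforming my data? Basically, I want to find all duplicate entries in ID and join them together, so that all types for a given unique ID sit in one place (ideally as a list in a single record), and then one-hot encode that. That way I can see all the true/false values for TYPE in the same row, aligned with the (now) unique ID.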
Answer 0 (score: 3)
You can do it like this:
# Note: the column is named 'TYPE' (upper case), not 'Type'.
(pd.concat((pd.get_dummies(df['TYPE']), df), axis=1, sort=False)
   .groupby('ID', as_index=False, sort=False)
   .agg({'TYPE': list, 'Movie': 'sum', 'Game': 'sum', 'Book': 'sum'})
)
Output:
ID TYPE Movie Game Book
0 Lord of the Rings [Movie, Book, Game] 1 1 1
1 Alien [Movie, Game] 1 1 0
2 Fight Club [Book, Movie] 1 0 1
3 Scar Face [Movie] 1 0 0
4 God of War [Game] 0 1 0
5 Tomb Raider [Movie, Game] 1 1 0
6 Borderlands [Game] 0 1 0
7 Ulysses [Book] 0 0 1
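One caveat: 'sum' counts occurrences, so if the same (ID, TYPE) pair appears more than once, the column ends up greater than 1. If that can happen in your data, a variant with 'max' keeps the columns binary. This is a minimal sketch on hypothetical data with a deliberately repeated row, not part of the original question:

```python
import pandas as pd

# Hypothetical data: ("Alien", "Movie") is repeated on purpose.
df = pd.DataFrame({
    "ID": ["Alien", "Alien", "Alien", "Ulysses"],
    "TYPE": ["Movie", "Movie", "Game", "Book"],
})

# dtype=int gives 0/1 columns instead of booleans.
dummies = pd.get_dummies(df["TYPE"], dtype=int)

out = (
    pd.concat([dummies, df], axis=1)
    .groupby("ID", as_index=False, sort=False)
    # 'max' yields 1 if the type occurs at least once, regardless of repeats.
    .agg({"TYPE": list, **{col: "max" for col in dummies.columns}})
)
```

With 'sum' in place of 'max', the Movie value for Alien would be 2 here.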
Answer 1 (score: 2)
You can use groupby(), then str.join() followed by str.get_dummies():
final=df.groupby('ID',sort=False).agg(list)
final.assign(**final['TYPE'].str.join('|').str.get_dummies()).reset_index()
ID TYPE Book Game Movie
0 Lord of the Rings [Movie, Book, Game] 1 1 1
1 Alien [Movie, Game] 0 1 1
2 Fight Club [Book, Movie] 1 0 1
3 Scar Face [Movie] 0 0 1
4 God of War [Game] 0 1 0
5 Tomb Raider [Movie, Game] 0 1 1
6 Borderlands [Game] 0 1 0
7 Ulysses [Book] 1 0 0
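To see why the join step is needed: str.get_dummies() splits each string on a separator ('|' by default), so the lists are first joined into 'Movie|Book'-style strings. A small sketch on a hypothetical two-row Series:

```python
import pandas as pd

# Hypothetical Series of label lists, indexed by ID.
types = pd.Series(
    [["Movie", "Book"], ["Game"]],
    index=["Fight Club", "God of War"],
)

# Join each list into a single '|'-delimited string.
joined = types.str.join("|")

# Split on the default separator '|' and build one 0/1 column per label.
encoded = joined.str.get_dummies()
```

This only works as intended if no label itself contains the '|' character.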
Answer 2 (score: 2)
MultiLabelBinarizer
from sklearn.preprocessing import MultiLabelBinarizer
final = df.groupby('ID', as_index=False, sort=False).agg(list)
mlb = MultiLabelBinarizer()
a = mlb.fit_transform(final.TYPE)
final.assign(**dict(zip(mlb.classes_, a.T)))
ID TYPE Book Game Movie
0 Lord of the Rings [Movie, Book, Game] 1 1 1
1 Alien [Movie, Game] 0 1 1
2 Fight Club [Book, Movie] 1 0 1
3 Scar Face [Movie] 0 0 1
4 God of War [Game] 0 1 0
5 Tomb Raider [Movie, Game] 0 1 1
6 Borderlands [Game] 0 1 0
7 Ulysses [Book] 1 0 0
value_counts
df.groupby('ID', sort=False).pipe(
    lambda g: g.agg(list).join(g.TYPE.value_counts().unstack(fill_value=0))
).reset_index()
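The value_counts().unstack() step builds a table of counts per (ID, TYPE). A similar table can be produced with pd.crosstab, which is a different technique from the one above and sorts the IDs alphabetically. A minimal sketch on a hypothetical subset of the data, with clip(upper=1) added as an assumption in case a pair is repeated:

```python
import pandas as pd

# Hypothetical subset of the question's data.
df = pd.DataFrame({
    "ID": ["Alien", "Alien", "Fight Club", "Fight Club", "Ulysses"],
    "TYPE": ["Movie", "Game", "Book", "Movie", "Book"],
})

# Count how often each TYPE occurs for each ID (rows: ID, columns: TYPE).
counts = pd.crosstab(df["ID"], df["TYPE"])

# Cap counts at 1 so repeated pairs still yield binary flags.
flags = counts.clip(upper=1)

# Collect the types per ID into lists and attach the flag columns.
final = df.groupby("ID").agg(list).join(flags).reset_index()
```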