Binary encoding (similar to one-hot encoding), but allowing multiple values in a single column and row

Time: 2019-11-26 17:47:26

Tags: python pandas dataframe

Imagine I have data in a DataFrame with an ID and three possible labels, for example:

+-------------------+-------+
|        ID         | TYPE  |
+-------------------+-------+
| Lord of the Rings | Movie |
| Lord of the Rings | Book  |
| Lord of the Rings | Game  |
| Alien             | Movie |
| Alien             | Game  |
| Fight Club        | Book  |
| Fight Club        | Movie |
| Scar Face         | Movie |
| God of War        | Game  |
| Tomb Raider       | Movie |
| Tomb Raider       | Game  |
| Borderlands       | Game  |
| Ulysses           | Book  |
+-------------------+-------+
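
For anyone who wants to reproduce this, the table above can be built roughly as follows. This is only a sketch: the variable name df is an assumption, chosen to match what the answers use.

```python
import pandas as pd

# (ID, TYPE) pairs copied from the table in the question.
rows = [
    ("Lord of the Rings", "Movie"),
    ("Lord of the Rings", "Book"),
    ("Lord of the Rings", "Game"),
    ("Alien", "Movie"),
    ("Alien", "Game"),
    ("Fight Club", "Book"),
    ("Fight Club", "Movie"),
    ("Scar Face", "Movie"),
    ("God of War", "Game"),
    ("Tomb Raider", "Movie"),
    ("Tomb Raider", "Game"),
    ("Borderlands", "Game"),
    ("Ulysses", "Book"),
]
df = pd.DataFrame(rows, columns=["ID", "TYPE"])
print(df)
```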

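What I want to do is essentially one-hot encode the data so that three columns, Movie, Book, and Game, are added, each binary-encoded to show whether that type is true or false for each ID. With this data, however, duplicate IDs are not taken into account. For example, if I use pd.get_dummies, I end up with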

+-------------------+-------+-------+------+------+
|        ID         | TYPE  | Movie | Game | Book |
+-------------------+-------+-------+------+------+
| Lord of the Rings | Movie |     1 |    0 |    0 |
| Lord of the Rings | Book  |     0 |    0 |    1 |
| Lord of the Rings | Game  |     0 |    1 |    0 |
| Alien             | Movie |     1 |    0 |    0 |
| Alien             | Game  |     0 |    1 |    0 |
| Fight Club        | Book  |     0 |    0 |    1 |
| Fight Club        | Movie |     1 |    0 |    0 |
| Scar Face         | Movie |     1 |    0 |    0 |
| God of War        | Game  |     0 |    1 |    0 |
| Tomb Raider       | Movie |     1 |    0 |    0 |
| Tomb Raider       | Game  |     0 |    1 |    0 |
| Borderlands       | Game  |     0 |    1 |    0 |
| Ulysses           | Book  |     0 |    0 |    1 |
+-------------------+-------+-------+------+------+

As expected, this gives a separate row for every record. So my question is: can I get this data into the form

+-------------------+-------------------+-------+------+------+
|        ID         |       TYPE        | Movie | Game | Book |
+-------------------+-------------------+-------+------+------+
| Lord of the Rings | [Movie,Game,Book] |     1 |    1 |    1 |
| Alien             | [Movie,Game]      |     1 |    1 |    0 |
| Fight Club        | [Movie,Book]      |     1 |    0 |    1 |
| Scar Face         | [Movie]           |     1 |    0 |    0 |
| God of War        | [Game]            |     0 |    1 |    0 |
| Tomb Raider       | [Movie,Game]      |     1 |    1 |    0 |
| Borderlands       | [Game]            |     0 |    1 |    0 |
| Ulysses           | [Book]            |     0 |    0 |    1 |
+-------------------+-------------------+-------+------+------+

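without completely transforming my data? Basically, I want to find all duplicate entries in ID and join them together so that all the types for a given unique ID sit in one place (ideally in a list within a single record), and then one-hot encode that in place, so that I can see all the true/false values for TYPE in the same row, aligned with the (now) unique ID.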

3 answers:

Answer 0 (score: 3)

You can do this:

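# Build the dummy columns from 'TYPE' (upper-case, as in the data), then collapse the rows per ID.
(pd.concat((pd.get_dummies(df['TYPE']), df), axis=1, sort=False)
   .groupby('ID', as_index=False, sort=False)
   .agg({'TYPE': list, 'Movie': 'sum', 'Game': 'sum', 'Book': 'sum'})
)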

Output:

                  ID                 TYPE  Movie  Game  Book
0  Lord of the Rings  [Movie, Book, Game]      1     1     1
1              Alien        [Movie, Game]      1     1     0
2         Fight Club        [Book, Movie]      1     0     1
3          Scar Face              [Movie]      1     0     0
4         God of War               [Game]      0     1     0
5        Tomb Raider        [Movie, Game]      1     1     0
6        Borderlands               [Game]      0     1     0
7            Ulysses               [Book]      0     0     1
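
One caveat worth noting: 'sum' counts occurrences, so if the same (ID, TYPE) pair appears more than once, the column will hold 2 (or more) rather than 1. The data in the question has no such repeats, so this does not affect the output shown. A minimal sketch, assuming strictly binary columns are wanted, swaps 'sum' for 'max'. The data here is hypothetical, with one row duplicated on purpose, and the names wide, agg_spec, and game_sum are illustrative only.

```python
import pandas as pd

# Hypothetical data: the ("Alien", "Game") row is deliberately duplicated.
df = pd.DataFrame({
    "ID": ["Alien", "Alien", "Alien", "Ulysses"],
    "TYPE": ["Movie", "Game", "Game", "Book"],
})

# dtype=int keeps the indicators as 0/1 regardless of the pandas default.
dummies = pd.get_dummies(df["TYPE"], dtype=int)
wide = pd.concat([df, dummies], axis=1)

# 'max' yields 1 if the type occurs at least once for the ID, else 0.
agg_spec = {"TYPE": list, **{col: "max" for col in dummies.columns}}
result = wide.groupby("ID", as_index=False, sort=False).agg(agg_spec)

# For comparison: summing the same column counts the duplicate twice.
game_sum = wide.groupby("ID", sort=False)["Game"].sum()

print(result)
print(game_sum)
```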

Answer 1 (score: 2)

You can use get_dummies together with str.join() after groupby():

final=df.groupby('ID',sort=False).agg(list)
final.assign(**final['TYPE'].str.join('|').str.get_dummies()).reset_index()

                  ID                 TYPE  Book  Game  Movie
0  Lord of the Rings  [Movie, Book, Game]     1     1      1
1              Alien        [Movie, Game]     0     1      1
2         Fight Club        [Book, Movie]     1     0      1
3          Scar Face              [Movie]     0     0      1
4         God of War               [Game]     0     1      0
5        Tomb Raider        [Movie, Game]     0     1      1
6        Borderlands               [Game]     0     1      0
7            Ulysses               [Book]     1     0      0
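
The approach above works because str.get_dummies splits each string on '|' by default, which is the same character used in str.join. The sketch below spells that out step by step on a subset of the question's rows, with the separator passed explicitly. The intermediate names lists and flags are illustrative, and it assumes no label contains the '|' character; if one might, a different separator would have to be used in both calls.

```python
import pandas as pd

# A subset of the rows from the question.
df = pd.DataFrame({
    "ID": ["Lord of the Rings"] * 3 + ["Alien"] * 2 + ["Ulysses"],
    "TYPE": ["Movie", "Book", "Game", "Movie", "Game", "Book"],
})

# One list of types per ID, keeping the original ID order.
lists = df.groupby("ID", sort=False)["TYPE"].agg(list)

# Join each list into "Movie|Book|Game", then split it back into 0/1 columns.
flags = lists.str.join("|").str.get_dummies(sep="|")

result = pd.concat([lists, flags], axis=1).reset_index()
print(result)
```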

Answer 2 (score: 2)

MultiLabelBinarizer

from sklearn.preprocessing import MultiLabelBinarizer

final = df.groupby('ID', as_index=False, sort=False).agg(list)

mlb = MultiLabelBinarizer()
a = mlb.fit_transform(final.TYPE)
final.assign(**dict(zip(mlb.classes_, a.T)))

                  ID                 TYPE  Book  Game  Movie
0  Lord of the Rings  [Movie, Book, Game]     1     1      1
1              Alien        [Movie, Game]     0     1      1
2         Fight Club        [Book, Movie]     1     0      1
3          Scar Face              [Movie]     0     0      1
4         God of War               [Game]     0     1      0
5        Tomb Raider        [Movie, Game]     0     1      1
6        Borderlands               [Game]     0     1      0
7            Ulysses               [Book]     1     0      0

value_counts

df.groupby('ID', sort=False).pipe(
    lambda g: g.agg(list).join(g.TYPE.value_counts().unstack(fill_value=0))
).reset_index()
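
No output was posted for the value_counts variant. A self-contained sketch for trying it on a subset of the question's rows is given below, with the pipe call unrolled into named steps (grouped, lists, and counts are illustrative names). Note that value_counts produces counts, which equal 0/1 only as long as no (ID, TYPE) pair is repeated; under that assumption the result matches the other answers, and otherwise something like counts.clip(upper=1) could be applied first.

```python
import pandas as pd

# A subset of the rows from the question.
df = pd.DataFrame({
    "ID": ["Lord of the Rings"] * 3 + ["Alien"] * 2 + ["Ulysses"],
    "TYPE": ["Movie", "Book", "Game", "Movie", "Game", "Book"],
})

grouped = df.groupby("ID", sort=False)

# One row per ID with the types collected into a list.
lists = grouped.agg(list)

# Count each (ID, TYPE) pair, then pivot the TYPE level into columns.
counts = grouped["TYPE"].value_counts().unstack(fill_value=0)

# Both frames are indexed by ID, so a plain index join lines them up.
result = lists.join(counts).reset_index()
print(result)
```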