Question

I have a pandas DataFrame in which one column holds a list of unique strings for each row:

obj_movies['unique_genres'].head()

0    [Action, Fantasy, Adventure, Science Fiction]
1                     [Action, Fantasy, Adventure]
2                       [Action, Adventure, Crime]
3                 [Action, Drama, Thriller, Crime]
4             [Action, Science Fiction, Adventure]
Name: unique_genres, dtype: object

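I would like to use pandas get_dummies() to create boolean features from the values in these lists and add them to the same DataFrame. For example, the feature 'Action_Movie' would be True (or have the value 1) for all of the first five rows.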

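To accomplish this, I built a set of the unique values across all the lists in the column. Then, in a for loop over each movie-tag feature (i.e. each unique value in the set), I used a boolean conversion method I wrote separately to build a list of 1s and 0s from the method's results. Finally, I simply appended each list as a new pandas Series.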
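The loop-based approach described above might look roughly like the following. This is only a sketch of what the question describes: the sample rows are copied from the head() output, the "_Movie" suffix follows the 'Action_Movie' example, and the membership test stands in for the asker's own boolean conversion method, which is not shown.

```python
import pandas as pd

# Sample rows copied from the head() output in the question.
obj_movies = pd.DataFrame({
    "unique_genres": [
        ["Action", "Fantasy", "Adventure", "Science Fiction"],
        ["Action", "Fantasy", "Adventure"],
        ["Action", "Adventure", "Crime"],
        ["Action", "Drama", "Thriller", "Crime"],
        ["Action", "Science Fiction", "Adventure"],
    ]
})

# Collect every distinct genre across all rows.
all_genres = set()
for genres in obj_movies["unique_genres"]:
    all_genres.update(genres)

# Add one 0/1 column per genre; sorting keeps the column order stable.
# The list comprehension is a stand-in for the asker's conversion method.
for genre in sorted(all_genres):
    obj_movies[genre + "_Movie"] = [
        int(genre in genres) for genres in obj_movies["unique_genres"]
    ]
```

This works, but it scans the whole column once per genre, which is the cost the question is hoping to avoid.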

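However, I think there must be a faster way! What about the pandas df.isin() method, for example? I looked into that as well, but it does not seem to work when you pass it a Series of lists.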

What is the best way to do this? Can anyone recommend a good online tutorial on advanced data manipulation with pandas?

Answer 1

If your column is made up of lists, you can indeed use get_dummies on it after a few transformations (apply(pd.Series), stack, and then groupby):

df_dummies = pd.get_dummies(obj_movies['unique_genres']
                                  .apply(pd.Series).stack()).groupby(level=0).sum()

Then add the new columns to your original DataFrame using join:

obj_movies = obj_movies.join(df_dummies)

You should get the expected output.
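For anyone who wants to try this end to end, here is a self-contained sketch of the same steps. The sample rows are copied from the question; the name `result` is introduced here only for illustration.

```python
import pandas as pd

# Sample rows copied from the head() output in the question.
obj_movies = pd.DataFrame({
    "unique_genres": [
        ["Action", "Fantasy", "Adventure", "Science Fiction"],
        ["Action", "Fantasy", "Adventure"],
        ["Action", "Adventure", "Crime"],
        ["Action", "Drama", "Thriller", "Crime"],
        ["Action", "Science Fiction", "Adventure"],
    ]
})

df_dummies = (
    pd.get_dummies(
        obj_movies["unique_genres"]
        .apply(pd.Series)  # one column per list position, NaN where a list is shorter
        .stack()           # one genre per row, indexed by (row, list position)
    )
    .groupby(level=0)      # regroup by the original row index
    .sum()                 # combine each row's one-hot vectors into 0/1 counts
)

# `result` is a hypothetical name; the answer reassigns obj_movies instead.
result = obj_movies.join(df_dummies)
```

Because each list holds unique strings, the summed values are always 0 or 1. If a list could repeat a genre, the sum would count it, so clipping the result to 1 would be needed.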

Answer 2

I believe you need:

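import pandas as pd

df = pd.DataFrame({
    'movie':['a', 'b', 'c'],
    'genre':[['Action', 'Fantasy', 'Adventure', 'Science Fiction'],['Action', 'Fantasy', 'Adventure'],['Action', 'Adventure', 'Crime']]
})
# One-hot encode one genre per row, then drop the inner (list position) index level
dum = pd.get_dummies(df['genre'].apply(pd.Series).stack()).reset_index(1, drop=True)
# Collapse the per-genre rows back into one row per movie
dum.groupby(dum.index).sum()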

Output:

   Action  Adventure  Crime  Fantasy  Science Fiction
0       1          1      0        1                1
1       1          1      0        1                0
2       1          1      1        0                0

You can then easily merge these dummy columns back into the original DataFrame with:

df.merge(dum.groupby(dum.index).sum(), left_index=True, right_index=True).drop('genre', axis=1)

Output:

  movie  Action  Adventure  Crime  Fantasy  Science Fiction
0     a       1          1      0        1                1
1     b       1          1      0        1                0
2     c       1          1      1        0                0
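As a side note, neither answer uses it, but on pandas 0.25 or later Series.explode() can take the place of the apply(pd.Series).stack() step. The following is a sketch under that assumption, reusing the sample frame from this answer; `exploded` and `result` are names introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'movie': ['a', 'b', 'c'],
    'genre': [['Action', 'Fantasy', 'Adventure', 'Science Fiction'],
              ['Action', 'Fantasy', 'Adventure'],
              ['Action', 'Adventure', 'Crime']]
})

# explode() gives one row per (movie, genre) pair and repeats the original index.
exploded = df['genre'].explode()

# One-hot encode, then sum the rows that share an original index.
dum = pd.get_dummies(exploded).groupby(level=0).sum()

# Attach the dummy columns in place of the list column.
result = df.drop(columns='genre').join(dum)
```

This avoids building the intermediate wide frame that apply(pd.Series) creates, which may matter when the lists are long or the frame is large.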

Creating Bool features from a feature containing lists using pandas get_dummies or isin
