Question

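I have a dataset containing movie titles and the genres each movie belongs to. Every movie has more than one genre, so for the whole dataset I want to find the total number of unique genres present.

I can't use df.unique(), because each entry in the genres column is itself a list of genres.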

    movieId  title                                  genres
0   1        Toy Story (1995)                       Adventure|Animation|Children|Comedy|Fantasy
1   2        Jumanji (1995)                         Adventure|Children|Fantasy
2   3        Grumpier Old Men (1995)                Comedy|Romance
3   4        Waiting to Exhale (1995)               Comedy|Drama|Romance
4   5        Father of the Bride Part II (1995)     Comedy
5   6        Heat (1995)                            Action|Crime|Thriller
6   7        Sabrina (1995)                         Comedy|Romance
7   8        Tom and Huck (1995)                    Adventure|Children
8   9        Sudden Death (1995)                    Action
9   10       GoldenEye (1995)                       Action|Adventure|Thriller
10  11       American President, The (1995)         Comedy|Drama|Romance
11  12       Dracula: Dead and Loving It (1995)     Comedy|Horror
12  13       Balto (1995)                           Adventure|Animation|Children
13  14       Nixon (1995)                           Drama
14  15       Cutthroat Island (1995)                Action|Adventure|Romance
15  16       Casino (1995)                          Crime|Drama
16  17       Sense and Sensibility (1995)           Drama|Romance
17  18       Four Rooms (1995)                      Comedy
18  19       Ace Ventura: When Nature Calls (1995)  Comedy
19  20       Money Train (1995)                     Action|Comedy|Crime|Drama|Thriller
20  21       Get Shorty (1995)                      Comedy|Crime|Thriller
21  22       Copycat (1995)                         Crime|Drama|Horror|Mystery|Thriller
22  23       Assassins (1995)                       Action|Crime|Thriller
23  24       Powder (1995)                          Drama|Sci-Fi
24  25       Leaving Las Vegas (1995)               Drama|Romance
25  26       Othello (1995)                         Drama
26  27       Now and Then (1995)                    Children|Drama
27  28       Persuasion (1995)                      Drama|Romance
28  29       City of Lost Children, The (Cité des enfants p...

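This is the movie dataset.

In the genres column, I want to split Action|Comedy|Crime|Drama|Thriller into Action, Comedy, Crime, Drama, Thriller.

I also want to find the unique genres across the whole dataset, which is now a DataFrame.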

Answer 1

You can do it as follows:

import pandas as pd

df = pd.DataFrame({'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
                   'genres': ['Adventure|Animation|Children|Comedy|Fantasy',
                              'Adventure|Children|Fantasy',
                              'Comedy|Romance']})

# Split every genre string on '|' and collect the pieces into a set to drop duplicates.
a = list(set(y for x in df['genres'] for y in x.split('|')))
print(a)

Output (the order may vary, because sets are unordered):

['Animation', 'Comedy', 'Children', 'Fantasy', 'Adventure', 'Romance']
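
The same split-then-deduplicate idea can also be written with pandas methods only. This is a minimal sketch, assuming pandas 0.25 or later (needed for `Series.explode`); the sample frame and the names `exploded` and `n_unique` are just for illustration:

```python
import pandas as pd

# Small sample frame with the same shape as the one used above.
df = pd.DataFrame({
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
    'genres': ['Adventure|Animation|Children|Comedy|Fantasy',
               'Adventure|Children|Fantasy',
               'Comedy|Romance'],
})

# Split each string into a list, then give every genre its own row.
exploded = df['genres'].str.split('|').explode()

unique_genres = sorted(exploded.unique())  # sorted only to get a stable order
n_unique = exploded.nunique()              # total number of distinct genres

print(unique_genres)
print(n_unique)
```

Here `nunique()` answers the "total number of unique genres" part of the question directly, without building the list first.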

Answer 2

Try this approach:

import functools
import operator

# Returns a list of lists, one list of genres per movie.
temp = df.genres.str.split("|").tolist()

# Flatten the list of lists, then call set() to keep only the unique genres.
# Use len(unique_genres) afterwards to get the number of unique genres.
unique_genres = set(functools.reduce(operator.concat, temp))
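
If the dataset is large, `itertools.chain.from_iterable` is a possible substitute for the `reduce` call, since it walks the inner lists without building a new list at every concatenation step. A rough sketch, where the hard-coded `temp` is only a stand-in for the list of lists produced above:

```python
from itertools import chain

# Stand-in for `temp`: one list of genres per movie (sample data, not the full dataset).
temp = [
    ['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy'],
    ['Adventure', 'Children', 'Fantasy'],
    ['Comedy', 'Romance'],
]

# Iterate over all inner lists lazily and keep only the distinct genres.
unique_genres = set(chain.from_iterable(temp))

print(sorted(unique_genres))
print(len(unique_genres))  # number of unique genres
```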

Answer 3

Try the following:

import pandas as pd

df = pd.read_csv('movies.csv')
# Turn each 'A|B|C' string into a list of genres.
df['genres'] = df['genres'].apply(lambda x: x.strip().split('|'))
# Number of genres listed for each movie.
df['count'] = df['genres'].apply(len)
print(df)

Output:

   movieId  ...                                             genres count
0        1  ...  [Adventure, Animation, Children, Comedy, Fantasy]     5
1        2  ...                     [Adventure, Children, Fantasy]     3
2        3  ...                                  [Comedy, Romance]     2
3        4  ...                           [Comedy, Drama, Romance]     3
4        5  ...                                           [Comedy]     1
5        6  ...                          [Action, Crime, Thriller]     3
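
The count column above is per movie. To get from the list column to the unique genres across the whole dataset, one option is to explode the lists and count the values. This is a sketch under assumptions: the small frame below is a hypothetical stand-in for movies.csv after the split step, and `Series.explode` requires pandas 0.25 or later.

```python
import pandas as pd

# Hypothetical sample standing in for movies.csv after the split step.
df = pd.DataFrame({
    'movieId': [1, 2, 3],
    'genres': [['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy'],
               ['Adventure', 'Children', 'Fantasy'],
               ['Comedy', 'Romance']],
})

# One row per (movie, genre) pair, then count how many movies carry each genre.
genre_counts = df['genres'].explode().value_counts()

print(genre_counts)       # movies per genre
print(len(genre_counts))  # number of unique genres
```

The index of `genre_counts` holds the unique genres themselves, so the same Series covers both the list and the total.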

How do I find the unique list items in a Python DataFrame?
