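I have a dataset containing movie titles and the genres each movie belongs to. Each movie has more than one genre. For the whole dataset, I want to find the total number of unique genres.
I can't use df.unique(), because each cell of the genres column holds several genres rather than a single value.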
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy
5 6 Heat (1995) Action|Crime|Thriller
6 7 Sabrina (1995) Comedy|Romance
7 8 Tom and Huck (1995) Adventure|Children
8 9 Sudden Death (1995) Action
9 10 GoldenEye (1995) Action|Adventure|Thriller
10 11 American President, The (1995) Comedy|Drama|Romance
11 12 Dracula: Dead and Loving It (1995) Comedy|Horror
12 13 Balto (1995) Adventure|Animation|Children
13 14 Nixon (1995) Drama
14 15 Cutthroat Island (1995) Action|Adventure|Romance
15 16 Casino (1995) Crime|Drama
16 17 Sense and Sensibility (1995) Drama|Romance
17 18 Four Rooms (1995) Comedy
18 19 Ace Ventura: When Nature Calls (1995) Comedy
19 20 Money Train (1995) Action|Comedy|Crime|Drama|Thriller
20 21 Get Shorty (1995) Comedy|Crime|Thriller
21 22 Copycat (1995) Crime|Drama|Horror|Mystery|Thriller
22 23 Assassins (1995) Action|Crime|Thriller
23 24 Powder (1995) Drama|Sci-Fi
24 25 Leaving Las Vegas (1995) Drama|Romance
25 26 Othello (1995) Drama
26 27 Now and Then (1995) Children|Drama
27 28 Persuasion (1995) Drama|Romance
28 29 City of Lost Children, The (Cité des enfants p...
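This is the movie dataset.
In the genres column, I want to split a value such as Action|Comedy|Crime|Drama|Thriller into the separate genres Action, Comedy, Crime, Drama and Thriller.
I also want to find the unique genres across the whole dataset, which is currently a DataFrame.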
Answer 0 (score: 0)
You can do it as follows:
import pandas as pd

df = pd.DataFrame({'title':['Toy Story (1995)','Jumanji (1995)','Grumpier Old Men (1995)'],
'genres':['Adventure|Animation|Children|Comedy|Fantasy','Adventure|Children|Fantasy','Comedy|Romance']})
# Split every genre string on '|' and collect the pieces in a set to drop duplicates.
a = list(set(y for x in df['genres'] for y in x.split('|')))
print(a)
Output (the order may vary, because a set is unordered):
['Animation', 'Comedy', 'Children', 'Fantasy', 'Adventure', 'Romance']
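The same result can also be reached with pandas methods alone. The following is a minimal sketch, assuming pandas 0.25 or later (where `Series.explode` is available) and reusing the three sample rows from this answer; the names `exploded` and `n_unique` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Grumpier Old Men (1995)'],
    'genres': ['Adventure|Animation|Children|Comedy|Fantasy',
               'Adventure|Children|Fantasy',
               'Comedy|Romance'],
})

# Split each genre string into a list, then give every genre its own row.
exploded = df['genres'].str.split('|').explode()

# Sort the unique values so the printed order is stable.
unique_genres = sorted(exploded.unique())
n_unique = exploded.nunique()

print(unique_genres)  # ['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy', 'Romance']
print(n_unique)       # 6
```

Sorting is optional; it is only there so that the output does not depend on the order in which the genres first appear.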
Answer 1 (score: 0)
Try this approach:
import functools
import operator

# Split each genre string on '|'; tolist() returns a list of lists.
temp = df.genres.str.split("|").tolist()
# Flatten the list of lists, then call set() to keep only the unique genres.
# len(unique_genres) then gives the number of unique genres.
unique_genres = set(functools.reduce(operator.concat, temp))
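As a possible variation on the flattening step, `itertools.chain.from_iterable` walks the nested lists lazily instead of building a new, longer list at every concatenation, which may matter on a large dataset. A minimal sketch, assuming a small hand-made `df` in place of the real data:

```python
from itertools import chain

import pandas as pd

# Stand-in data; the real DataFrame would come from the movie dataset.
df = pd.DataFrame({
    'genres': ['Adventure|Animation|Children|Comedy|Fantasy',
               'Adventure|Children|Fantasy',
               'Comedy|Romance'],
})

# One list of genres per movie.
genre_lists = df['genres'].str.split('|')

# chain.from_iterable yields every genre in turn; set() removes duplicates.
unique_genres = set(chain.from_iterable(genre_lists))

print(len(unique_genres))  # 6
```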
Answer 2 (score: 0)
Try the following, which splits each genre string into a list and counts the genres per movie:
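import pandas as pd

df = pd.read_csv('movies.csv')
# Turn each 'a|b|c' genre string into a list of genres.
df['genres'] = df['genres'].apply(lambda x: x.strip().split('|'))
# Number of genres listed for each movie.
df['count'] = df['genres'].apply(len)
print(df)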
Output:
movie Id ... genres count
0 1 ... [Adventure, Animation, Children, Comedy, Fantasy] 5
1 2 ... [Adventure, Children, Fantasy] 3
2 3 ... [Comedy, Romance] 2
3 4 ... [Comedy, Drama, Romance] 3
4 5 ... [Comedy] 1
5 6 ... [Action, Crime, Thriller] 3
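The count column above is per movie. To get the unique genres across the whole dataset, one option is `Series.str.get_dummies`, which works on the original pipe-separated strings and produces one 0/1 column per genre. A minimal sketch, assuming the six rows shown in the output stand in for movies.csv; `dummies`, `n_unique` and `movies_per_genre` are illustrative names:

```python
import pandas as pd

# The first six rows of the dataset, with genres still as raw strings.
df = pd.DataFrame({
    'movieId': [1, 2, 3, 4, 5, 6],
    'genres': ['Adventure|Animation|Children|Comedy|Fantasy',
               'Adventure|Children|Fantasy',
               'Comedy|Romance',
               'Comedy|Drama|Romance',
               'Comedy',
               'Action|Crime|Thriller'],
})

# One indicator column per genre; a 1 means the movie has that genre.
dummies = df['genres'].str.get_dummies(sep='|')

unique_genres = list(dummies.columns)  # every distinct genre
n_unique = dummies.shape[1]            # number of distinct genres
movies_per_genre = dummies.sum()       # how many movies carry each genre

print(unique_genres)
print(n_unique)                        # 10
print(movies_per_genre['Comedy'])      # 4
```

The indicator table also covers the other half of the question, since each genre ends up in its own column and can be joined back to the titles if needed.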