假设我在https://bpaste.net/show/05fa224794e4中有一个包含10000部电影的数据集,并且该数据集的一个摘要是
tt0111161 The Shawshank Redemption (1994) 1994 9.2 619479 142 mins. Crime|Drama
tt0110912 Pulp Fiction (1994) 1994 9.0 490065 154 mins. Crime|Thriller
tt0137523 Fight Club (1999) 1999 8.8 458173 139 mins. Drama|Mystery|Thriller
tt0133093 The Matrix (1999) 1999 8.7 448114 136 mins. Action|Adventure|Sci-Fi
tt1375666 Inception (2010) 2010 8.9 385149 148 mins. Action|Adventure|Sci-Fi|Thriller
tt0109830 Forrest Gump (1994) 1994 8.7 368994 142 mins. Comedy|Drama|Romance
tt0169547 American Beauty (1999) 1999 8.6 338332 122 mins. Drama
tt0499549 Avatar (2009) 2009 8.1 336855 162 mins. Action|Adventure|Fantasy|Sci-Fi
tt0108052 Schindler's List (1993) 1993 8.9 325888 195 mins. Biography|Drama|History|War
tt0080684 Star Wars: Episode V - The Empire Strikes Back (1980) 1980 8.8 320105 124 mins. Action|Adventure|Family|Sci-Fi
tt0372784 Batman Begins (2005) 2005 8.3 316613 140 mins. Action|Crime|Drama|Thriller
tt0114814 The Usual Suspects (1995) 1995 8.7 306624 106 mins. Crime|Mystery|Thriller
tt0102926 The Silence of the Lambs (1991) 1991 8.7 293081 118 mins. Crime|Thriller
tt0120338 Titanic (1997) 1997 7.4 284245 194 mins. Adventure|Drama|History|Romance
我有一段代码要加载我的数据集并对其进行一些更改
import pandas as pd
import numpy as np
headers = ['imdbID', 'title', 'year', 'score', 'votes', 'runtime', 'genres']
movies = pd.read_csv("imdb_top_10000.txt", sep="\t", header=None, names=headers, encoding='UTF-8')
movies.head()
one_hot_encoding = movies["genres"].str.get_dummies(sep='|')
movies = pd.concat([movies, one_hot_encoding], axis=1)
movies_top_250 = movies.sort_values('score', ascending=False).head(250)
给出这个
也许我在想透视表?这里仅使用流派列的子集。
pd.pivot_table(movies_top_250, values=['votes', 'Action', 'Adult'], index='title', aggfunc=np.sum).sort_values('votes', ascending=False)
Action Adult votes
title
The Shawshank Redemption (1994) 0 0 619479
The Dark Knight (2008) 1 0 555122
Pulp Fiction (1994) 0 0 490065
The Godfather (1972) 0 0 474189
Fight Club (1999) 0 0 458173
The Lord of the Rings: The Fellowship of the Ri... 1 0 451263
The Matrix (1999) 1 0 448114
The Lord of the Rings: The Return of the King (... 1 0 428791
Inception (2010) 1 0 385149
The Lord of the Rings: The Two Towers (2002) 1 0 383113
Forrest Gump (1994) 0 0 368994
但这并不能说明哪种类型的票数最多。还有
movies.groupby('genres').score.mean()
返回类似
genres
Action 5.837500
Action|Adventure 6.152381
Action|Adventure|Animation|Comedy|Family|Fantasy 7.500000
Action|Adventure|Animation|Family|Fantasy|Sci-Fi 6.100000
Action|Adventure|Biography|Crime|History|Western 6.300000
Action|Adventure|Biography|Drama|History 7.700000
因此,我对此并不能完全理解。对于第一个问题,我正在考虑获得类似的东西
Genre mean_score votes_sum
Action 7.837500 103237
Adventure 6.152381 103226
Animation 5.500000 103275
答案 0 :(得分:0)
import io
import numpy as np
import pandas as pd
colnames = ['imdbID', 'title', 'year', 'score', 'votes', 'runtime', 'genres']
data_url = 'https://bpaste.net/raw/05fa224794e4'
movies = pd.read_csv(data_url, sep="\t", header=None, names=colnames, encoding='UTF-8', index_col='imdbID')
还有一个有用的功能
def arg_nlargest(x, n, use_index=True):
if isinstance(x, pd.Series):
x = x.values
return np.argpartition(-x, n)[:n]
首先获得前250名电影:
top250_iloc = arg_nlargest(movies['score'], 250)
movies250 = movies.iloc[top250_iloc]
接下来,我们像您一样将每部电影的流派扩展到指标中
movies250_genre_inds = movies250["genres"].str.get_dummies(sep='|')
天真的方法是遍历指标列,为每种流派收集汇总。
genre_agg = {}
for genre in movies250_genre_inds.columns:
mask = movies250_genre_inds[genre].astype(bool)
aggregates = movies250.loc[mask].agg({'score': 'mean', 'votes': 'sum'})
genre_agg[genre] = aggregates.tolist()
genre_agg = pd.DataFrame.from_dict(genre_agg, orient='index', columns=['score_mean', 'votes_sum'])
genre3_iloc = arg_nlargest(genre_agg['score_mean'], 3)
genre3 = genre_agg.iloc[genre3_iloc].sort_values('score_mean', ascending=False)
答案 1 :(得分:0)
您可以使用以下 oneline 解决方案(仅将换行符转为漂亮格式):
movies = \
(movies.set_index(mv.columns.drop('genres',1).tolist())
.genres.str.split('|',expand=True)
.stack()
.reset_index()
.rename(columns={0:'genre'})
.loc[:,['genre','score','votes']]
.groupby('genre').agg({'score':['mean'], 'votes':['sum']})
)
score votes
mean sum
genre
Action 8.425714 7912508
Adventure 8.430000 7460632
Animation 8.293333 1769806
Biography 8.393750 2112875
Comedy 8.341509 3166269
...
主要问题是True
流程在类型上产生的多个one_hot_encoding
值。一部电影可以分配给一种或多种类型。因此,您不能按类型正确使用聚合方法。另一方面,按原样使用genres
字段将消除您在问题中显示的多个性别结果:
genres
Action 5.837500
Action|Adventure 6.152381
Action|Adventure|Animation|Comedy|Family|Fantasy 7.500000
Action|Adventure|Animation|Family|Fantasy|Sci-Fi 6.100000
Action|Adventure|Biography|Crime|History|Western 6.300000
Action|Adventure|Biography|Drama|History 7.700000
一种解决方法是,在找到多个性别时复制行。通过将split
与expand
方法设置为True
的组合,可以创建多个数据帧,然后将它们堆叠。例如,具有2种体裁的电影将出现在2个结果数据帧中,其中每个数据帧代表分配给每种体裁的电影。最后,在解析之后,您可以按性别汇总具有多个功能的信息。我将逐步解释:
加载数据:
import pandas as pd
import numpy as np
headers = ['imdbID', 'title', 'year', 'score', 'votes', 'runtime', 'genres']
movies = pd.read_csv("imdb_top_10000.txt", sep="\t", header=None, names=headers, encoding='UTF-8')
请注意,您在genres
字段中具有空值:
imdbID title year score votes runtime genres
7917 tt0990404 Chop Shop (2007) 2007 7.2 2104 84 mins. NaN
由于使用Pandas的聚合方法将忽略具有任何空值的行,并且在此字段上我们只有1部电影具有空值,因此可以手动设置(在Imdb上选中):
movies.loc[movies.genres.isnull(),"genres"] = "Drama"
现在,正如您已经显示的,我们需要按得分排名前250部电影:
movies = movies.sort_values('score', ascending=False).head(250)
仅将流派字段保留为列,将其他字段保留为索引。这是为了简化流派工作。
movies = movies.set_index(movies.columns.drop('genres',1).tolist())
genres
imdbID title year score votes runtime
tt0111161 The Shawshank Redemption (1994) 1994 9.2 619479 142 mins. Crime|Drama
tt0068646 The Godfather (1972) 1972 9.2 474189 175 mins. Crime|Drama
tt0060196 The Good, the Bad and the Ugly (1966) 1966 9.0 195238 161 mins. Western
tt0110912 Pulp Fiction (1994) 1994 9.0 490065 154 mins. Crime|Thriller
tt0252487 Outrageous Class (1975) 1975 9.0 9823 87 mins. Comedy|Drama
(250, 1)
这将从N次拆分迭代中创建N个数据帧。
movies = movies.genres.str.split('|',expand=True)
0 \
imdbID title year score votes runtime
tt0111161 The Shawshank Redemption (1994) 1994 9.2 619479 142 mins. Crime
tt0068646 The Godfather (1972) 1972 9.2 474189 175 mins. Crime
tt0060196 The Good, the Bad and the Ugly (1966) 1966 9.0 195238 161 mins. Western
tt0110912 Pulp Fiction (1994) 1994 9.0 490065 154 mins. Crime
tt0252487 Outrageous Class (1975) 1975 9.0 9823 87 mins. Comedy
1 \
imdbID title year score votes runtime
tt0111161 The Shawshank Redemption (1994) 1994 9.2 619479 142 mins. Drama
tt0068646 The Godfather (1972) 1972 9.2 474189 175 mins. Drama
tt0060196 The Good, the Bad and the Ugly (1966) 1966 9.0 195238 161 mins. None
tt0110912 Pulp Fiction (1994) 1994 9.0 490065 154 mins. Thriller
tt0252487 Outrageous Class (1975) 1975 9.0 9823 87 mins. Drama
...
现在,每部电影都有一个唯一的体裁值,如果分配了一种以上的体裁,一部电影可以有多于1行,您可以堆叠数据帧集。请注意,现在我们有超过250行(662行),但是有250部不同的电影。
movies = movies.stack()
imdbID title year score votes runtime
tt0111161 The Shawshank Redemption (1994) 1994 9.2 619479 142 mins. 0 Crime
1 Drama
tt0068646 The Godfather (1972) 1972 9.2 474189 175 mins. 0 Crime
1 Drama
tt0060196 The Good, the Bad and the Ugly (1966) 1966 9.0 195238 161 mins. 0 Western
dtype: object
(662,)
在聚合之前获取合适的数据结构:
# Multiple index to columns
movies = movies.reset_index()
# Name the new column for genre
movies = movies.rename(columns={0:'genre'})
# Only wanted fields to be aggregated
movies = movies.loc[:,['genre','score','votes']]
genre score votes
0 Crime 9.2 619479
1 Drama 9.2 619479
2 Crime 9.2 474189
3 Drama 9.2 474189
4 Western 9.0 195238
(662, 3)
根据您的要求,分数必须按均值汇总,投票应按总和汇总:
movies = movies.groupby('genres').agg({'score':['mean'], 'votes':['sum']})
score votes
mean sum
genre
Action 8.425714 7912508
Adventure 8.430000 7460632
Animation 8.293333 1769806
Biography 8.393750 2112875
Comedy 8.341509 3166269
(21, 2)