data = {
"Name": ["Brad Pitt", "Brad Pitt", "Brad Pitt", "Rowan Atkinson", "Leonardo DiCaprio", "Leonardo DiCaprio",
"Uma Thurman", "Uma Thurman", "John Travolta", "John Travolta", "Brad Pitt", "Brad Pitt",
"Leonardo DiCaprio", "Leonardo DiCaprio"],
"Movie": ["Moneyball", "Moneyball", "Mr. & Mrs. Smith", "The Lion King", "Inception", "Inception",
"Pulp fiction", "Pulp fiction", "Pulp fiction", "Pulp fiction", "Twelve Monkeys", "Twelve Monkeys",
"Titanic", "Titanic"],
"Rating": [7, 7, 7.8, 9, 8.2, 8.2, 7.9, 7.9, 7.9, 7.9, 8, 8, 9.1, 9.1],
"Votes": [70, 70, 120, 150, 110, 110, 85, 85, 85, 85, 60, 60, 170, 170],
"Genre": ["Biography", "Drama", "Comedy", "Animation", "Action", "Thriller",
"Crime", "Drama", "Crime", "Drama", "Mystery", "Triller",
"Drama", "Romance"]
}
import pandas as pd
films = pd.DataFrame(data)
films
我想应用一些操作使其看起来像这样:
在1)电影中,我为每个演员放了movie.count(),2)评分成为唯一电影的平均评分,并且3)演员对唯一电影的投票汇总。
请帮助弄清楚如何进行此转换。谢谢。
答案 0 :(得分:2)
首先,您可以按名称和电影分组以删除重复项,然后按名称分组以聚合其余部分:
In [25]: films.groupby(["Name", "Movie"]).first().reset_index().groupby("Name")
...: .agg({"Movie": "count", "Rating": "mean", "Votes": "sum"})
Out[25]:
Movie Rating Votes
Name
Brad Pitt 3 7.60 250
John Travolta 1 7.90 85
Leonardo DiCaprio 2 8.65 280
Rowan Atkinson 1 9.00 150
Uma Thurman 1 7.90 85
答案 1 :(得分:1)
我将首先处理重复项,然后进行分组,而不是使用嵌套的groupby。
%timeit films.drop_duplicates(['Movie', 'Name']).groupby(['Name']).agg({'Movie' : 'count', 'Rating' : 'mean', 'Votes' : 'sum'})
2.55 ms ± 122 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit films.groupby(["Name", "Movie"]).first().reset_index().groupby("Name").agg({"Movie": "count", "Rating": "mean", "Votes": "sum"})
6.92 ms ± 143 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Movie Rating Votes
Name
Brad Pitt 3 7.60 250
John Travolta 1 7.90 85
Leonardo DiCaprio 2 8.65 280
Rowan Atkinson 1 9.00 150
Uma Thurman 1 7.90 85