I have a data frame that looks like this:
> head(All_data,10) %>% select(movieId, year, Genre, rate, voteCount)
movieId year Genre rate voteCount
6 2019 1954 adventure 4.255074 13994
14 908 1959 adventure 4.205228 19013
17 5618 2001 adventure 4.202589 20855
20 6016 2002 adventure 4.187873 19947
22 3030 1961 adventure 4.180265 4155
35 1136 1975 adventure 4.155154 39058
43 1196 1980 adventure 4.142536 61672
48 1198 1981 adventure 4.134614 59693
51 260 1977 adventure 4.132299 77045
53 1197 1987 adventure 4.131257 40931
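The data has about 10,000 rows. I want to compute the mean rating for each genre in each year, then build a data frame that holds, for each year, the highest mean rating and the genre it belongs to.
In short, I want to find the highest-rated genre for each year. Is there a simple way to do this?
I tried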
All_data %>% group_by(year,Genre) %>% summarise(meanRate = mean(rate*voteCount))
but it gives the same value for every genre (it does not compute the mean for each (year, genre) pair).
A random sample of the data:
> sample_n(All_data, 100) %>% select(movieId, year, Genre, rate, voteCount)
movieId year Genre rate voteCount
2258 122924 2016 adventure 3.326963 1936
11556 92535 2011 comedy 3.967404 1718
42542 2184 1955 mystery 3.633638 1534
14411 2735 1986 comedy 2.881314 4276
35587 4383 2000 thriller 3.464252 1049
56900 50068 2006 war 3.818182 2992
567 68358 2009 adventure 3.872353 14873
35755 2455 1986 thriller 3.403843 12178
27656 49278 2006 action 3.603452 4780
23671 3100 1992 drama 3.680055 6548
28728 103772 2013 action 3.232092 2443
60892 3549 1955 musical 3.744101 1780
37701 1200 1986 horror 3.993474 32943
11819 7084 1972 comedy 3.863470 1095
32883 7439 2004 crime 3.046360 2308
31031 81788 2010 crime 3.720724 1409
21026 140 1996 romance 3.309959 5633
23375 48696 2006 drama 3.769120 1739
24306 4713 1980 drama 3.483529 1700
24242 3179 1999 drama 3.504162 1682
14238 135 1996 comedy 2.997654 7674
12484 6296 2003 comedy 3.665834 3401
31289 69122 2009 crime 3.639332 14731
24694 52975 2007 drama 3.337893 1974
12463 2580 1999 comedy 3.671448 9376
13311 63859 2008 comedy 3.396336 3111
25156 7318 2004 drama 3.163142 4303
46419 122892 2015 sci-fi 3.594980 5817
12492 3516 1958 comedy 3.663376 1013
39730 91974 2012 horror 3.314297 1217
14129 3525 1984 comedy 3.066811 2073
64248 3671 1974 western 3.858894 12356
35710 104243 2013 thriller 3.418432 1747
18948 60069 2008 romance 4.008884 24538
25547 47044 2006 drama 2.941546 1591
2504 2 1995 adventure 3.236953 26060
28750 653 1996 action 3.219821 16609
12083 2791 1980 comedy 3.787693 21907
31017 2952 1996 crime 3.723631 1972
23673 5464 2002 drama 3.679694 10081
23342 4211 1990 drama 3.777429 1276
25157 5107 2002 drama 3.162754 1278
20391 85 1995 romance 3.538339 2830
23783 53318 2006 drama 3.645732 1441
28674 74685 2010 action 3.256219 1005
23899 5380 2002 drama 3.607206 2262
40237 1334 1958 horror 3.110867 2954
13634 2407 1985 comedy 3.277061 12472
29650 3316 2000 action 2.667995 2634
25578 8870 2004 drama 2.921805 1573
10292 5159 1992 children 3.106359 1321
1699 118696 2014 adventure 3.530134 3949
27669 7827 2002 action 3.601378 1016
62710 1064 1996 musical 3.104600 3609
20002 281 1994 romance 3.658513 4088
14254 276 1994 comedy 2.982415 5715
60734 4642 2000 musical 3.794605 2836
24772 2001 1989 drama 3.309075 15582
17491 30793 2005 fantasy 3.233414 11606
21221 33679 2005 romance 3.238069 11776
28778 93363 2012 action 3.208293 2074
21687 3452 2000 romance 3.022416 5041
2889 8974 2004 adventure 3.069362 1175
31825 61024 2008 crime 3.471492 4981
9623 2414 1985 children 3.373310 2810
14423 1432 1997 comedy 2.869686 1435
1559 2857 1968 adventure 3.573637 5174
35832 45 1995 thriller 3.372716 9632
21926 1100 1990 romance 2.858228 6253
1080 38038 2005 adventure 3.720714 5711
28526 2013 1972 action 3.305879 5919
27891 3726 1976 action 3.535299 1119
14645 6338 2003 comedy 2.694928 1380
22610 3090 1987 drama 4.059865 1186
4117 152081 2016 animation 3.950761 5849
38952 80831 2010 horror 3.589114 1038
13973 135887 2015 comedy 3.135554 1741
41040 1971 1988 horror 2.415489 1982
29432 4310 2001 action 2.844663 11327
25271 37727 2005 drama 3.104432 2437
27406 62849 2008 action 3.682334 2468
21589 3269 1992 romance 3.078978 3387
33840 2208 1938 thriller 4.061708 2447
24077 2321 1998 drama 3.559651 15666
53741 2693 1997 documentary 3.642332 1904
44218 2505 1999 mystery 2.975070 4653
26508 69524 1989 action 3.976781 1249
12916 61323 2008 comedy 3.534853 7087
9479 62376 2008 children 3.425710 1198
29333 135567 2016 action 2.916914 1011
18910 1244 1979 romance 4.029441 9986
12944 56152 2007 comedy 3.523864 3520
24111 2348 1986 drama 3.548119 2525
19788 2291 1990 romance 3.722712 25093
23469 54997 2007 drama 3.740131 6561
28341 61132 2008 action 3.373531 6128
60396 7836 1970 musical 3.898075 1143
16844 45722 2006 fantasy 3.473871 15079
37357 519 1993 thriller 2.243944 5821
32349 2110 1982 crime 3.278630 2672
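Answer 0 (score: 3)
You can do this with a couple of dplyr functions. Group by year and then by genre, and summarise by taking the mean of the ratings weighted by the number of votes (weighted.mean does this, so there is no need to compute it by hand).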
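Fun fact: dplyr::group_by works hierarchically. If you group by several variables, run summarise on the grouped data frame, and then apply another operation such as mutate, the summarise step peels off the last grouping level. That is why I grouped by year before genre: I don't have to redo any grouping. When I call top_n (which filters rows within each remaining group), it picks the top row within each year only. This is very handy for summaries like this one.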
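The peeling behaviour described above can be checked on a tiny data frame. This is only a sketch: the `toy` data and its values are invented for illustration, and it assumes dplyr's default behaviour in which summarise() drops the last grouping variable.

```r
library(dplyr)

# Hypothetical toy data, not taken from the question
toy <- data.frame(
  year  = c(2000, 2000, 2000, 2001, 2001),
  Genre = c("drama", "drama", "comedy", "drama", "comedy"),
  rate  = c(3.0, 4.5, 3.5, 2.5, 4.5)
)

by_year_genre <- toy %>% group_by(year, Genre)
group_vars(by_year_genre)   # grouped by both year and Genre

# summarise() collapses each (year, Genre) group to one row and
# drops the last grouping level, so the result is grouped by year only
means <- by_year_genre %>% summarise(mean_rate = mean(rate))
group_vars(means)

# A per-group filter therefore works within each year
best <- means %>% filter(mean_rate == max(mean_rate))
best
```

Had the grouping been written as group_by(Genre, year), the summarised result would stay grouped by Genre instead, and the final step would return the best year per genre rather than the best genre per year.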
As a check, unique(df$year) shows that you have 43 unique years, and the summarised data frame has 43 rows.
library(tidyverse)
df %>%
group_by(year, Genre) %>%
summarise(mean_rating = weighted.mean(rate, w = voteCount)) %>%
top_n(1, mean_rating)
#> # A tibble: 43 x 3
#> # Groups: year [43]
#> year Genre mean_rating
#> <int> <chr> <dbl>
#> 1 1938 thriller 4.06
#> 2 1955 musical 3.74
#> 3 1958 comedy 3.66
#> 4 1968 adventure 3.57
#> 5 1970 musical 3.90
#> 6 1972 comedy 3.86
#> 7 1974 western 3.86
#> 8 1976 action 3.54
#> 9 1979 romance 4.03
#> 10 1980 comedy 3.79
#> # ... with 33 more rows
Created on 2018-07-10 by the reprex package (v0.2.0).
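In more recent dplyr releases, top_n() is documented as superseded in favour of slice_max(). Below is a minimal sketch of the same pipeline written that way. It assumes dplyr 1.0.0 or later, and the small `toy_movies` data frame is invented here as a stand-in for the real data.

```r
library(dplyr)

# Hypothetical stand-in for the question's data
toy_movies <- data.frame(
  year      = c(1990, 1990, 1990, 1991, 1991),
  Genre     = c("drama", "drama", "comedy", "drama", "comedy"),
  rate      = c(3.0, 4.0, 3.8, 2.5, 4.5),
  voteCount = c(100, 300, 200, 50, 150)
)

best_by_year <- toy_movies %>%
  group_by(year, Genre) %>%
  summarise(mean_rating = weighted.mean(rate, w = voteCount),
            .groups = "drop_last") %>%   # stay grouped by year only
  slice_max(mean_rating, n = 1)          # ties are kept by default

best_by_year
```

Setting .groups = "drop_last" only makes the default grouping behaviour explicit and silences the informational message that summarise() prints otherwise.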
Answer 1 (score: 2)
Updated with help from @camille's answer
Too late, you have already posted the updated data. Here is what I get with my simulated data.
library(dplyr)
set.seed(1)
df <- data.frame(
genres = rep(c("Adventure", "Fiction", "Comedy"), 4),
year = rep(c(1990, 1991), 6),
rate = rnorm(12),
count = floor(runif(12) * 10)
)
df %>%
group_by(year, genres) %>%
summarise(mean_rate = sum(rate * count) / sum(count)) %>%
top_n(1, mean_rate)
# A tibble: 2 x 3
# Groups: year [2]
year genres mean_rate
<dbl> <fctr> <dbl>
1 1990 Fiction 0.9206445
2 1991 Adventure 1.1201135
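Note that the mean in your example is computed incorrectly. For example, take rate = {1, 2} and count = {10, 20}. Your code computes mean(c(1 * 10, 2 * 20)) = (10 + 40) / 2 = 25, but the weighted mean is (1 * 10 + 2 * 20) / (10 + 20) = 50 / 30 ≈ 1.67.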
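The arithmetic in this note can be reproduced in base R. The variable names below are chosen just for this illustration.

```r
rate  <- c(1, 2)
count <- c(10, 20)

# Plain mean of the products: (10 + 40) / 2
unweighted <- mean(rate * count)

# Weighted mean: (10 + 40) / (10 + 20)
by_hand  <- sum(rate * count) / sum(count)
built_in <- weighted.mean(rate, w = count)

unweighted                     # 25
by_hand                        # 1.666667
all.equal(by_hand, built_in)   # TRUE
```

The by-hand formula and weighted.mean() give the same value, so the two answers in this thread compute the same quantity.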
Edit
Using your data:
movies %>%
mutate(Genre = as.factor(Genre)) %>%
group_by(year, Genre) %>%
summarise(mean_rate = sum(rate * voteCount) / sum(voteCount)) %>%
top_n(1, mean_rate)
# A tibble: 43 x 3
# Groups: year [43]
year Genre mean_rate
<dbl> <fctr> <dbl>
1 1938 thriller 4.061708
2 1955 musical 3.744101
3 1958 comedy 3.663376
4 1968 adventure 3.573637
5 1970 musical 3.898075
6 1972 comedy 3.863470
7 1974 western 3.858894
8 1976 action 3.535299
9 1979 romance 4.029441
10 1980 comedy 3.787693
# ... with 33 more rows