Get the maximum value of a column by group in a data frame

Time: 2018-07-10 21:30:28

Tags: r dataframe dplyr

I have a data frame that looks like this:

> head(All_data,10) %>% select(movieId, year, Genre, rate, voteCount)
   movieId year     Genre     rate voteCount
6     2019 1954 adventure 4.255074     13994
14     908 1959 adventure 4.205228     19013
17    5618 2001 adventure 4.202589     20855
20    6016 2002 adventure 4.187873     19947
22    3030 1961 adventure 4.180265      4155
35    1136 1975 adventure 4.155154     39058
43    1196 1980 adventure 4.142536     61672
48    1198 1981 adventure 4.134614     59693
51     260 1977 adventure 4.132299     77045
53    1197 1987 adventure 4.131257     40931

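The data has about 10,000 rows. I want to compute the mean rating per year for each genre, then build a data frame that holds the maximum mean rating for each year together with the genre that achieved it.

In short, I want to find the highest-rated genre for each year. Is there a simple way to do this?

I tried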

All_data %>% group_by(year,Genre) %>% summarise(meanRate = mean(rate*voteCount))

but it gives the same value for every genre (it does not compute the mean for each (year, Genre) pair).

A random sample of the data:

> sample_n(All_data, 100) %>% select(movieId, year, Genre, rate, voteCount)
      movieId year       Genre     rate voteCount
2258   122924 2016   adventure 3.326963      1936
11556   92535 2011      comedy 3.967404      1718
42542    2184 1955     mystery 3.633638      1534
14411    2735 1986      comedy 2.881314      4276
35587    4383 2000    thriller 3.464252      1049
56900   50068 2006         war 3.818182      2992
567     68358 2009   adventure 3.872353     14873
35755    2455 1986    thriller 3.403843     12178
27656   49278 2006      action 3.603452      4780
23671    3100 1992       drama 3.680055      6548
28728  103772 2013      action 3.232092      2443
60892    3549 1955     musical 3.744101      1780
37701    1200 1986      horror 3.993474     32943
11819    7084 1972      comedy 3.863470      1095
32883    7439 2004       crime 3.046360      2308
31031   81788 2010       crime 3.720724      1409
21026     140 1996     romance 3.309959      5633
23375   48696 2006       drama 3.769120      1739
24306    4713 1980       drama 3.483529      1700
24242    3179 1999       drama 3.504162      1682
14238     135 1996      comedy 2.997654      7674
12484    6296 2003      comedy 3.665834      3401
31289   69122 2009       crime 3.639332     14731
24694   52975 2007       drama 3.337893      1974
12463    2580 1999      comedy 3.671448      9376
13311   63859 2008      comedy 3.396336      3111
25156    7318 2004       drama 3.163142      4303
46419  122892 2015      sci-fi 3.594980      5817
12492    3516 1958      comedy 3.663376      1013
39730   91974 2012      horror 3.314297      1217
14129    3525 1984      comedy 3.066811      2073
64248    3671 1974     western 3.858894     12356
35710  104243 2013    thriller 3.418432      1747
18948   60069 2008     romance 4.008884     24538
25547   47044 2006       drama 2.941546      1591
2504        2 1995   adventure 3.236953     26060
28750     653 1996      action 3.219821     16609
12083    2791 1980      comedy 3.787693     21907
31017    2952 1996       crime 3.723631      1972
23673    5464 2002       drama 3.679694     10081
23342    4211 1990       drama 3.777429      1276
25157    5107 2002       drama 3.162754      1278
20391      85 1995     romance 3.538339      2830
23783   53318 2006       drama 3.645732      1441
28674   74685 2010      action 3.256219      1005
23899    5380 2002       drama 3.607206      2262
40237    1334 1958      horror 3.110867      2954
13634    2407 1985      comedy 3.277061     12472
29650    3316 2000      action 2.667995      2634
25578    8870 2004       drama 2.921805      1573
10292    5159 1992    children 3.106359      1321
1699   118696 2014   adventure 3.530134      3949
27669    7827 2002      action 3.601378      1016
62710    1064 1996     musical 3.104600      3609
20002     281 1994     romance 3.658513      4088
14254     276 1994      comedy 2.982415      5715
60734    4642 2000     musical 3.794605      2836
24772    2001 1989       drama 3.309075     15582
17491   30793 2005     fantasy 3.233414     11606
21221   33679 2005     romance 3.238069     11776
28778   93363 2012      action 3.208293      2074
21687    3452 2000     romance 3.022416      5041
2889     8974 2004   adventure 3.069362      1175
31825   61024 2008       crime 3.471492      4981
9623     2414 1985    children 3.373310      2810
14423    1432 1997      comedy 2.869686      1435
1559     2857 1968   adventure 3.573637      5174
35832      45 1995    thriller 3.372716      9632
21926    1100 1990     romance 2.858228      6253
1080    38038 2005   adventure 3.720714      5711
28526    2013 1972      action 3.305879      5919
27891    3726 1976      action 3.535299      1119
14645    6338 2003      comedy 2.694928      1380
22610    3090 1987       drama 4.059865      1186
4117   152081 2016   animation 3.950761      5849
38952   80831 2010      horror 3.589114      1038
13973  135887 2015      comedy 3.135554      1741
41040    1971 1988      horror 2.415489      1982
29432    4310 2001      action 2.844663     11327
25271   37727 2005       drama 3.104432      2437
27406   62849 2008      action 3.682334      2468
21589    3269 1992     romance 3.078978      3387
33840    2208 1938    thriller 4.061708      2447
24077    2321 1998       drama 3.559651     15666
53741    2693 1997 documentary 3.642332      1904
44218    2505 1999     mystery 2.975070      4653
26508   69524 1989      action 3.976781      1249
12916   61323 2008      comedy 3.534853      7087
9479    62376 2008    children 3.425710      1198
29333  135567 2016      action 2.916914      1011
18910    1244 1979     romance 4.029441      9986
12944   56152 2007      comedy 3.523864      3520
24111    2348 1986       drama 3.548119      2525
19788    2291 1990     romance 3.722712     25093
23469   54997 2007       drama 3.740131      6561
28341   61132 2008      action 3.373531      6128
60396    7836 1970     musical 3.898075      1143
16844   45722 2006     fantasy 3.473871     15079
37357     519 1993    thriller 2.243944      5821
32349    2110 1982       crime 3.278630      2672

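2 answers:

Answer 0 (score: 3)

You can do this with a couple of dplyr functions. Group by year and then by genre, and summarise with the mean of the ratings weighted by vote count (weighted.mean does this for you, so there is no need to compute it by hand).

Fun fact: dplyr::group_by works hierarchically. If you group by several variables and run a summarise on the grouped data frame, the result drops the last grouping variable, so any following operation (a mutate or a filter, for example) runs one grouping level up. That is why I grouped by year before genre: I don't have to redo any grouping. When I call top_n, which filters within the groups that remain, it keeps the top row per year only. This is very convenient for this kind of summary.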
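The grouping behavior described above can be checked on a small example. This is a minimal sketch on made-up data: the `toy` data frame and its values are assumptions for illustration, not the asker's All_data. It assumes dplyr 1.0.0 or later, where the `.groups` argument lets you state the default "drop the last grouping variable" behavior explicitly.

```r
library(dplyr)

# Hypothetical toy data (not the asker's All_data)
toy <- data.frame(
  year      = c(2000, 2000, 2000, 2001, 2001),
  Genre     = c("drama", "drama", "comedy", "drama", "comedy"),
  rate      = c(4.0, 3.0, 3.5, 2.0, 4.5),
  voteCount = c(10, 30, 20, 5, 15),
  stringsAsFactors = FALSE
)

grouped <- toy %>% group_by(year, Genre)
group_vars(grouped)     # grouped by both year and Genre

# .groups = "drop_last" spells out what summarise() does by default here
summarised <- grouped %>%
  summarise(mean_rating = weighted.mean(rate, w = voteCount),
            .groups = "drop_last")
group_vars(summarised)  # only year is left as a grouping variable
```

Because the summarised result is still grouped by year, any per-group step that follows picks rows within each year.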

As a check, unique(df$year) shows that you have 43 unique years, and this summarised data frame has 43 rows.

library(tidyverse)

df %>%
  group_by(year, Genre) %>%
  summarise(mean_rating = weighted.mean(rate, w = voteCount)) %>%
  top_n(1, mean_rating)
#> # A tibble: 43 x 3
#> # Groups:   year [43]
#>     year Genre     mean_rating
#>    <int> <chr>           <dbl>
#>  1  1938 thriller         4.06
#>  2  1955 musical          3.74
#>  3  1958 comedy           3.66
#>  4  1968 adventure        3.57
#>  5  1970 musical          3.90
#>  6  1972 comedy           3.86
#>  7  1974 western          3.86
#>  8  1976 action           3.54
#>  9  1979 romance          4.03
#> 10  1980 comedy           3.79
#> # ... with 33 more rows

Created on 2018-07-10 by the reprex package (v0.2.0).
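On newer dplyr versions (1.0.0 and later), top_n is superseded by slice_max. A possible equivalent is sketched below on made-up data; the `toy` data frame and the `top_genre` name are assumptions for illustration. Setting `with_ties = FALSE` is a design choice that guarantees exactly one row per year; leave it at the default if tied genres should all be kept.

```r
library(dplyr)

# Hypothetical toy data (not the asker's All_data)
toy <- data.frame(
  year      = c(2000, 2000, 2000, 2001, 2001),
  Genre     = c("drama", "drama", "comedy", "drama", "comedy"),
  rate      = c(4.0, 3.0, 3.5, 2.0, 4.5),
  voteCount = c(10, 30, 20, 5, 15),
  stringsAsFactors = FALSE
)

top_genre <- toy %>%
  group_by(year, Genre) %>%
  # vote-weighted mean rating per (year, Genre); result stays grouped by year
  summarise(mean_rating = weighted.mean(rate, w = voteCount),
            .groups = "drop_last") %>%
  # keep the single best genre within each year
  slice_max(mean_rating, n = 1, with_ties = FALSE) %>%
  ungroup()

top_genre
```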

Answer 1 (score: 2)

Updated with help from @camille's answer

I was too late: you had already posted updated data. Here is what I got with my mock data.

set.seed(1)
df <- data.frame(
  genres = rep(c("Adventure", "Fiction", "Comedy"), 4),
  year = rep(c(1990, 1991), 6),
  rate = rnorm(12),
  count = floor(runif(12) * 10)
)


df %>% 
  group_by(year, genres) %>%
  summarise(mean_rate = sum(rate * count) / sum(count)) %>%
  top_n(1, mean_rate)

# A tibble: 2 x 3
# Groups:   year [2]
   year    genres mean_rate
  <dbl>    <fctr>     <dbl>
1  1990   Fiction 0.9206445
2  1991 Adventure 1.1201135

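Note that the mean in your example is computed incorrectly. For example, take rate = {1, 2} and count = {10, 20}. Your code computes mean(c(1 * 10, 2 * 20)) = 25, whereas the vote-weighted mean is (1 * 10 + 2 * 20) / (10 + 20) = 50/30.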
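The arithmetic above can be reproduced directly in base R. The variable names below are chosen for illustration only. The last line shows that weighted.mean, used in the other answer, gives the same value as the sum(rate * count) / sum(count) form.

```r
rate  <- c(1, 2)
count <- c(10, 20)

# Plain mean of the products c(10, 40): not normalised by the vote counts
naive <- mean(rate * count)

# Vote-weighted mean: sum of products divided by the total number of votes
manual <- sum(rate * count) / sum(count)

# Same quantity via the base R helper
builtin <- weighted.mean(rate, w = count)

naive    # 25
manual   # 50 / 30, about 1.667
builtin  # same as manual
```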

Edit

Using your data.

movies %>% 
  mutate(Genre = as.factor(Genre)) %>%
  group_by(year, Genre) %>%
  summarise(mean_rate = sum(rate * voteCount) / sum(voteCount)) %>%
  top_n(1, mean_rate)

# A tibble: 43 x 3
# Groups:   year [43]
    year     Genre mean_rate
   <dbl>    <fctr>     <dbl>
 1  1938  thriller  4.061708
 2  1955   musical  3.744101
 3  1958    comedy  3.663376
 4  1968 adventure  3.573637
 5  1970   musical  3.898075
 6  1972    comedy  3.863470
 7  1974   western  3.858894
 8  1976    action  3.535299
 9  1979   romance  4.029441
10  1980    comedy  3.787693
# ... with 33 more rows