Aggregating multiple duplicates and calculating their mean

Time: 2018-03-30 22:02:53

Tags: r dplyr aggregate

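Suppose we have a data frame, DF, in which the same UserID is repeated across rows but with different names, and the names themselves can also be repeated.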

DF <- data.frame(ID=c(101,101,101,101,101,102,102,102,102),
Name=c("Ed","Ed","Hank","Hank","Hank","Sandy","Sandy","Jessica","Jessica"),
Class=c("Junior","Junior","Junior","Junior", "Junior","High","High","Mid","Mid"),
Scoring=c(11,15,18,18,12,20,22,25,26), Other_Scores=c(15,9,34,23,43,23,34,23,23))

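The goal is to aggregate by UserID and Name, and to compute the mean and standard deviation for each combination. An example of the ideal output: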

UserID  Name     Class    Scoring_mean  Scoring_std
101     Ed       Junior   12.5          3
101     Hank     Junior   24.67         11.62
102     Sandy    High     24.75         6.29
102     Jessica  Mid      24.25         1.5
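
As a quick check of where these figures come from: the row for Ed appears to pool both score columns (Scoring and Other_Scores) rather than use Scoring alone. A minimal base R sketch, assuming that reading of the table (the variable name `ed_scores` is only illustrative):

```r
# Ed's rows in DF: Scoring = 11, 15 and Other_Scores = 15, 9
ed_scores <- c(11, 15, 15, 9)

ed_mean <- mean(ed_scores)  # 12.5
ed_sd <- sd(ed_scores)      # 3 (sample standard deviation, n - 1 denominator)

print(c(mean = ed_mean, sd = ed_sd))
```

Using Scoring alone would instead give a mean of 13 for Ed, so the table only matches if the two columns are combined.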

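So my question is:

  • What options are there for aggregating names by UserID without losing information (for example, Hank being coerced into Ed, as happens with summarize() or mutate())?

The way I see it, R has to check which Name corresponds to each UserID and, where they match, aggregate the rows and compute the mean and standard deviation. However, I could not get this to work in R with dplyr.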

I also could not find a solution in other posts that are somewhat related to this problem, such as:

  1. How to calculate the mean of specific rows in R?
  2. Subtract pairs of columns based on matching column
  3. Calculating mean when 2 conditions need met in R
  4. average between duplicated rows in R

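2 answers:

Answer 0 (score: 1)

How about computing the summary statistics and then joining the result back onto the initial data frame? Something like this: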

DF <- data.frame(ID=c(101,101,101,101,101,102,102,102,102),
                 Name=c("Ed","Ed","Hank","Hank","Hank","Sandy","Sandy","Jessica","Jessica"),
                 Class=c("Junior","Junior","Junior","Junior", "Junior","High","High","Mid","Mid"),
                 Scoring=c(11,15,18,18,12,20,22,25,26), Other_Scores=c(15,9,34,23,43,23,34,23,23))


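library(dplyr)  # provides %>%, group_by(), summarise() and left_join()

# Summarise Scoring per Name, then join ID and Class back on from DF
DF2 <- DF %>% group_by(Name) %>%
  summarise(scoring_mean=mean(Scoring), scoring_sd = sd(Scoring)) %>%
  left_join(DF[,c(1,2,3)], by="Name")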

This gives:

# A tibble: 9 x 5
  Name    scoring_mean scoring_sd    ID Class 
  <fct>          <dbl>      <dbl> <dbl> <fct> 
1 Ed              13.0      2.83   101. Junior
2 Ed              13.0      2.83   101. Junior
3 Hank            16.0      3.46   101. Junior
4 Hank            16.0      3.46   101. Junior
5 Hank            16.0      3.46   101. Junior
6 Jessica         25.5      0.707  102. Mid   
7 Jessica         25.5      0.707  102. Mid   
8 Sandy           21.0      1.41   102. High  
9 Sandy           21.0      1.41   102. High 
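
The join brings back one row per original record, so each name is repeated in the result. If one row per ID/Name/Class combination is wanted, one possibility is to group on all three columns before summarising. A minimal sketch, assuming dplyr is installed and using only the Scoring column as in this answer (the name `DF_summary` is illustrative, not from the original post):

```r
library(dplyr)

DF <- data.frame(
  ID = c(101, 101, 101, 101, 101, 102, 102, 102, 102),
  Name = c("Ed", "Ed", "Hank", "Hank", "Hank", "Sandy", "Sandy", "Jessica", "Jessica"),
  Class = c("Junior", "Junior", "Junior", "Junior", "Junior", "High", "High", "Mid", "Mid"),
  Scoring = c(11, 15, 18, 18, 12, 20, 22, 25, 26),
  Other_Scores = c(15, 9, 34, 23, 43, 23, 34, 23, 23)
)

# Grouping on ID, Name and Class keeps those columns in the result,
# so no join back to DF is needed and each combination appears once.
DF_summary <- DF %>%
  group_by(ID, Name, Class) %>%
  summarise(scoring_mean = mean(Scoring),
            scoring_sd = sd(Scoring)) %>%
  ungroup()

print(DF_summary)
```

This should return four rows (Ed, Hank, Jessica, Sandy) with the same per-name statistics as shown in the joined output.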

Answer 1 (score: 0)

Here is a tidyverse option that uses some reshaping to create a single column of scores, then some grouping to get the summary statistics:

DF <- data.frame(
ID=c(101,101,101,101,101,102,102,102,102),
Name=c("Ed","Ed","Hank","Hank","Hank","Sandy","Sandy","Jessica","Jessica"),
Class=c("Junior","Junior","Junior","Junior", "Junior","High","High","Mid","Mid"),
Scoring=c(11,15,18,18,12,20,22,25,26), 
Other_Scores=c(15,9,34,23,43,23,34,23,23)
)

library(tidyverse)

DF %>%
  gather(score_type, score, Scoring, Other_Scores) %>%  # reshape score columns
  group_by(ID, Name, Class) %>%                         # group by combinations
  summarise(scoring_mean = mean(score),                 # get summary stats
            scoring_sd = sd(score)) %>%
  ungroup()                                             # forget the grouping

# # A tibble: 4 x 5
#       ID Name    Class  scoring_mean scoring_sd
#    <dbl> <fct>   <fct>         <dbl>      <dbl>
# 1  101. Ed      Junior         12.5       3.00
# 2  101. Hank    Junior         24.7      11.6 
# 3  102. Jessica Mid            24.2       1.50
# 4  102. Sandy   High           24.8       6.29
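
In more recent tidyr releases, gather() is documented as superseded by pivot_longer(). The same reshape could be sketched as follows, assuming a tidyr version that provides pivot_longer() (the result name `DF_long_summary` is illustrative, and DF is redefined so the block is self-contained):

```r
library(dplyr)
library(tidyr)

DF <- data.frame(
  ID = c(101, 101, 101, 101, 101, 102, 102, 102, 102),
  Name = c("Ed", "Ed", "Hank", "Hank", "Hank", "Sandy", "Sandy", "Jessica", "Jessica"),
  Class = c("Junior", "Junior", "Junior", "Junior", "Junior", "High", "High", "Mid", "Mid"),
  Scoring = c(11, 15, 18, 18, 12, 20, 22, 25, 26),
  Other_Scores = c(15, 9, 34, 23, 43, 23, 34, 23, 23)
)

DF_long_summary <- DF %>%
  # stack Scoring and Other_Scores into a single `score` column
  pivot_longer(cols = c(Scoring, Other_Scores),
               names_to = "score_type",
               values_to = "score") %>%
  group_by(ID, Name, Class) %>%           # one group per ID/Name/Class
  summarise(scoring_mean = mean(score),   # pooled mean of both score columns
            scoring_sd = sd(score)) %>%
  ungroup()

print(DF_long_summary)
```

The grouping and summary steps are unchanged from the answer; only the reshaping call differs.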