Question

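I'm trying to find a better, faster way to build a summary-statistics table made up of weighted averages. Using dplyr summarise and then bind_rows, I end up with a table like the one below. These numbers are simple means: the mean of each factor is computed for each group.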

Data frame: au.scores

         AU    AUDIT     CORC      GOV      PPS     TMSC    TRAIN
1 Group1 2.833333 2.000000 2.733333 2.000000 1.750000 2.333333
2 Group2 2.833333 0.000000 2.733333 2.000000 1.750000 2.333333
3 Group3 1.833333 2.533333 2.466667 2.000000 2.500000 2.166667
4 Group4 3.000000 2.733333 2.200000 2.666667 1.583333 2.666667
5 Group5 2.625000 1.816667 2.533333 2.166667 1.895833 2.375000

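After this, I need to come up with a weighted score that combines elements of each variable across Groups 1, 2, 3, 4 and 5, i.e. an overall score. Overall Group1 is Group1 + Group4 + Group5, Group2 is Group2 + Group4 + Group5, and Group3 is Group3 + Group4 + Group5.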

group1.overall <- data.frame(
  group1.gov = (au.scores[3, 4] * .30) * .33 + (au.scores[1, 4] * .30) * .33 +
    (au.scores[2, 4] * .30) * .33,
  group1.corc = (au.scores[3, 3] * .30) * .33 + (au.scores[1, 3] * .1) * .33 +
    (au.scores[2, 3] * .1) * .33,
  group1.tmsc = (au.scores[3, 6] * .30) * .33 + (au.scores[1, 6] * .30) * .33 +
    (au.scores[2, 6] * .30) * .33,
  group1.audit = (au.scores[3, 2] * .30) * .33 + (au.scores[1, 2] * .30) * .33 +
    (au.scores[2, 2] * .30) * .33,
  group1.pps = (au.scores[3, 5] * .30) * .33 + (au.scores[1, 5] * .30) * .33 +
    (au.scores[2, 5] * .30) * .33,
  group1.train = (au.scores[3, 7] * .30) * .33 + (au.scores[1, 7] * .30) * .33 +
    (au.scores[2, 7] * .30) * .33
)

which produces

  group1.gov group1.corc group1.tmsc group1.audit group1.pps group1.train
1  0.7854   0.3168    0.594    0.7425   0.594    0.6765

Question: is there a quicker way to create a data.frame of the overall scores?

Something like:

Group_Num / Gov / Corc / Tmsc / Audit / PPS / Train / Overall
Group1 / 0.78 / 0.31 / 0.59 / 0.74 / 0.59 / 0.67 / <- sum these 
Group2 / 0.66 / 0.23 / 0.44 / 0.66 / 0.22 / 0.43 / <- sum these
Group3 / 0.12 / 0.55 / 0.22 / 0.33 / 0.11 / 0.55 / <- sum these

etc.

Answer 1

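"Overall. Group1 is Group1 + Group4 + Group5, Group2 is Group2 + Group4 + Group5 and Group3 is Group3 + Group4 + Group5 factors."

The description of how the overall score is computed differs from the formula for group1.overall, which uses Group1 <- Group1 + Group2 + Group3. In the approach below I follow the description. You can adjust it if necessary: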

library(dplyr); library(tidyr); library(tibble)

# read in au.scores data frame
au.scores <- read.table(text = "AU    AUDIT     CORC      GOV      PPS     TMSC    TRAIN
Group1 2.833333 2.000000 2.733333 2.000000 1.750000 2.333333
Group2 2.833333 0.000000 2.733333 2.000000 1.750000 2.333333
Group3 1.833333 2.533333 2.466667 2.000000 2.500000 2.166667
Group4 3.000000 2.733333 2.200000 2.666667 1.583333 2.666667
Group5 2.625000 1.816667 2.533333 2.166667 1.895833 2.375000", header = TRUE)

# create table of weights (these are dummy weights since there's insufficient details in the question)
weight.table <- tribble(
  ~AU, ~GOV, ~CORC, ~TMSC, ~AUDIT, ~PPS, ~TRAIN,
  "Group1",.30,.10,.30,.30,.30,.30,
  "Group2",.30,.10,.30,.30,.30,.30,
  "Group3",.30,.10,.30,.30,.30,.30,
  "Group4",.30,.30,.30,.30,.30,.30,
  "Group5",.30,.10,.30,.30,.30,.30
)

# reorder columns in au.scores to match the order of columns in weight.table
# (select() reorders columns; arrange() would sort rows instead)
au.scores <- au.scores %>% select(AU, GOV, CORC, TMSC, AUDIT, PPS, TRAIN)

# calculate weighted scores
au.scores.weighted <- au.scores[,-1] * weight.table[,-1]
au.scores.weighted$AU <- au.scores$AU

# calculate scores for each group
au.scores.weighted <- au.scores.weighted %>%
  gather(category, weighted.score, -AU) %>%
  group_by(category) %>%
  arrange(AU) %>%
  summarise(group1 = weighted.mean(weighted.score, c(1,0,0,1,1)) * 3 * 0.33,
            group2 = weighted.mean(weighted.score, c(0,1,0,1,1)) * 3 * 0.33,
            group3 = weighted.mean(weighted.score, c(0,0,1,1,1)) * 3 * 0.33) %>%
  ungroup()

# rearrange result & calculate overall sum for each group
au.scores.weighted <- au.scores.weighted %>%
  gather(group, score, -category) %>%
  spread(category, score) %>%
  select(group, GOV, CORC, TMSC, AUDIT, PPS, TRAIN) %>%
  mutate(Overall = GOV + CORC + TMSC + AUDIT + PPS + TRAIN)

# A tibble: 3 × 8
   group       GOV    CORC      TMSC    AUDIT       PPS     TRAIN  Overall
   <chr>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl>     <dbl>    <dbl>
1 group1 0.7391999 0.39655 0.5176874 0.837375 0.6765001 0.7301250 3.897437
2 group2 0.7391999 0.33055 0.5176874 0.837375 0.6765001 0.7301250 3.831437
3 group3 0.7128000 0.41415 0.5919374 0.738375 0.6765001 0.7136251 3.847388
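
As a cross-check on the table above, the same numbers can be reproduced in base R with a matrix product. This is only a sketch, not part of the original answer: the names `scores`, `weights`, `membership` and `result` are made up for illustration, and the weights are the same dummy values as in weight.table.

```r
# read in the au.scores data frame from the question
au.scores <- read.table(text = "AU    AUDIT     CORC      GOV      PPS     TMSC    TRAIN
Group1 2.833333 2.000000 2.733333 2.000000 1.750000 2.333333
Group2 2.833333 0.000000 2.733333 2.000000 1.750000 2.333333
Group3 1.833333 2.533333 2.466667 2.000000 2.500000 2.166667
Group4 3.000000 2.733333 2.200000 2.666667 1.583333 2.666667
Group5 2.625000 1.816667 2.533333 2.166667 1.895833 2.375000", header = TRUE)

cats <- c("GOV", "CORC", "TMSC", "AUDIT", "PPS", "TRAIN")

# 5 x 6 numeric matrix of scores, one row per source group
scores <- as.matrix(au.scores[, cats])
rownames(scores) <- as.character(au.scores$AU)

# dummy weights (assumption: same values as weight.table above)
weights <- matrix(0.30, nrow = 5, ncol = 6,
                  dimnames = list(rownames(scores), cats))
weights[c("Group1", "Group2", "Group3", "Group5"), "CORC"] <- 0.10

# membership matrix: each row marks which source groups feed an overall group
membership <- rbind(group1 = c(1, 0, 0, 1, 1),
                    group2 = c(0, 1, 0, 1, 1),
                    group3 = c(0, 0, 1, 1, 1))

# sum the weighted scores of the member groups, then apply the 0.33 factor
overall <- (membership %*% (scores * weights)) * 0.33

result <- data.frame(group = rownames(overall), overall,
                     Overall = rowSums(overall), row.names = NULL)
print(result)
```

Adding another overall group then only means adding a row to `membership`, with no change to the rest of the code.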

Edit to add an explanation of the code, in response to OP's question:

"What is the significance of the vector order in the summarise function? c(1,0,0,1,1)) * 3 * 0.33, c(0,1,0,1,1)) * 3 * 0.33 and c(0,0,1,1,1))?"

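The preceding steps have already arranged the groups in order within each category, so using the weights c(1, 0, 0, 1, 1) in the weighted.mean function is equivalent to taking the mean of groups 1, 4 & 5 without using groups 2 & 3 at all. Ditto c(0,1,0,1,1) = mean of groups 2, 4 & 5, and c(0,0,1,1,1) = mean of groups 3, 4 & 5. I find this easier to read and error-check than specifying each group manually, which can quickly bury the group numbers in a wall of text.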
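
A minimal illustration of that equivalence, using the GOV column from the question. The variable names `gov`, `picked` and `explicit` are just for this sketch:

```r
# GOV scores for Group1..Group5, copied from the au.scores table
gov <- c(2.733333, 2.733333, 2.466667, 2.200000, 2.533333)

# 0/1 weights pick out groups 1, 4 and 5; groups 2 and 3 contribute nothing
picked <- weighted.mean(gov, c(1, 0, 0, 1, 1))

# the same mean, with the groups indexed explicitly
explicit <- mean(gov[c(1, 4, 5)])

all.equal(picked, explicit)  # TRUE
```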

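The resulting mean equals (sum of groups) / 3, or (sum of groups) * 0.3333... in decimal, since 1/3 is a repeating decimal. Because your original formula uses (sum of groups) * 0.33 (rounded to 2 decimal places), multiplying the mean by 3 * 0.33 yields the same result. If you would prefer a more precise result, you can omit the * 3 * 0.33 part entirely.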
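
To make the arithmetic concrete, here is a small sketch for the GOV category of overall group 1, assuming the dummy weight of 0.30. The names `gov`, `w`, `rounded` and `exact` are illustrative only:

```r
# GOV scores for Group1, Group4 and Group5
gov <- c(2.733333, 2.200000, 2.533333)
w   <- 0.30  # dummy weight, as in weight.table

# mean * 3 undoes the division by 3, so this equals sum(gov * w) * 0.33
rounded <- mean(gov * w) * 3 * 0.33

# dropping "* 3 * 0.33" uses the exact factor 1/3 instead of 0.33
exact <- mean(gov * w)

all.equal(rounded, sum(gov * w) * 0.33)  # TRUE
round(rounded, 4)                        # 0.7392, as in the GOV column above
```

The exact version comes out slightly larger, because 1/3 is a little more than 0.33.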
