Error when using plyr to compute values above the 95th percentile

Time: 2016-02-29 05:42:54

Tags: r subset plyr

My data is structured as follows:

Individ <- data.frame(Participant = c("Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", 
                                      "Harry", "Harry", "Harry", "Harry","Harry", "Harry", "Harry", "Harry", "Paul", "Paul", "Paul", "Paul"),
                      Time = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
                      Condition = c("Placebo", "Placebo", "Placebo", "Placebo", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", 
                                    "Placebo", "Placebo", "Placebo", "Placebo", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr"),
                      Power = c(400, 250, 180, 500, 300, 450, 600, 512, 300, 500, 450, 200, 402, 210, 130, 520, 310, 451, 608, 582, 390, 570, NA, NA))

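Using dplyr, I apply rolling means (over windows of 2 to 4 seconds) with the following code:

library(dplyr)
library(zoo)  # provides rollapply()
for (summaryFunction in c("mean")) {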
  for ( i in seq(2, 4, by = 1)) {
    tempColumn <- Individ %>%
      group_by(Participant) %>%
      transmute(rollapply(Power,
                          width = i, 
                          FUN = summaryFunction, 
                          align = "right", 
                          fill = NA, 
                          na.rm = TRUE))
    colnames(tempColumn)[2] <- paste("Rolling", summaryFunction, as.character(i), sep = ".")
    Individ <- bind_cols(Individ, tempColumn[2])
  }
}

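I now want to find the top 5% of Power values for each Participant within each rolling mean. To calculate this, I use:

Output = ddply(Individ, .(Participant, Condition), summarise, TwoSec <- Rolling.mean.2 > quantile(Rolling.mean.2 , 0.95, na.rm = TRUE))

However, I end up with a column of TRUE/FALSE values. What I am after instead is the actual values in the top 5%. How can I do this? Also, is there a simpler way to iterate over each rolling-mean column, by Participant and Condition, to find the top 5% of each?

Thanks!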

1 Answer:

Answer 0 (score: 1)

Having the rolling means in the data frame is good; it makes the quantiles much easier to compute.
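As for why the `ddply` call in the question returns TRUE/FALSE: a comparison such as `x > quantile(x, 0.95)` produces a logical vector, and it only yields the values themselves once it is used as an index. Here is a minimal base-R sketch of that idea; the vector `power` is made up for illustration and is not taken from the data above.

```r
# Hypothetical example vector; not taken from the data above.
power <- c(325, 215, 340, 400, 375, 525, 556, NA)

# 95th percentile, ignoring missing values.
threshold <- quantile(power, 0.95, na.rm = TRUE)

# The comparison on its own yields TRUE/FALSE (and NA for the missing value) ...
is_top <- power > threshold

# ... while indexing with it returns the actual values; which() drops the NA.
top_values <- power[which(is_top)]
top_values
```

The same indexing step is what the filtering in Step 4 does, one group at a time.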

Step 1: Group by Participant, Condition, and Location

Individ <- data.frame(Participant = c("Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", "Bill", 
                                      "Harry", "Harry", "Harry", "Harry","Harry", "Harry", "Harry", "Harry", "Paul", "Paul", "Paul", "Paul"),
                      Time = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
                      Condition = c("Placebo", "Placebo", "Placebo", "Placebo", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", 
                                    "Placebo", "Placebo", "Placebo", "Placebo", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr", "Expr"),
                      Location = c("Home", "Home", "Home", "Home", "Away", "Away", "Away", "Away", "Home", "Home", "Home", "Home", 
                                   "Home", "Home", "Home", "Home", "Away", "Away", "Away", "Away", "Home", "Home", "Home", "Home"),
                      Power = c(400, 250, 180, 500, 300, 450, 600, 512, 300, 500, 450, 200, 402, 210, 130, 520, 310, 451, 608, 582, 390, 570, NA, NA))


library(dplyr)
library(zoo)
for (summaryFunction in c("mean")) {
  for ( i in seq(2, 4, by = 1)) {
    tempColumn <- Individ %>%
      group_by(Participant) %>%
      transmute(rollapply(Power,
                          width = i, 
                          FUN = summaryFunction, 
                          align = "right", 
                          fill = NA, 
                          na.rm = TRUE))
    colnames(tempColumn)[2] <- paste("Rolling", summaryFunction, as.character(i), sep = ".")
    Individ <- bind_cols(Individ, tempColumn[2])
  }
}


Individ


     Participant  Time Condition Location Power Rolling.mean.2 Rolling.mean.3 Rolling.mean.4
        (fctr) (dbl)    (fctr)   (fctr) (dbl)          (dbl)          (dbl)          (dbl)
1         Bill     1   Placebo     Home   400             NA             NA             NA
2         Bill     2   Placebo     Home   250            325             NA             NA
3         Bill     3   Placebo     Home   180            215       276.6667             NA
4         Bill     4   Placebo     Home   500            340       310.0000          332.5
5         Bill     1      Expr     Away   300            400       326.6667          307.5
6         Bill     2      Expr     Away   450            375       416.6667          357.5
7         Bill     3      Expr     Away   600            525       450.0000          462.5
8         Bill     4      Expr     Away   512            556       520.6667          465.5
9         Bill     1      Expr     Home   300            406       470.6667          465.5
10        Bill     2      Expr     Home   500            400       437.3333          478.0

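Once you have all 7 or 8 columns in the new Individ dataset (this dataset includes Location, which also covers your other question), here is what I did to solve your problem. I am 100% sure there are cleaner, more efficient ways to do this, but the logic is here and it should output correctly.

Step 2: Get the quantiles for each group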

library(plyr)
Individ[is.na(Individ)]<- 0
Top_percentiles <- ddply(Individ, 
                         c("Participant", "Condition", "Location"), 
                         summarise, 
                         Power2 = quantile(Rolling.mean.2, .95),
                         Power3 = quantile(Rolling.mean.3, .95),
                         Power4 = quantile(Rolling.mean.4, .95)
                         )

Top_percentiles

  Participant Condition Location  Power2   Power3  Power4
1        Bill      Expr     Away 551.350 510.0667 465.050
2        Bill      Expr     Home 464.650 465.6667 476.125
3        Bill   Placebo     Home 337.750 305.0000 282.625
4       Harry      Expr     Away 585.175 533.4000 485.425
5       Harry   Placebo     Home 322.150 280.7667 268.175
6        Paul      Expr     Home 556.500 556.5000 408.000

These are the top-5% thresholds (the 95th percentiles) of each rolling mean for every group.
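One caveat about Step 2: replacing NA with 0 pulls the quantiles down in any group that has missing values. An alternative is to leave the NAs in place and pass `na.rm = TRUE` to `quantile()`. The sketch below does that with base R's `aggregate()` and also picks up every rolling column by name, so they do not have to be typed out one by one. The data frame `d` and its values are made up; only the column naming pattern follows the data above.

```r
# Made-up data in the same shape as Individ (values are illustrative only).
d <- data.frame(
  Participant    = rep(c("A", "B"), each = 4),
  Condition      = rep(c("Expr", "Placebo"), times = 4),
  Rolling.mean.2 = c(NA, 10, 20, 30, NA, 5, 15, 25),
  Rolling.mean.3 = c(NA, NA, 12, 24, NA, NA, 8, 16)
)

# Every column whose name starts with "Rolling.mean." counts as a rolling column.
roll_cols <- grep("^Rolling\\.mean\\.", names(d), value = TRUE)

# One 95th percentile per group and per rolling column; NAs are skipped, not zero-filled.
thresholds <- aggregate(
  d[roll_cols],
  by  = d[c("Participant", "Condition")],
  FUN = function(x) quantile(x, 0.95, na.rm = TRUE, names = FALSE)
)
thresholds
```

Adding a window (say a `Rolling.mean.5` column) then needs no change to this code, because the column list is derived from the names.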

Now the only thing left is to find the observations in the dataset that lie above each threshold.

Step 3: Match the quantile columns to the original dataset

Something like this is roughly what I was tinkering with.

# Build one composite key per row and per group, then look each row up once.
# (Chaining match() calls with `&&` does not work here: `&&` is not vectorised, so older
# R versions only looked at the first element and recycled the six thresholds down the
# rows, as in the printout below, while R >= 4.3 raises an error.)
key_rows   <- paste(Individ$Participant, Individ$Condition, Individ$Location, sep = "|")
key_groups <- paste(Top_percentiles$Participant, Top_percentiles$Condition, Top_percentiles$Location, sep = "|")
idx <- match(key_rows, key_groups)

Individ$Power2 <- Top_percentiles$Power2[idx]
Individ$Power3 <- Top_percentiles$Power3[idx]
Individ$Power4 <- Top_percentiles$Power4[idx]


Individ


    Participant  Time Condition Location Power Rolling.mean.2 Rolling.mean.3 Rolling.mean.4  Power2   Power3
        (fctr) (dbl)    (fctr)   (fctr) (dbl)          (dbl)          (dbl)          (dbl)   (dbl)    (dbl)
1         Bill     1   Placebo     Home   400              0         0.0000            0.0 551.350 510.0667
2         Bill     2   Placebo     Home   250            325         0.0000            0.0 464.650 465.6667
3         Bill     3   Placebo     Home   180            215       276.6667            0.0 337.750 305.0000
4         Bill     4   Placebo     Home   500            340       310.0000          332.5 585.175 533.4000
5         Bill     1      Expr     Away   300            400       326.6667          307.5 322.150 280.7667
6         Bill     2      Expr     Away   450            375       416.6667          357.5 556.500 556.5000
7         Bill     3      Expr     Away   600            525       450.0000          462.5 551.350 510.0667
8         Bill     4      Expr     Away   512            556       520.6667          465.5 464.650 465.6667
9         Bill     1      Expr     Home   300            406       470.6667          465.5 337.750 305.0000
10        Bill     2      Expr     Home   500            400       437.3333          478.0 585.175 533.4000

My idea is to match the quantile columns to the Individ dataset.
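The same matching can also be written as a join on the grouping columns, which avoids building the index by hand. This is a minimal base-R sketch using `merge()`; the two small data frames and their values are made up for illustration, and only the column names mirror the ones used above.

```r
# Made-up per-row data and per-group thresholds (illustrative values only).
rows <- data.frame(
  Participant    = c("A", "A", "B", "B"),
  Condition      = c("Expr", "Placebo", "Expr", "Placebo"),
  Rolling.mean.2 = c(20, 30, 15, 25)
)
thresholds <- data.frame(
  Participant = c("A", "A", "B", "B"),
  Condition   = c("Expr", "Placebo", "Expr", "Placebo"),
  Power2      = c(19, 35, 15, 40)
)

# Join on the shared key columns so every row receives its own group's threshold.
joined <- merge(rows, thresholds, by = c("Participant", "Condition"))

# Keep only the rows at or above their group's threshold.
top_rows <- joined[joined$Rolling.mean.2 >= joined$Power2, ]
top_rows
```

Note that `merge()` may reorder the rows, so sort by Participant and Time afterwards if the order matters.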

Step 4: Filter the dataset

This should give you what you want.

Option 1: Three separate datasets

top_percentile_2sec <- Individ %>% filter(Rolling.mean.2 >= Power2)
top_percentile_3sec <- Individ %>% filter(Rolling.mean.3 >= Power3)
top_percentile_4sec <- Individ %>% filter(Rolling.mean.4 >= Power4)

Option 2: One large combined dataset

top_percentile_all_times <- Individ %>% filter(Rolling.mean.2 >= Power2 | Rolling.mean.3 >= Power3 | Rolling.mean.4 >= Power4)


top_percentile_all_times

 Participant  Time Condition Location Power Rolling.mean.2 Rolling.mean.3 Rolling.mean.4 Power2   Power3
       (fctr) (dbl)    (fctr)   (fctr) (dbl)          (dbl)          (dbl)          (dbl)  (dbl)    (dbl)
1        Bill     1      Expr     Away   300          400.0       326.6667         307.50 322.15 280.7667
2        Bill     4      Expr     Away   512          556.0       520.6667         465.50 464.65 465.6667
3        Bill     1      Expr     Home   300          406.0       470.6667         465.50 337.75 305.0000
4        Bill     3      Expr     Home   450          475.0       416.6667         440.50 322.15 280.7667
5       Harry     1      Expr     Away   310          415.0       320.0000         292.50 322.15 280.7667
6       Harry     3      Expr     Away   608          529.5       456.3333         472.25 551.35 510.0667
7       Harry     4      Expr     Away   582          595.0       547.0000         487.75 464.65 465.6667
8        Paul     3      Expr     Home     0          570.0       480.0000           0.00 322.15 280.7667
9        Paul     4      Expr     Home     0            0.0       570.0000         480.00 556.50 556.5000
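If the separate threshold table is not needed, Steps 2 to 4 can be collapsed: base R's `ave()` returns a vector as long as its input, so it can hold each row's group threshold directly. This is a sketch under that assumption, on a made-up data frame `d` grouped by Participant only; the real data would pass Condition and Location to `ave()` as additional grouping arguments.

```r
# Made-up data (illustrative values only).
d <- data.frame(
  Participant    = rep(c("A", "B"), each = 4),
  Rolling.mean.2 = c(NA, 10, 20, 30, NA, 5, 15, 25)
)

# ave() fills every row with the 95th percentile of that row's own group.
d$Threshold2 <- ave(
  d$Rolling.mean.2, d$Participant,
  FUN = function(x) quantile(x, 0.95, na.rm = TRUE, names = FALSE)
)

# Keep rows at or above their group's threshold; which() drops NA comparisons.
top_rows <- d[which(d$Rolling.mean.2 >= d$Threshold2), ]
top_rows
```

With only a handful of observations per group, the 95th percentile sits just below the group maximum, so this typically keeps one row per group.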

The following link helped me a lot.

how to calculate 95th percentile of values with grouping variable in R or Excel

Does this also solve the problem in your other post?