Averaging a data frame based on the current row's value and the preceding rows

Time: 2017-04-06 09:28:23

Tags: r dplyr

I have a simple data set in the following format:

df<- data.frame(c(10, 10, 10,  10,  10,  10,  10,  10, 10, 10, 10, 10, 10, 10, 10, 10, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20),   
                c(80,  80,  80,  80,  80,  80,  80,  80, 90, 90, 90, 90, 90, 90, 90, 90, 80,  80,  80,  80,  80,  80,  80,  80, 90, 90, 90, 90, 90, 90, 90, 90),
                c(1,    1,   2,   2,   3,   3,   4,   4,     1,   1,    2, 2,   3,    3,   4,   4,  1,    1,   2,   2,   3,   3,   4,   4,     1,   1,    2, 2,   3,    3,   4,   4),
                c(25,   75,  20,  40,  60,  50,  20,  10,  20,  30,  40,  60, 25,   75,  20,  40, 5,   5,  2,  4,  6,  5,  2,  1,  2,  3,  4,  6, 2,   7,  2,  4))

colnames(df)<-c("car_number", "year", "marker", "val")

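What I want to do is actually quite simple: per car_number, I want the average of the values associated with a marker value and the 3 marker values that precede it (continuing across years). So for the sample data above, the output I want is: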

car=10, year=80 1: 50
car=10, year=80 2: 40
car=10, year=80 3: 45
car=10, year=80 4: 37.5

car=10, year=90 1: 31.25
car=10, year=90 2: 36.25
car=10, year=90 3: 35
car=10, year=90 4: 38.75


car=20, year=80 1: 5
car=20, year=80 2: 4
car=20, year=80 3: 4.5
car=20, year=80 4: 3.75

car=20, year=90 1: 3.125
car=20, year=90 2: 3.625
car=20, year=90 3: 3.375
car=20, year=90 4: 3.750
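
To make the rule concrete, here is a minimal base R check of one expected value. It is only a sketch: the numbers are read off the first sample data set for car 10, and the variable names `current`, `previous`, and `pooled` are made up for illustration. It pools the rows for year 90, marker 1 with the rows of the three preceding markers (year 80, markers 4, 3, and 2).

```r
# Rows for car_number 10 in the first sample data set
current  <- c(20, 30)   # year 90, marker 1
previous <- c(20, 10,   # year 80, marker 4
              60, 50,   # year 80, marker 3
              20, 40)   # year 80, marker 2

# Pooled mean over the current marker and the three preceding ones
pooled <- mean(c(current, previous))
pooled
```

This should reproduce the 31.25 listed for car=10, year=90, marker 1.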

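Note that the markers above come in pairs only to keep the example simple. That is not the case in the real data, so I assume a general solution will involve some kind of group_by(?).

Any efficient solution is welcome!

Here is a second sample data set and its output: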

df<- data.frame(c(10, 10, 10,  10,  10,  10,  10,  10, 10, 10, 10, 10, 10, 10, 10, 10, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20),   
                c(80,  80,  80,  80,  80,  80,  80,  80, 90, 90, 90, 90, 90, 90, 90, 90, 80,  80,  80,  80,  80,  80,  80,  80, 90, 90, 90, 90, 90, 90, 90, 90),
                c(1,    2,   2,   2,   3,   3,   4,   4,     1,   1,    2, 2,   3,    3,   3,   4,  1,    1,   1,   2,   3,   3,   4,   4,     4,   1,    2, 2,   3,    3,   3,   4),
                c(25,   75,  20,  40,  60,  50,  20,  10,  20,  30,  40,  60, 25,   75,  20,  40, 5,   5,  2,  4,  6,  5,  2,  1,  2,  3,  4,  6, 2,   7,  2,  4))

colnames(df)<-c("car_number", "year", "marker", "val")

The output is (based on the rule above):

car=10, year=80 1: Mean[{25}]                                  = 25
car=10, year=80 2: Mean[{40, 20, 75, 25}]                      = 40
car=10, year=80 3: Mean[{50, 60, 40, 20, 75, 25}]              = 45
car=10, year=80 4: Mean[{10, 20, 50, 60, 40, 20, 75, 25}]      = 37.5

car=10, year=90 1: Mean[{30, 20, 10, 20, 50, 60, 40, 20, 75}] = 36.11
car=10, year=90 2: Mean[{60, 40, 30, 20, 10, 20, 50, 60}]     = 36.25
car=10, year=90 3: Mean[{20, 75, 25, 60, 40, 30, 20, 10, 20}] = 33.33
car=10, year=90 4: Mean[{40, 20, 75, 25, 60, 40, 30, 20}]     = 38.75

car=20, year=80 1: Mean[{2, 5, 5}]                   = 4
car=20, year=80 2: Mean[{4, 2, 5, 5}]                = 4
car=20, year=80 3: Mean[{5, 6, 4, 2, 5, 5}]          = 4.5
car=20, year=80 4: Mean[{2, 1, 2, 5, 6, 4, 2, 5, 5}] = 3.55

car=20, year=90 1: Mean[{3, 2, 1, 2, 5, 6, 4}]       = 3.29
car=20, year=90 2: Mean[{6, 4, 3, 2, 1, 2, 5, 6}]    = 3.625
car=20, year=90 3: Mean[{2, 7, 2, 6, 4, 3, 2, 1, 2}] = 3.22
car=20, year=90 4: Mean[{4, 2, 7, 2, 6, 4, 3}]       = 4

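5 answers:

Answer 0: (score: 2)

The first group_by, on car_number, year, and marker, computes the mean of each group and keeps its weight (the number of rows).
The second group_by, on car_number, lets us retrieve the lagging means and weights to compute the desired mean: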

library(purrr)
library(dplyr)
df %>% 
  arrange(car_number, year, marker) %>% 
  group_by(car_number, year, marker) %>% 
  summarise(mean_1 = mean(val, na.rm = TRUE), weight = n()) %>% 
  group_by(car_number) %>% 
  mutate(mean_2 = pmap_dbl(
    list(mean_1, lag(mean_1), lag(mean_1, 2), lag(mean_1, 3),
         weight, lag(weight), lag(weight, 2), lag(weight, 3)),
    ~ weighted.mean(c(..1, ..2, ..3, ..4),
                    c(..5, ..6, ..7, ..8),
                    na.rm = TRUE)
  )) %>%
  ungroup()

Result:

# # A tibble: 16 × 6
#    car_number  year marker mean_1 weight mean_2
#         <dbl> <dbl>  <dbl>  <dbl>  <int>  <dbl>
# 1          10    80      1   50.0      2 50.000
# 2          10    80      2   30.0      2 40.000
# 3          10    80      3   55.0      2 45.000
# 4          10    80      4   15.0      2 37.500
# 5          10    90      1   25.0      2 31.250
# 6          10    90      2   50.0      2 36.250
# 7          10    90      3   50.0      2 35.000
# 8          10    90      4   30.0      2 38.750
# 9          20    80      1    5.0      2  5.000
# 10         20    80      2    3.0      2  4.000
# 11         20    80      3    5.5      2  4.500
# 12         20    80      4    1.5      2  3.750
# 13         20    90      1    2.5      2  3.125
# 14         20    90      2    5.0      2  3.625
# 15         20    90      3    4.5      2  3.375
# 16         20    90      4    3.0      2  3.750
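
The reason for carrying the weights along can be sketched in a few lines of base R. This is an illustration only, using the rows for car 10, year 80, markers 1 and 2 from the second sample data set; the variable names are made up here. Weighting each group mean by its row count gives the same value as averaging all rows at once, whereas a plain mean of group means does not when the groups differ in size.

```r
# Two marker groups of different sizes (second sample data set, car 10, year 80)
a <- c(25)           # marker 1: one row
b <- c(75, 20, 40)   # marker 2: three rows

group_means   <- c(mean(a), mean(b))
group_weights <- c(length(a), length(b))

weighted   <- weighted.mean(group_means, group_weights)  # weights = row counts
pooled     <- mean(c(a, b))                              # mean over all rows
unweighted <- mean(group_means)                          # ignores group sizes

c(weighted = weighted, pooled = pooled, unweighted = unweighted)
```

In the first sample data set every group has two rows, so the weighted and unweighted versions coincide there.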

Edit: alternative syntax for purrr versions before 0.2.2.9000:

df %>% 
  arrange(car_number, year, marker) %>% 
  group_by(car_number, year, marker) %>% 
  summarise(mean_1 = mean(val, na.rm = TRUE), weight = n()) %>% 
  group_by(car_number) %>% 
  mutate(mean_2 = pmap_dbl(
    list(mean_1, lag(mean_1), lag(mean_1, 2), lag(mean_1, 3),
         weight, lag(weight), lag(weight, 2), lag(weight, 3)),
    function(a, b, c, d, e, f, g, h)
      weighted.mean(c(a, b, c, d),
                    c(e, f, g, h),
                    na.rm = TRUE)
  )) %>%
  ungroup()

Answer 1: (score: 2)

Here is one approach with data.table, adapted from Frank's suggestion in David Arenburg's answer here.

library(data.table)

# aggregate data by car_number, year, and marker
dfNew <- setDT(df)[, .(val=mean(val)), by=.(car_number, year, marker)]
# calculate average of the current and previous three values
dfNew[, val := rowMeans(dfNew[,shift(val, 0:3), by=car_number][, -1], na.rm=TRUE)]

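The first line is a standard aggregation call. The second line makes a few changes to the rowMeans method from the linked answer. rowMeans is fed a data.table of shifted values, where the shifting happens by car_number (thanks to sotos for catching this). That call is chained to a statement that drops the first column (using -1), which is the car_number column returned by the first part of the chain.

This returns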

   car_number year marker    val
 1:         10   80      1 50.000
 2:         10   80      2 40.000
 3:         10   80      3 45.000
 4:         10   80      4 37.500
 5:         10   90      1 31.250
 6:         10   90      2 36.250
 7:         10   90      3 35.000
 8:         10   90      4 38.750
 9:         20   80      1  5.000
10:         20   80      2  4.000
11:         20   80      3  4.500
12:         20   80      4  3.750
13:         20   90      1  3.125
14:         20   90      2  3.625
15:         20   90      3  3.375
16:         20   90      4  3.750
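
What the shift-and-rowMeans step computes can be sketched without data.table. The helper `lag_by` below is a hypothetical base R stand-in for `shift(x, k)`, not part of any package, and `val` holds the per-marker means for car_number 10 from the first sample data set.

```r
# Hypothetical stand-in for data.table::shift(x, k): lag x by k positions
lag_by <- function(x, k) c(rep(NA_real_, k), head(x, length(x) - k))

# Per-marker means for car_number 10 (first sample data set)
val <- c(50, 30, 55, 15, 25, 50, 50, 30)

# One column per lag 0..3; the earliest rows are padded with NA
lagged <- sapply(0:3, function(k) lag_by(val, k))

# Row-wise mean over whatever values are available
rolling <- rowMeans(lagged, na.rm = TRUE)
rolling
```

Note that this averages the group means without weights, so it matches the pooled mean over rows only when every marker group has the same number of rows, as in the first sample data set.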

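Answer 2: (score: 2)

Just to throw a base R solution into the mix. We can build a custom function from Reduce with accumulate = TRUE together with tail(x, 4), so that only the current value and the previous 3 are included. All of this happens after averaging the data by car_number, year, and marker, i.e.

f1 <- function(x){
    # build the running prefixes of x, then average the last 4 values of each
    sapply(Reduce(c, x, accumulate = TRUE), function(i) mean(tail(i,4)))
}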

dd <- aggregate(val ~ car_number+year+marker, df, mean)
dd <- dd[order(dd$car_number, dd$year, dd$marker),]
dd$new_avg <- with(dd, ave(val, car_number, FUN = f1))

dd
#   car_number year marker  val new_avg
#1          10   80      1 50.0  50.000
#5          10   80      2 30.0  40.000
#9          10   80      3 55.0  45.000
#13         10   80      4 15.0  37.500
#3          10   90      1 25.0  31.250
#7          10   90      2 50.0  36.250
#11         10   90      3 50.0  35.000
#15         10   90      4 30.0  38.750
#2          20   80      1  5.0   5.000
#6          20   80      2  3.0   4.000
#10         20   80      3  5.5   4.500
#14         20   80      4  1.5   3.750
#4          20   90      1  2.5   3.125
#8          20   90      2  5.0   3.625
#12         20   90      3  4.5   3.375
#16         20   90      4  3.0   3.750
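
The intermediate objects inside f1 may be easier to follow on a short vector. The following is a small sketch; `x` holds the first five per-marker means for car_number 10, and the names `prefixes`, `windows`, and `means` are made up for illustration.

```r
x <- c(50, 30, 55, 15, 25)

# Each list element is the prefix of x up to that position
prefixes <- Reduce(c, x, accumulate = TRUE)

# tail(p, 4) keeps at most the last four values of each prefix
windows <- lapply(prefixes, tail, 4)

# Mean of each window: the current value plus up to three earlier ones
means <- sapply(windows, mean)
means
```

Because every prefix is materialised, memory use grows quadratically with the length of the vector, which may matter for long series.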

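Answer 3: (score: 0)

Taking df as your input, you can use dplyr and zoo.

Grouping on car_number only for the rolling step, you can try:

library(dplyr)
library(zoo)

df %>%
  group_by(car_number, year, marker) %>%
  summarise(mm = mean(val)) %>%
  group_by(car_number) %>%
  # window width: marker for the first three rows (assumes markers 1, 2, 3 there), then 4
  mutate(rM=rollapply(mm, if_else(row_number() < 4, marker, 4), FUN=mean, align="right"))%>%
  select(year, rM)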

which gives:

Source: local data frame [16 x 3]
Groups: car_number [2]

   car_number  year     rM
        <dbl> <dbl>  <dbl>
1          10    80 50.000
2          10    80 40.000
3          10    80 45.000
4          10    80 37.500
5          10    90 31.250
6          10    90 36.250
7          10    90 35.000
8          10    90 38.750
9          20    80  5.000
10         20    80  4.000
11         20    80  4.500
12         20    80  3.750
13         20    90  3.125
14         20    90  3.625
15         20    90  3.375
16         20    90  3.750

Answer 4: (score: 0)

You can do it like this:

df %>%
  group_by(car_number, year, marker) %>%
  summarise(s = sum(val), w = n()) %>% # sum and number of values
  group_by(car_number) %>%
  mutate(S = cumsum(s) - cumsum(lag(s, 4, default=0))) %>% # sum of last four s
  mutate(W = cumsum(w) - cumsum(lag(w, 4, default=0))) %>% # same for the weights
  mutate(result = S/W)

Output for the second example:

# Source: local data frame [16 x 8]
# Groups: car_number [2]
# 
#    car_number  year marker     s     w     S     W    result
#         <dbl> <dbl>  <dbl> <dbl> <int> <dbl> <int>     <dbl>
# 1          10    80      1    25     1    25     1 25.000000
# 2          10    80      2   135     3   160     4 40.000000
# 3          10    80      3   110     2   270     6 45.000000
# 4          10    80      4    30     2   300     8 37.500000
# 5          10    90      1    50     2   325     9 36.111111
# 6          10    90      2   100     2   290     8 36.250000
# 7          10    90      3   120     3   300     9 33.333333
# 8          10    90      4    40     1   310     8 38.750000
# 9          20    80      1    12     3    12     3  4.000000
# 10         20    80      2     4     1    16     4  4.000000
# 11         20    80      3    11     2    27     6  4.500000
# 12         20    80      4     5     3    32     9  3.555556
# 13         20    90      1     3     1    23     7  3.285714
# 14         20    90      2    10     2    29     8  3.625000
# 15         20    90      3    11     3    29     9  3.222222
# 16         20    90      4     4     1    28     7  4.000000
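
The cumsum difference used for S and W can be checked in base R on the s column for car_number 10 shown above. This is only a sketch: `lag4` is a hand-rolled equivalent of `lag(s, 4, default = 0)` written here so the example needs no packages.

```r
# Per-marker sums for car_number 10 (column s in the output above)
s <- c(25, 135, 110, 30, 50, 100, 120, 40)

# Hand-rolled equivalent of dplyr::lag(s, 4, default = 0)
lag4 <- c(rep(0, 4), head(s, length(s) - 4))

# cumsum(s) is the total up to each row; subtracting the total up to four rows
# earlier leaves the sum of the current row and the three rows before it
S <- cumsum(s) - cumsum(lag4)
S
```

The same construction applied to w gives W, and S / W is the pooled mean.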

Edit: using the RcppRoll package would probably be more efficient; you could try S = roll_sum(c(0, 0, 0, s), 4) (and likewise for W).