How do I add a column in R holding the median of every 5 consecutive rows?

Asked: 2018-04-12 18:13:28

Tags: r

I have a table:

ID   Dates        Rates
1  2010-01-01       0
1  2010-01-02       0
1  2010-01-03       2
1  2010-01-04       2
1  2010-01-05       2
1  2010-01-06       1
1  2010-01-07       0
1  2010-01-08       0
1  2010-01-09       0
1  2010-01-10       0
2  2010-01-01       3
2  2010-01-02       3
2  2010-01-03       2

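I want to compute a new column named "Median_Rates" in RStudio that shows the median of every 5 consecutive rows. The table should look like this: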

ID   Dates       Rates   Median_Rates
1    2010-01-01   0        2
1    2010-01-02   0        2
1    2010-01-03   2        2
1    2010-01-04   2        2
1    2010-01-05   2        2
1    2010-01-06   1        0
1    2010-01-07   0        0
1    2010-01-08   0        0
1    2010-01-09   0        0
1    2010-01-10   0        0
2    2010-01-01   3        3
2    2010-01-02   3        3
2    2010-01-03   2        3

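How can I then apply this to every ID in a dataset with more than 1 million rows?

I want to compute, by group (ID), the median of Rates for each run of 5 consecutive rows (e.g. this position +/- 5 rows) and use it as the value of Median_Rates.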

2 answers:

Answer 0 (score: 1)

The function ave is made for this. I borrowed the idea from the accepted answer to this question, changing tapply to ave and sum to median.

data$Median_Rates <- ave(data$Rates, (seq_along(data$Rates)-1) %/% 5, FUN = median)
data
#   ID      Dates Rates Median_Rates
#1   1 2010-01-01     0            2
#2   2 2010-01-02     0            2
#3   3 2010-01-03     2            2
#4   4 2010-01-04     2            2
#5   5 2010-01-05     2            2
#6   5 2010-01-06     1            0
#7   7 2010-01-07     0            0
#8   8 2010-01-08     0            0
#9   9 2010-01-09     0            0
#10 10 2010-01-10     0            0
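
To make the grouping trick easier to follow, here is a minimal sketch of what the expression (seq_along(x) - 1) %/% 5 produces and how tapply and ave differ. The vector x simply repeats the Rates values from the data above; the names block, per_block and per_row are chosen for illustration only.

```r
x <- c(0, 0, 2, 2, 2, 1, 0, 0, 0, 0)   # Rates values from the example

# Zero-based row position, integer-divided by 5, gives a block id:
# rows 1-5 -> 0, rows 6-10 -> 1
block <- (seq_along(x) - 1) %/% 5
block
# 0 0 0 0 0 1 1 1 1 1

# tapply returns one median per block ...
per_block <- tapply(x, block, median)

# ... while ave repeats each block's median on every row of that block,
# which is why it can be assigned directly as a new column
per_row <- ave(x, block, FUN = median)
per_row
# 2 2 2 2 2 0 0 0 0 0
```

Because ave returns a vector as long as its input, no merge step is needed afterwards.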

Data

data <-
structure(list(ID = c(1L, 2L, 3L, 4L, 5L, 5L, 7L, 8L, 9L, 10L
), Dates = structure(1:10, .Label = c("2010-01-01", "2010-01-02", 
"2010-01-03", "2010-01-04", "2010-01-05", "2010-01-06", "2010-01-07", 
"2010-01-08", "2010-01-09", "2010-01-10"), class = "factor"), 
    Rates = c(0L, 0L, 2L, 2L, 2L, 1L, 0L, 0L, 0L, 0L)), .Names = c("ID", 
"Dates", "Rates"), class = "data.frame", row.names = c(NA, -10L
))

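Edit
With the new dataset, all that is needed is to pass the column ID to ave as an additional grouping variable.
I will call this new dataset data2.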

data2$Median_Rates <- ave(data2$Rates, data2$ID, (seq_along(data2$Rates)-1) %/% 5, FUN = median)
data2
#   ID      Dates Rates Median_Rates
#1   1 2010-01-01     0            2
#2   1 2010-01-02     0            2
#3   1 2010-01-03     2            2
#4   1 2010-01-04     2            2
#5   1 2010-01-05     2            2
#6   1 2010-01-06     1            0
#7   1 2010-01-07     0            0
#8   1 2010-01-08     0            0
#9   1 2010-01-09     0            0
#10  1 2010-01-10     0            0
#11  2 2010-01-01     3            3
#12  2 2010-01-02     3            3
#13  2 2010-01-03     2            3

New data

data2 <-
structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 2L), Dates = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 
8L, 9L, 10L, 1L, 2L, 3L), .Label = c("2010-01-01", "2010-01-02", 
"2010-01-03", "2010-01-04", "2010-01-05", "2010-01-06", "2010-01-07", 
"2010-01-08", "2010-01-09", "2010-01-10"), class = "factor"), 
    Rates = c(0L, 0L, 2L, 2L, 2L, 1L, 0L, 0L, 0L, 0L, 3L, 3L, 
    2L)), .Names = c("ID", "Dates", "Rates"), class = "data.frame", row.names = c(NA, 
-13L))
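
One caveat with the ave call on data2: seq_along(data2$Rates) counts rows across the whole data frame, so the blocks line up with the ID boundaries only when every earlier ID has a multiple of 5 rows (as ID 1 does here, with 10). A possible variant, sketched below, restarts the row counter inside each ID first. The data frame toy and the names pos and chunk are hypothetical, made up for this illustration.

```r
# Hypothetical data: ID 1 has 7 rows, so a global row counter would cut
# the blocks of ID 2 in the wrong place.
toy <- data.frame(
  ID    = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L),
  Rates = c(0, 0, 2, 2, 2, 1, 5, 3, 3, 2, 9, 9, 9)
)

# Row position restarted at 1 within each ID
pos <- ave(seq_along(toy$ID), toy$ID, FUN = seq_along)

# Blocks of 5 rows inside each ID: 0 for rows 1-5, 1 for rows 6-10, ...
chunk <- (pos - 1) %/% 5

toy$Median_Rates <- ave(toy$Rates, toy$ID, chunk, FUN = median)
toy$Median_Rates
# 2 2 2 2 2 3 3 3 3 3 3 3 9
```

With the global counter, rows 11 and 12 would instead be grouped with row 13 and get a median of 9 rather than 3.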

Answer 1 (score: 1)

A dplyr-based solution, using lubridate to convert Dates to the Date class, can be implemented as:

library(dplyr)
library(lubridate)

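# df is the table from the question
df %>% mutate(Dates = ymd(Dates)) %>%           # parse Dates as Date objects
  group_by(ID) %>%
  arrange(Dates) %>%
  mutate(Group = (row_number()-1) %/% 5 ) %>%   # block of 5 rows within each ID
  group_by(ID, Group) %>%
  mutate(Median_Rates = median(Rates)) %>%      # one median per (ID, block)
  ungroup() %>%
  arrange(ID) %>%
  select(-Group) %>% as.data.frame()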

#    ID      Dates Rates Median_Rates
# 1   1 2010-01-01     0            2
# 2   1 2010-01-02     0            2
# 3   1 2010-01-03     2            2
# 4   1 2010-01-04     2            2
# 5   1 2010-01-05     2            2
# 6   1 2010-01-06     1            0
# 7   1 2010-01-07     0            0
# 8   1 2010-01-08     0            0
# 9   1 2010-01-09     0            0
# 10  1 2010-01-10     0            0
# 11  2 2010-01-01     3            3
# 12  2 2010-01-02     3            3
# 13  2 2010-01-03     2            3
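
The question also mentions a window around each position ("+/- 5 rows"), which reads more like a rolling median than fixed blocks. Both answers compute fixed blocks, which is what the expected output shows. If a centered rolling median is what is actually wanted, a minimal base R sketch might look like the following. The helper roll_median and the half-width h = 2 (a 5-row window) are assumptions made for illustration and are not part of either answer.

```r
# Hypothetical helper: centered rolling median with half-width h,
# truncating the window at the start and end of the vector.
roll_median <- function(x, h = 2) {
  x <- as.numeric(x)
  n <- length(x)
  vapply(seq_len(n),
         function(i) median(x[max(1, i - h):min(n, i + h)]),
         numeric(1))
}

id    <- c(rep(1L, 10), rep(2L, 3))
rates <- c(0, 0, 2, 2, 2, 1, 0, 0, 0, 0, 3, 3, 2)

# ave applies the helper separately within each ID,
# so windows never cross from one ID into the next
rolled <- ave(rates, id, FUN = roll_median)
rolled
# 0 1 2 2 2 1 0 0 0 0 3 3 3
```

This sketch favors clarity over speed; it calls median once per row, so it has not been tuned for a dataset with more than 1 million rows.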