I have a table:
ID Dates Rates
1 2010-01-01 0
1 2010-01-02 0
1 2010-01-03 2
1 2010-01-04 2
1 2010-01-05 2
1 2010-01-06 1
1 2010-01-07 0
1 2010-01-08 0
1 2010-01-09 0
1 2010-01-10 0
2 2010-01-01 3
2 2010-01-02 3
2 2010-01-03 2
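I want to compute, in RStudio, a new column named Median_Rates that holds the median of every 5 consecutive rows. The table should look like this: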
ID Dates Rates Median_Rates
1 2010-01-01 0 2
1 2010-01-02 0 2
1 2010-01-03 2 2
1 2010-01-04 2 2
1 2010-01-05 2 2
1 2010-01-06 1 0
1 2010-01-07 0 0
1 2010-01-08 0 0
1 2010-01-09 0 0
1 2010-01-10 0 0
2 2010-01-01 3 3
2 2010-01-02 3 3
2 2010-01-03 2 3
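How can I then apply this to all IDs in a dataset with more than 1 million rows?
I want to compute, by group (ID), the median of Rates over each run of 5 consecutive rows (e.g. this position +/- 5 rows) and use it as the value of Median_Rates.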
Answer 0 (score: 1)
The function ave is made for exactly this.
I borrowed the idea from the accepted answer to this question, changing tapply to ave and sum to median.
data$Median_Rates <- ave(data$Rates, (seq_along(data$Rates)-1) %/% 5, FUN = median)
data
# ID Dates Rates Median_Rates
#1 1 2010-01-01 0 2
#2 2 2010-01-02 0 2
#3 3 2010-01-03 2 2
#4 4 2010-01-04 2 2
#5 5 2010-01-05 2 2
#6 5 2010-01-06 1 0
#7 7 2010-01-07 0 0
#8 8 2010-01-08 0 0
#9 9 2010-01-09 0 0
#10 10 2010-01-10 0 0
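To see why this works, it may help to look at the grouping vector on its own. The sketch below uses only the Rates values from the example; the names rates, block and block_median are illustrative and not part of the answer.

```r
# Rates from the 10-row example
rates <- c(0L, 0L, 2L, 2L, 2L, 1L, 0L, 0L, 0L, 0L)

# Zero-based row positions, integer-divided by 5, label each block of 5 rows
block <- (seq_along(rates) - 1) %/% 5
block
# 0 0 0 0 0 1 1 1 1 1

# ave() applies FUN within each block and returns a vector as long as the input
block_median <- ave(rates, block, FUN = median)
block_median
# 2 2 2 2 2 0 0 0 0 0
```

Because ave returns one value per input row, the result can be assigned straight back as a new column.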
Data
data <-
structure(list(ID = c(1L, 2L, 3L, 4L, 5L, 5L, 7L, 8L, 9L, 10L
), Dates = structure(1:10, .Label = c("2010-01-01", "2010-01-02",
"2010-01-03", "2010-01-04", "2010-01-05", "2010-01-06", "2010-01-07",
"2010-01-08", "2010-01-09", "2010-01-10"), class = "factor"),
Rates = c(0L, 0L, 2L, 2L, 2L, 1L, 0L, 0L, 0L, 0L)), .Names = c("ID",
"Dates", "Rates"), class = "data.frame", row.names = c(NA, -10L
))
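Edit
With the new dataset, all that is needed is to pass the column ID as an additional grouping variable in the call to ave.
I will call this new dataset data2.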
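# Number the rows within each ID so the 5-row blocks restart at every new ID
idx <- ave(seq_along(data2$ID), data2$ID, FUN = seq_along)
data2$Median_Rates <- ave(data2$Rates, data2$ID, (idx - 1) %/% 5, FUN = median)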
data2
# ID Dates Rates Median_Rates
#1 1 2010-01-01 0 2
#2 1 2010-01-02 0 2
#3 1 2010-01-03 2 2
#4 1 2010-01-04 2 2
#5 1 2010-01-05 2 2
#6 1 2010-01-06 1 0
#7 1 2010-01-07 0 0
#8 1 2010-01-08 0 0
#9 1 2010-01-09 0 0
#10 1 2010-01-10 0 0
#11 2 2010-01-01 3 3
#12 2 2010-01-02 3 3
#13 2 2010-01-03 2 3
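One thing worth checking on real data: when the number of rows per ID is not a multiple of 5, it matters whether the block index is built from the row position within each ID or from the row position in the whole data frame. A minimal sketch, assuming a made-up data frame toy (hypothetical, not from the question) in which ID 1 has 7 rows and ID 2 has 6:

```r
# Hypothetical data: 7 rows for ID 1, 6 rows for ID 2
toy <- data.frame(ID = rep(c(1L, 2L), times = c(7L, 6L)),
                  Rates = c(1, 2, 3, 4, 5, 6, 7, 10, 20, 30, 40, 50, 60))

# Row position within each ID: 1..7 for ID 1, then 1..6 for ID 2
idx_within <- ave(seq_along(toy$ID), toy$ID, FUN = seq_along)
# Row position over the whole data frame: 1..13
idx_global <- seq_along(toy$ID)

toy$Median_within <- ave(toy$Rates, toy$ID, (idx_within - 1) %/% 5, FUN = median)
toy$Median_global <- ave(toy$Rates, toy$ID, (idx_global - 1) %/% 5, FUN = median)
toy
# For ID 2, Median_within is 30 30 30 30 30 60 (blocks of 5 and 1 rows),
# while Median_global is 20 20 20 50 50 50 (blocks of 3 and 3 rows).
```

With the within-ID index, every ID starts a fresh block of 5, which matches the grouping asked for in the question.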
New data
data2 <-
structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L), Dates = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L,
8L, 9L, 10L, 1L, 2L, 3L), .Label = c("2010-01-01", "2010-01-02",
"2010-01-03", "2010-01-04", "2010-01-05", "2010-01-06", "2010-01-07",
"2010-01-08", "2010-01-09", "2010-01-10"), class = "factor"),
Rates = c(0L, 0L, 2L, 2L, 2L, 1L, 0L, 0L, 0L, 0L, 3L, 3L,
2L)), .Names = c("ID", "Dates", "Rates"), class = "data.frame", row.names = c(NA,
-13L))
Answer 1 (score: 1)
A dplyr-based solution, with Dates converted to class Date using lubridate, can be written as:
library(dplyr)
library(lubridate)
df %>% mutate(Dates = ymd(Dates)) %>%
group_by(ID) %>%
arrange(Dates) %>%
mutate(Group = (row_number()-1) %/% 5 ) %>%
group_by(ID, Group) %>%
mutate(Median_Rates = median(Rates)) %>%
ungroup() %>%
arrange(ID) %>%
select(-Group) %>% as.data.frame()
# ID Dates Rates Median_Rates
# 1 1 2010-01-01 0 2
# 2 1 2010-01-02 0 2
# 3 1 2010-01-03 2 2
# 4 1 2010-01-04 2 2
# 5 1 2010-01-05 2 2
# 6 1 2010-01-06 1 0
# 7 1 2010-01-07 0 0
# 8 1 2010-01-08 0 0
# 9 1 2010-01-09 0 0
# 10 1 2010-01-10 0 0
# 11 2 2010-01-01 3 3
# 12 2 2010-01-02 3 3
# 13 2 2010-01-03 2 3