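The data frame df1 records the dates on which each person was seen. I want to add a column giving, for each record, the proportion of days on which the person was seen out of the days elapsed since that person was first seen.
For example: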
df1 <- data.frame(
  ID = c("Peter", "Anna", "Sophie", "Peter", "Anna", "Sophie", "Peter", "Anna", "Sophie", "Peter", "Anna", "Sophie", "Peter", "Anna", "Sophie"),
  Date = c("2016-08-20", "2016-08-20", "2016-08-23", "2016-08-21", "2016-08-23", "2016-08-24", "2016-08-23", "2016-08-23", "2016-08-25", "2016-08-27", "2016-08-28", "2016-08-26", "2016-08-27", "2016-08-29", "2016-08-30"))
df1$Date <- as.Date(df1$Date, format = "%Y-%m-%d")
df1
ID Date
1 Peter 2016-08-20
2 Anna 2016-08-20
3 Sophie 2016-08-23
4 Peter 2016-08-21
5 Anna 2016-08-23
6 Sophie 2016-08-24
7 Peter 2016-08-23
8 Anna 2016-08-23
9 Sophie 2016-08-25
10 Peter 2016-08-27
11 Anna 2016-08-28
12 Sophie 2016-08-26
13 Peter 2016-08-27
14 Anna 2016-08-29
15 Sophie 2016-08-30
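Important: the date of the first sighting is different for each person.
This is what I expect (I did the calculation by hand, so it may contain some errors):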
> df1
ID Date Prop_days_seen
1 Peter 2016-08-20 1.00 # 1/1 (First time will always be 1)
2 Anna 2016-08-20 1.00 # 1/1 (First time will always be 1)
3 Sophie 2016-08-23 1.00 # 1/1 (First time will always be 1)
4 Peter 2016-08-21 1.00 # 2/2
5 Anna 2016-08-23 0.50 # 2/4 (two days seen out of 4 days that she could have been seen)
6 Sophie 2016-08-24 1.00 # 2/2 (two days seen out of 2 days she could have been seen)
7 Peter 2016-08-23 0.75 # 3/4
8 Anna 2016-08-23 0.50 # So on...
9 Sophie 2016-08-25 1.00
10 Peter 2016-08-27 0.50
11 Anna 2016-08-28 0.33
12 Sophie 2016-08-26 1.00
13 Peter 2016-08-27 0.50
14 Anna 2016-08-29 0.40
15 Sophie 2016-08-30 0.62
Does anyone know how to do this in R?
Answer 0 (score: 1)
One option is
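library(zoo)  # provides na.locf()
df1$Prop_days_seen <- round(unsplit(lapply(split(df1$Date, df1$ID), function(x) {
  # i1: days elapsed since the first sighting, counting the first day as 1
  i1 <- cumsum(c(1, as.integer(diff(x))))
  # i2: TRUE for the first record on each distinct day
  i2 <- !duplicated(i1)
  v1 <- numeric(length(x))
  v1[!i2] <- NA
  # distinct days seen so far divided by days elapsed
  v1[i2] <- seq_along(x[i2]) / i1[i2]
  # repeated records on the same day inherit that day's value
  na.locf(v1)
}), df1$ID), 2)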
df1
# ID Date Prop_days_seen
#1 Peter 2016-08-20 1.00
#2 Anna 2016-08-20 1.00
#3 Sophie 2016-08-23 1.00
#4 Peter 2016-08-21 1.00
#5 Anna 2016-08-23 0.50
#6 Sophie 2016-08-24 1.00
#7 Peter 2016-08-23 0.75
#8 Anna 2016-08-23 0.50
#9 Sophie 2016-08-25 1.00
#10 Peter 2016-08-27 0.50
#11 Anna 2016-08-28 0.33
#12 Sophie 2016-08-26 1.00
#13 Peter 2016-08-27 0.50
#14 Anna 2016-08-29 0.40
#15 Sophie 2016-08-30 0.62
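To make the intermediate vectors concrete, here is a small trace of the same steps for Anna's dates only. It is a sketch for illustration: the names x, i1 and i2 follow the function above, and the dates are assumed to be in ascending order, as they are in df1.

```r
# Anna's dates, in the order they appear in df1
x <- as.Date(c("2016-08-20", "2016-08-23", "2016-08-23",
               "2016-08-28", "2016-08-29"))

# Days elapsed since the first sighting, counting the first day as 1
i1 <- cumsum(c(1, as.integer(diff(x))))
i1
# 1 4 4 9 10

# TRUE for the first record on each distinct day
i2 <- !duplicated(i1)
i2
# TRUE TRUE FALSE TRUE TRUE

# Distinct days seen so far divided by days elapsed (one value per distinct day)
ratio <- seq_along(x[i2]) / i1[i2]
round(ratio, 2)
# 1.00 0.50 0.33 0.40
```

The third record repeats 2016-08-23, so it gets NA in v1 and na.locf() carries the preceding 0.50 forward.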
It can also be made more compact
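library(dplyr)
df1 %>%
  group_by(ID) %>%
  # n1: days elapsed since the first sighting, counting the first day as 1
  mutate(n1 = cumsum(c(1, as.integer(diff(Date)))),
         # distinct days seen so far divided by days elapsed
         Prop_days_seen = cumsum(!duplicated(n1)) / n1) %>%
  select(-n1)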
# A tibble: 15 x 3
# Groups: ID [3]
# ID Date Prop_days_seen
# <fct> <date> <dbl>
# 1 Peter 2016-08-20 1
# 2 Anna 2016-08-20 1
# 3 Sophie 2016-08-23 1
# 4 Peter 2016-08-21 1
# 5 Anna 2016-08-23 0.5
# 6 Sophie 2016-08-24 1
# 7 Peter 2016-08-23 0.75
# 8 Anna 2016-08-23 0.5
# 9 Sophie 2016-08-25 1
#10 Peter 2016-08-27 0.5
#11 Anna 2016-08-28 0.333
#12 Sophie 2016-08-26 1
#13 Peter 2016-08-27 0.5
#14 Anna 2016-08-29 0.4
#15 Sophie 2016-08-30 0.625
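Both versions use diff(), so they rely on the dates being in ascending order within each ID, which holds for df1. If that cannot be guaranteed, one possible base R sketch that does not depend on row order is shown below. The helper name prop_days_seen is hypothetical and not part of the answer above; it counts the distinct dates up to each record instead of using diff().

```r
# Rebuild the example data so the sketch runs on its own
df1 <- data.frame(
  ID = rep(c("Peter", "Anna", "Sophie"), times = 5),
  Date = as.Date(c("2016-08-20", "2016-08-20", "2016-08-23",
                   "2016-08-21", "2016-08-23", "2016-08-24",
                   "2016-08-23", "2016-08-23", "2016-08-25",
                   "2016-08-27", "2016-08-28", "2016-08-26",
                   "2016-08-27", "2016-08-29", "2016-08-30"))
)

# Hypothetical helper: d holds one person's dates as day numbers
prop_days_seen <- function(d) {
  # Days elapsed since the first sighting, counting the first day as 1
  elapsed <- d - min(d) + 1
  # Number of distinct days on or before each record
  seen <- vapply(d, function(di) length(unique(d[d <= di])), integer(1))
  seen / elapsed
}

# ave() applies the helper per ID and returns values in the original row order
df1$Prop_days_seen <- ave(as.numeric(df1$Date), df1$ID, FUN = prop_days_seen)
df1
```

On the example data this gives the same unrounded values as the dplyr version; wrap the result in round(..., 2) to match the first output.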