How to add a variable that estimates the proportion of days a person has been seen since they were first seen

Time: 2019-07-07 15:57:27

Tags: r dplyr

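The data frame df1 lists the dates on which each person was seen. I want to create a column giving, for each row, the proportion of days on which that person was seen out of the days elapsed since they were first seen.

For example: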

df1<- data.frame(ID=c("Peter", "Anna", "Sophie", "Peter", "Anna", "Sophie", "Peter", "Anna", "Sophie", "Peter", "Anna", "Sophie", "Peter", "Anna", "Sophie"),
                 Date= c("2016-08-20","2016-08-20","2016-08-23","2016-08-21","2016-08-23","2016-08-24","2016-08-23","2016-08-23","2016-08-25","2016-08-27","2016-08-28","2016-08-26","2016-08-27","2016-08-29","2016-08-30"))
df1$Date<- as.Date(df1$Date, format="%Y-%m-%d")
df1

       ID       Date
1   Peter 2016-08-20
2    Anna 2016-08-20
3  Sophie 2016-08-23
4   Peter 2016-08-21
5    Anna 2016-08-23
6  Sophie 2016-08-24
7   Peter 2016-08-23
8    Anna 2016-08-23
9  Sophie 2016-08-25
10  Peter 2016-08-27
11   Anna 2016-08-28
12 Sophie 2016-08-26
13  Peter 2016-08-27
14   Anna 2016-08-29
15 Sophie 2016-08-30

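Important: the date of the first sighting differs from person to person.

This is what I expect (I did the calculation by hand, so it may contain some errors):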

> df1
       ID       Date Prop_days_seen
1   Peter 2016-08-20           1.00  # 1/1 (First time will always be 1)
2    Anna 2016-08-20           1.00  # 1/1 (First time will always be 1)
3  Sophie 2016-08-23           1.00  # 1/1 (First time will always be 1)
4   Peter 2016-08-21           1.00  # 2/2
5    Anna 2016-08-23           0.50  # 2/4 (two days seen out of 4 days that she could have been seen)
6  Sophie 2016-08-24           1.00  # 2/2 (two days seen out of 2 days she could have been seen)
7   Peter 2016-08-23           0.75  # 3/4
8    Anna 2016-08-23           0.50  # So on...
9  Sophie 2016-08-25           1.00
10  Peter 2016-08-27           0.50
11   Anna 2016-08-28           0.33
12 Sophie 2016-08-26           1.00
13  Peter 2016-08-27           0.50
14   Anna 2016-08-29           0.40
15 Sophie 2016-08-30           0.62

Does anyone know how to do this in R?

1 answer:

Answer 0: (score: 1)

One option is

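library(zoo)  # for na.locf()
# Assumes each person's dates are already in chronological order
df1$Prop_days_seen <- round(unsplit(lapply(split(df1$Date, df1$ID), function(x) {
      i1 <- cumsum(c(1, as.integer(diff(x))))  # days elapsed since first sighting (first day = 1)
      i2 <- !duplicated(i1)                    # TRUE for the first row of each distinct day
      v1 <- numeric(length(x))
      v1[!i2] <- NA                            # repeat sightings on the same day, filled below
      v1[i2] <- seq_along(x[i2])/i1[i2]        # distinct days seen / days elapsed
      na.locf(v1) }), df1$ID), 2)              # carry the last value forward over the NAs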

df1
#       ID       Date Prop_days_seen
#1   Peter 2016-08-20           1.00
#2    Anna 2016-08-20           1.00
#3  Sophie 2016-08-23           1.00
#4   Peter 2016-08-21           1.00
#5    Anna 2016-08-23           0.50
#6  Sophie 2016-08-24           1.00
#7   Peter 2016-08-23           0.75
#8    Anna 2016-08-23           0.50
#9  Sophie 2016-08-25           1.00
#10  Peter 2016-08-27           0.50
#11   Anna 2016-08-28           0.33
#12 Sophie 2016-08-26           1.00
#13  Peter 2016-08-27           0.50
#14   Anna 2016-08-29           0.40
#15 Sophie 2016-08-30           0.62
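
To make the arithmetic easier to follow, here is a small trace for a single person, using Anna's five dates from the question. It is only a sketch: the names x, elapsed, seen and prop are made up for illustration, and it assumes the dates are already in chronological order, as they are within each person in df1.

```r
# Anna's dates from the question, in chronological order (assumed)
x <- as.Date(c("2016-08-20", "2016-08-23", "2016-08-23",
               "2016-08-28", "2016-08-29"))

# Days elapsed since the first sighting, counting the first day as day 1
elapsed <- cumsum(c(1, as.integer(diff(x))))   # 1 4 4 9 10

# Running count of distinct days on which she was seen
seen <- cumsum(!duplicated(x))                 # 1 2 2 3 4

# Proportion of days seen out of days elapsed
prop <- round(seen / elapsed, 2)               # 1.00 0.50 0.50 0.33 0.40
prop
```

The two sightings on 2016-08-23 share the same numerator and denominator, which is why rows 5 and 8 of the output both show 0.50.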

It can also be made more compact

library(dplyr)
df1 %>% 
  group_by(ID) %>% 
  mutate(n1 = cumsum(c(1, as.integer(diff(Date)))),   # days elapsed since first sighting (first day = 1)
         Prop_days_seen = cumsum(!duplicated(n1))/n1)  %>%   # distinct days seen so far / days elapsed
  select(-n1)
# A tibble: 15 x 3
# Groups:   ID [3]
#   ID     Date       Prop_days_seen
#   <fct>  <date>              <dbl>
# 1 Peter  2016-08-20          1    
# 2 Anna   2016-08-20          1    
# 3 Sophie 2016-08-23          1    
# 4 Peter  2016-08-21          1    
# 5 Anna   2016-08-23          0.5  
# 6 Sophie 2016-08-24          1    
# 7 Peter  2016-08-23          0.75 
# 8 Anna   2016-08-23          0.5  
# 9 Sophie 2016-08-25          1    
#10 Peter  2016-08-27          0.5  
#11 Anna   2016-08-28          0.333
#12 Sophie 2016-08-26          1    
#13 Peter  2016-08-27          0.5  
#14 Anna   2016-08-29          0.4  
#15 Sophie 2016-08-30          0.625
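
Both versions above rely on diff(), so they assume each person's rows are already in date order. If that might not hold, one possible variant (a base R sketch, not part of the original answer) computes each row's value from the dates themselves rather than from row order. The helper names d, elapsed and seen are hypothetical, and the data frame is rebuilt here only so the block runs on its own.

```r
# Same data as in the question
df1 <- data.frame(
  ID = rep(c("Peter", "Anna", "Sophie"), times = 5),
  Date = as.Date(c("2016-08-20", "2016-08-20", "2016-08-23", "2016-08-21",
                   "2016-08-23", "2016-08-24", "2016-08-23", "2016-08-23",
                   "2016-08-25", "2016-08-27", "2016-08-28", "2016-08-26",
                   "2016-08-27", "2016-08-29", "2016-08-30"))
)

d <- as.integer(df1$Date)   # dates as whole-day counts

# Days elapsed since each person's first sighting (first day counts as 1)
elapsed <- d - ave(d, df1$ID, FUN = min) + 1L

# Number of distinct sighting days up to and including each row's date
seen <- ave(d, df1$ID, FUN = function(v) match(v, sort(unique(v))))

df1$Prop_days_seen <- round(seen / elapsed, 2)
df1
```

Because each row's value depends only on that person's set of dates, shuffling the rows of df1 should leave every row's Prop_days_seen unchanged.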