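I have a data frame like the one below, containing the dates for which I need to count visits. However, within a single unique id, the rule is: after sorting the rows by date, if the difference between the enddt of one row and the strdt of the next row is less than 2 days, the two rows should be treated as one visit.
Data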
id strdt enddt
ep01 2017-06-23 2017-06-24
ep01 2017-06-28 2017-06-30
ep01 2017-06-25 2017-06-26
ep02 2017-05-06 2017-05-10
ep02 2017-05-12 2017-05-14
ep02 2017-05-15 2017-05-16
ep03 2017-05-15 2017-05-16
ep04 2017-05-15 2017-05-17
Expected output:
id strdt enddt
ep01 2017-06-23 2017-06-26
ep01 2017-06-28 2017-06-30
ep02 2017-05-06 2017-05-10
ep02 2017-05-12 2017-05-16
ep03 2017-05-15 2017-05-16
ep04 2017-05-15 2017-05-17
Attempt
data = read.csv("data.csv",header = T,stringsAsFactors = F)
unique_id = unique(data$id)
id_data = NULL
for (i in 1: length(unique_id)){
  id_data = data[data$id == unique_id[i],]
  id_data = id_data[ order(id_data$strdt , decreasing = F ),]
  id_data = ifelse(id_data$enddt - id_data$str_dt < 1, id_data$enddt[2,3],id_data$enddt)
}
I tried the code above, but could not get it to work. Thanks in advance.
Answer 0 (score: 1)
The lead function from dplyr may help with your problem.
https://www.rdocumentation.org/packages/dplyr/versions/0.7.3/topics/lead-lag
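As a quick illustration of what lead and lag return, here is a minimal sketch on three made-up dates (the vector names x, next_x, prev_x and gap_days are only for this example). It shows that the shifted vector keeps its length and is padded with NA at the end where no neighbour exists.

```r
library(dplyr)

# three illustrative dates, already in chronological order
x <- as.Date(c("2017-06-23", "2017-06-25", "2017-06-28"))

# lead() shifts values up by one position; the last element has no successor
next_x <- lead(x)
print(next_x)

# lag() shifts the other way; the first element has no predecessor
prev_x <- lag(x)
print(prev_x)

# gap in days between each date and the one that follows it
gap_days <- as.numeric(next_x - x)
print(gap_days)
```

The NA at the end of next_x is what any later comparison has to handle explicitly.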
I have not built a fully working solution, but the logic can be inferred from the following code:
library("dplyr")
# Use `=` inside data.frame() so the columns are named directly
# (`<-` here would create stray global variables and mangled column names)
dat <- data.frame(
  id = c("ep01", "ep01", "ep01", "ep02", "ep02", "ep02", "ep03", "ep04"),
  startdt = as.Date(c("2017-06-23", "2017-06-28", "2017-06-25", "2017-05-06", "2017-05-12", "2017-05-15", "2017-05-15", "2017-05-15")),
  enddt = as.Date(c("2017-06-24", "2017-06-30", "2017-06-26", "2017-05-10", "2017-05-14", "2017-05-16", "2017-05-16", "2017-05-17")),
  stringsAsFactors = FALSE
)
# Get the next start date; use dplyr::group_by() to get the next start date within each id
dat$start_lead <- lead(dat$startdt)
# Difference between the next start date and the current end date:
# 0 if it is less than 2 days (the rows belong to one visit), 1 otherwise
dat$is_less_thn_2 <- ifelse(dat$start_lead - dat$enddt < 2, 0, 1)
# Get the next diff value
dat$take_enddt_value <- lead(dat$is_less_thn_2)
# This part does not run as written: take_enddt_value contains NA in its last rows,
# so the comparisons below eventually fail
for (i in 1:nrow(dat)) {
  # If take_enddt_value is 0, advance until take_enddt_value is 1,
  # then set the current enddt to the enddt of that row
  if (dat[i, "take_enddt_value"] == 0) {
    k <- i
    while (dat[k, "take_enddt_value"] == 0) {
      k <- k + 1
    }
    dat[i, "enddt"] <- dat[k, "enddt"]
  }
}
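One way the lead idea could be carried through to the expected output is sketched below. This is not the answerer's code: it assumes the rows are first sorted by id and start date, that lead is applied within each id, and that visits of one id do not nest inside each other. The names merged, merge_next and keep are hypothetical helpers introduced for this sketch.

```r
library(dplyr)

dat <- data.frame(
  id = c("ep01", "ep01", "ep01", "ep02", "ep02", "ep02", "ep03", "ep04"),
  startdt = as.Date(c("2017-06-23", "2017-06-28", "2017-06-25", "2017-05-06",
                      "2017-05-12", "2017-05-15", "2017-05-15", "2017-05-15")),
  enddt = as.Date(c("2017-06-24", "2017-06-30", "2017-06-26", "2017-05-10",
                    "2017-05-14", "2017-05-16", "2017-05-16", "2017-05-17")),
  stringsAsFactors = FALSE
)

merged <- dat %>%
  arrange(id, startdt) %>%                 # chronological order within each id
  group_by(id) %>%
  mutate(start_lead = lead(startdt)) %>%   # NA on the last row of each id
  ungroup() %>%
  # TRUE when the next visit of the same id starts less than 2 days after this one ends
  mutate(merge_next = !is.na(start_lead) & as.numeric(start_lead - enddt) < 2)

# Walk backwards so that chains of merges carry the final end date to the first row
keep <- rep(TRUE, nrow(merged))
for (i in rev(seq_len(nrow(merged)))) {
  if (merged$merge_next[i]) {
    merged$enddt[i] <- max(merged$enddt[i], merged$enddt[i + 1])
    keep[i + 1] <- FALSE                   # the absorbed row is dropped afterwards
  }
}
merged <- as.data.frame(merged[keep, c("id", "startdt", "enddt")])
print(merged)
```

Treating NA as "do not merge" is what avoids the failing comparison on the last row of each id.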
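Answer 1 (score: 1)
Another approach is to group the rows that should be combined and then compute the start and end dates. Note the final flag column computed before the last group_by statement.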
library(dplyr)
library(data.table)
df %>%
  arrange(id, strdt) %>%
  group_by(id) %>%
  mutate(flag = as.numeric(strdt - lag(enddt, order_by = id, default = first(strdt)))) %>%
  mutate(flag = rleid(ifelse((flag < 2 & row_number() != 1) | lead(flag, order_by = id, default = 9999) < 2,
                             9999,
                             row_number()))) %>% # final grouping happens here
  group_by(id, flag) %>%
  summarise(strdt = first(strdt),
            enddt = last(enddt)) %>%
  select(-flag)
Output is:
id strdt enddt
1 ep01 2017-06-23 2017-06-26
2 ep01 2017-06-28 2017-06-30
3 ep02 2017-05-06 2017-05-10
4 ep02 2017-05-12 2017-05-16
5 ep03 2017-05-15 2017-05-16
6 ep04 2017-05-15 2017-05-17
Sample data:
df <- structure(list(id = c("ep01", "ep01", "ep01", "ep02", "ep02",
"ep02", "ep03", "ep04"), strdt = structure(c(17340, 17345, 17342,
17292, 17298, 17301, 17301, 17301), class = "Date"), enddt = structure(c(17341,
17347, 17343, 17296, 17300, 17302, 17302, 17303), class = "Date")), .Names = c("id",
"strdt", "enddt"), row.names = c(NA, -8L), class = "data.frame")
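As a variation on the same grouping idea, the visit key can also be built with a cumulative sum over "a new visit starts here" indicators, using dplyr alone. This is a sketch rather than the code from this answer; the column names gap, new_visit and visit and the result name visits are hypothetical. As far as I can tell, it also keeps two separate runs of merged rows within one id apart, whereas a shared sentinel value such as 9999 would seem to place them in the same run.

```r
library(dplyr)

df <- data.frame(
  id = c("ep01", "ep01", "ep01", "ep02", "ep02", "ep02", "ep03", "ep04"),
  strdt = as.Date(c("2017-06-23", "2017-06-28", "2017-06-25", "2017-05-06",
                    "2017-05-12", "2017-05-15", "2017-05-15", "2017-05-15")),
  enddt = as.Date(c("2017-06-24", "2017-06-30", "2017-06-26", "2017-05-10",
                    "2017-05-14", "2017-05-16", "2017-05-16", "2017-05-17")),
  stringsAsFactors = FALSE
)

visits <- df %>%
  arrange(id, strdt) %>%
  group_by(id) %>%
  mutate(
    gap = as.numeric(strdt - lag(enddt)),   # days since the previous row ended; NA on the first row
    new_visit = is.na(gap) | gap >= 2,      # first row, or a gap of 2 days or more
    visit = cumsum(new_visit)               # running visit number within the id
  ) %>%
  group_by(id, visit) %>%
  summarise(strdt = min(strdt), enddt = max(enddt), .groups = "drop") %>%
  select(-visit)

print(as.data.frame(visits))
```

Using min and max instead of first and last makes the summary independent of row order inside each visit.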