I have the following table:
+----+------------+----------+
| ID | Date | Variable |
+----+------------+----------+
| a | 12/03/2017 | d |
| a | 15/04/2017 | d |
| a | 20/06/2017 | c |
| b | 14/05/2017 | c |
| b | 15/08/2017 | c |
| b | 16/09/2017 | c |
+----+------------+----------+
For each ID, I want a separate column that checks whether a "c" value occurs after a "d" value, like this:
+----+------------+----------+-------+------------+
| ID | Date | Variable | Check | Date |
+----+------------+----------+-------+------------+
| a | 12/03/2017 | d | 1 | 20/06/2017 |
| a | 15/04/2017 | d | 1 | 20/06/2017 |
| a | 20/06/2017 | c | 1 | 20/06/2017 |
| b | 14/05/2017 | c | 0 | 0 |
| b | 15/08/2017 | c | 0 | 0 |
| b | 16/09/2017 | c | 0 | 0 |
+----+------------+----------+-------+------------+
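This is not just about finding an occurrence of "c", but about whether "c" occurs after "d". Having the corresponding date in a separate column would also help. I tried removing duplicates and then identifying the leading value (or a row count > 1), but is there a simpler way?
Any dplyr or data.table approach would be very helpful.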
Answer 0 (score: 2)
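A solution using dplyr. There must be a better way than this, but I think this should work. unique(Variable[!is.na(Variable)]) returns a vector that can only be c("c", "d"), c("d", "c"), "c", or "d". If you are sure there are no NA values, you can drop the !is.na part. Date[Variable %in% "c"][1] selects the first date on which Variable is "c".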
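library(dplyr)

dat2 <- dat %>%
  group_by(ID) %>%
  # Check is 1 only when the distinct non-NA values appear in the order "d", "c"
  mutate(Check = ifelse(identical(unique(Variable[!is.na(Variable)]), c("d", "c")),
                        1L, 0L)) %>%
  # Date2 is the first date with "c" in the group, or "0" when Check is 0
  mutate(Date2 = ifelse(Check == 1L, Date[Variable %in% "c"][1], "0")) %>%
  ungroup()
dat2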
# # A tibble: 6 x 5
# ID Date Variable Check Date2
# <chr> <chr> <chr> <int> <chr>
# 1 a 12/03/2017 d 1 20/06/2017
# 2 a 15/04/2017 d 1 20/06/2017
# 3 a 20/06/2017 c 1 20/06/2017
# 4 b 14/05/2017 c 0 0
# 5 b 15/08/2017 c 0 0
# 6 b 16/09/2017 c 0 0
Data
dat <- read.table(text = "ID Date Variable
a '12/03/2017' d
a '15/04/2017' d
a '20/06/2017' c
b '14/05/2017' c
b '15/08/2017' c
b '16/09/2017' c",
header = TRUE, stringsAsFactors = FALSE)
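The identical(unique(...), c("d", "c")) test depends on the distinct values appearing in exactly that order, so a group ordered c, d, c would get Check = 0 even though a "c" does follow a "d". Below is a minimal sketch of a position-based check, assuming the rows are already in date order within each ID. The helper first_c_after_d and the extra ID "e" are hypothetical, added only for illustration.

```r
library(dplyr)

# Same rows as above, plus a hypothetical ID "e" ordered c, d, c
dat <- data.frame(
  ID       = c("a", "a", "a", "b", "b", "b", "e", "e", "e"),
  Date     = c("12/03/2017", "15/04/2017", "20/06/2017",
               "14/05/2017", "15/08/2017", "16/09/2017",
               "01/01/2017", "02/02/2017", "03/03/2017"),
  Variable = c("d", "d", "c", "c", "c", "c", "c", "d", "c"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: position of the first "c" that comes after the
# first "d", or NA if there is none. Assumes v is in date order.
first_c_after_d <- function(v) {
  first_d <- match("d", v)                      # NA when there is no "d"
  if (is.na(first_d)) return(NA_integer_)
  later_c <- which(v == "c" & seq_along(v) > first_d)
  if (length(later_c) == 0L) NA_integer_ else later_c[1L]
}

res <- dat %>%
  group_by(ID) %>%
  mutate(pos   = first_c_after_d(Variable),     # one value per group, recycled
         Check = as.integer(!is.na(pos)),
         Date2 = if_else(is.na(pos), "0", Date[pos])) %>%
  ungroup() %>%
  select(-pos)
res
```

With this sketch, IDs "a" and "e" get Check = 1 and ID "b" gets Check = 0; for "e" the reported date is that of the "c" after the "d", not the earlier "c".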
Answer 1 (score: 2)
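A data.table solution. As @RYoda also suggested, you can use data.table::shift to test for your condition and then merge the result back onto the original dataset. Note that this flags a "d" that is immediately followed by a "c" in the next row of the same ID.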
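check <- dat[, {
  # TRUE where a "d" row is immediately followed by a "c" row within the ID
  idx <- Variable == "d" & shift(Variable, type = "lead") == "c"
  # na.rm = TRUE: the lead of the last row is NA and would otherwise leak into any()
  found <- any(idx, na.rm = TRUE)
  list(MatchDate = ifelse(found, shift(Date, type = "lead")[which(idx)][1L], "0"),
       Check = as.integer(found))
}, by = .(ID)]
# Join the per-ID result back onto the original rows
dat[check, on = .(ID)]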
# ID Date Variable MatchDate Check
# 1: a 12/03/2017 d 20/06/2017 1
# 2: a 15/04/2017 d 20/06/2017 1
# 3: a 20/06/2017 c 20/06/2017 1
# 4: b 14/05/2017 c 0 0
# 5: b 15/08/2017 c 0 0
# 6: b 16/09/2017 c 0 0
Data:
library(data.table)
dat <- data.table(ID=rep(c('a','b'), each=3),
Date=c("12/03/2017","15/04/2017","20/06/2017","14/05/2017","15/08/2017","16/09/2017"),
Variable=c('d','d','c','c','c','c'))
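To see what the shift-based test does at the end of a group, here is a small sketch on a hypothetical vector of values for one ID (not taken from the data above). shift with type = "lead" pads the last position with NA, so a group whose last value is "d" produces an NA in the comparison, which matters when the result is passed to any().

```r
library(data.table)

# Hypothetical values for a single ID, in date order
v   <- c("c", "d", "c", "d")
nxt <- shift(v, type = "lead")   # next value; NA in the last position
idx <- v == "d" & nxt == "c"     # "d" immediately followed by "c"

nxt                              # "d" "c" "d" NA
idx                              # FALSE TRUE FALSE NA
which(idx)                       # 2 (which() drops the NA)

# A group ending in "d" with no d-then-c pair:
idx2 <- c("c", "d") == "d" & shift(c("c", "d"), type = "lead") == "c"
any(idx2)                        # NA
any(idx2, na.rm = TRUE)          # FALSE
```

Using na.rm = TRUE (or which()) keeps the per-ID result at 0 or 1 instead of NA in that edge case.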
Answer 2 (score: 1)
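One solution can be obtained using fill from the tidyr package. The approach is as follows:
First, populate Check and c_Date for the rows where Variable is c. Then use the fill function on the Check and c_Date columns to fill the rows above. This step puts the desired values into the rows that have the d value. Finally, just replace the values of Check and c_Date for the rows where Variable is c.
Note: the OP suggested that Check for rows where Variable is c can be either 0 or 1. My solution treats it as 0.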
# Data
df <- read.table(text = "ID Date Variable
a 12/03/2017 d
a 15/04/2017 d
a 20/06/2017 c
b 14/05/2017 c
b 15/08/2017 c
b 16/09/2017 c", header = TRUE, stringsAsFactors = FALSE)
df$Date <- as.POSIXct(df$Date, format = "%d/%m/%Y")

library(dplyr)
library(tidyr)

df %>% group_by(ID) %>%
  arrange(ID, Date) %>%
  # Step 1: mark the "c" rows with a flag and their own date
  mutate(Check = ifelse(Variable == "c", 1L, NA),
         c_Date = ifelse(Variable == "c", as.character(Date), NA)) %>%
  # Step 2: copy those values upward onto the preceding rows
  fill(Check, .direction = "up") %>%
  fill(c_Date, .direction = "up") %>%
  # Step 3: reset the "c" rows themselves
  mutate(Check = ifelse(Variable == "c", 0L, Check),
         c_Date = ifelse(Variable == "c", NA, c_Date))
# Result
# ID Date Variable Check c_Date
# <chr> <dttm> <chr> <int> <chr>
# 1 a 2017-03-12 00:00:00 d 1 2017-06-20
# 2 a 2017-04-15 00:00:00 d 1 2017-06-20
# 3 a 2017-06-20 00:00:00 c 0 <NA>
# 4 b 2017-05-14 00:00:00 c 0 <NA>
# 5 b 2017-08-15 00:00:00 c 0 <NA>
# 6 b 2017-09-16 00:00:00 c 0 <NA>
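The upward fill is the step that carries the "c" information back onto the earlier "d" rows. A minimal sketch on a hypothetical toy data frame (not the data above) shows the behaviour, including the fact that a "d" with no later "c" is left as NA rather than 0:

```r
library(tidyr)

# Hypothetical toy input: only the "c" row has been marked so far
toy <- data.frame(Variable = c("d", "d", "c", "d"),
                  Check    = c(NA, NA, 1L, NA))

# Each NA takes the next non-missing value below it
filled <- fill(toy, Check, .direction = "up")
filled$Check   # 1 1 1 NA
```

If a 0 is wanted for such trailing rows, the remaining NA values would need to be replaced in a separate step.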