Determine whether a specific value occurs after another specific value

Time: 2018-02-13 22:38:40

Tags: r date data.table

I have the following table:

+----+------------+----------+
| ID |    Date    | Variable |
+----+------------+----------+
| a  | 12/03/2017 | d        |
| a  | 15/04/2017 | d        |
| a  | 20/06/2017 | c        |
| b  | 14/05/2017 | c        |
| b  | 15/08/2017 | c        |
| b  | 16/09/2017 | c        |
+----+------------+----------+

For each ID, I want a check in a separate column that indicates whether a "c" value occurs after a "d" value has appeared, as shown below:

+----+------------+----------+-------+------------+
| ID |    Date    | Variable | Check |    Date    |
+----+------------+----------+-------+------------+
| a  | 12/03/2017 | d        |     1 | 20/06/2017 |
| a  | 15/04/2017 | d        |     1 | 20/06/2017 |
| a  | 20/06/2017 | c        |     1 | 20/06/2017 |
| b  | 14/05/2017 | c        |     0 | 0          |
| b  | 15/08/2017 | c        |     0 | 0          |
| b  | 16/09/2017 | c        |     0 | 0          |
+----+------------+----------+-------+------------+

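This is not just about finding an occurrence of "c", but about whether "c" occurs after "d". Having the corresponding date in a separate column would also help. I tried removing duplicates and then identifying the leading value (or row count > 1), but is there a simpler way?

Any dplyr or data.table approach would be very useful.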
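To make the condition concrete, here is a minimal base R sketch of the test on a single ID's values. The helper name c_after_d is hypothetical (it is not from the post), and it assumes the values are already sorted by date within the ID.

```r
# Hypothetical helper (not part of the original post).
# Returns TRUE when some "c" sits at a later position than the first "d".
# Assumes x holds one ID's values, already in chronological order.
c_after_d <- function(x) {
  first_d <- match("d", x)            # position of the first "d", NA if none
  if (is.na(first_d)) return(FALSE)   # no "d" at all, so nothing can follow it
  any(which(x == "c") > first_d)      # is there a "c" after that position?
}

c_after_d(c("d", "d", "c"))  # ID a in the table above
c_after_d(c("c", "c", "c"))  # ID b in the table above
```

The first call corresponds to ID a and the second to ID b; a grouped dplyr or data.table solution would apply the same test per ID.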

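3 answers:

Answer 0 (score: 2)

A solution using dplyr. There must be a better way than this, but I think it should work. unique(Variable[!is.na(Variable)]) returns a vector that is one of c("c", "d"), c("d", "c"), "c", or "d". If you are sure there are no NAs, you can drop the !is.na. Date[Variable %in% "c"][1] selects the first date on which Variable is "c".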

library(dplyr)

dat2 <- dat %>%
  group_by(ID) %>%
  mutate(Check = ifelse(identical(unique(Variable[!is.na(Variable)]), c("d", "c")), 
                        1L, 0L)) %>%
  mutate(Date2 = ifelse(Check == 1L, Date[Variable %in% "c"][1], "0")) %>%
  ungroup()
dat2
# # A tibble: 6 x 5
#   ID    Date       Variable Check Date2     
#   <chr> <chr>      <chr>    <int> <chr>     
# 1 a     12/03/2017 d            1 20/06/2017
# 2 a     15/04/2017 d            1 20/06/2017
# 3 a     20/06/2017 c            1 20/06/2017
# 4 b     14/05/2017 c            0 0         
# 5 b     15/08/2017 c            0 0         
# 6 b     16/09/2017 c            0 0  
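The check above relies on the group's unique values being exactly c("d", "c"), so it may not cover groups with other values or with a "d" appearing again after a "c". As a hedged alternative sketch (not the answerer's method), the dates can be parsed and compared directly. The helper first_c_after_d and the column names Date_parsed and c_date are made up for illustration, and the sketch assumes every date parses with the format "%d/%m/%Y".

```r
library(dplyr)

dat <- data.frame(
  ID = rep(c("a", "b"), each = 3),
  Date = c("12/03/2017", "15/04/2017", "20/06/2017",
           "14/05/2017", "15/08/2017", "16/09/2017"),
  Variable = c("d", "d", "c", "c", "c", "c"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: earliest "c" date strictly after the earliest "d" date,
# or NA when the group has no "d" or no later "c".
first_c_after_d <- function(date, variable) {
  d_dates <- date[variable %in% "d"]
  if (length(d_dates) == 0) return(as.Date(NA))
  c_dates <- date[variable %in% "c" & date > min(d_dates)]
  if (length(c_dates) == 0) as.Date(NA) else min(c_dates)
}

res <- dat %>%
  # Parse so that comparisons are chronological, not alphabetical
  mutate(Date_parsed = as.Date(Date, format = "%d/%m/%Y")) %>%
  group_by(ID) %>%
  mutate(c_date = first_c_after_d(Date_parsed, Variable),
         Check = as.integer(!is.na(c_date))) %>%
  ungroup() %>%
  select(ID, Date, Variable, Check, c_date)

res
```

Here c_date is NA rather than "0" when there is no match, which keeps the column a proper Date.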

Data

dat <- read.table(text = "ID Date Variable
a  '12/03/2017' d
a  '15/04/2017' d
a  '20/06/2017' c
b  '14/05/2017' c
b  '15/08/2017' c
b  '16/09/2017' c",
                  header = TRUE, stringsAsFactors = FALSE)

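Answer 1 (score: 2)

A data.table solution. As @RYoda also suggested, you can use data.table::shift to test your condition and then join the result back onto the original data set.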

check <- dat[, {
       idx <- Variable =='d' & shift(Variable, type="lead") == "c"
       list(MatchDate=ifelse(any(idx), shift(Date, type="lead", fill=NA_character_)[idx][1L], "0"), 
           Check=as.integer(any(idx)))
    }, by=.(ID)]   
dat[check, on=.(ID)]

#    ID       Date Variable  MatchDate Check
# 1:  a 12/03/2017        d 20/06/2017     1
# 2:  a 15/04/2017        d 20/06/2017     1
# 3:  a 20/06/2017        c 20/06/2017     1
# 4:  b 14/05/2017        c          0     0
# 5:  b 15/08/2017        c          0     0
# 6:  b 16/09/2017        c          0     0
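Note that the shift() test only matches a "c" on the row directly after a "d". If other values may sit between them, a variant along the following lines could be used instead. This is only a sketch under stated assumptions: the Date_parsed column is introduced here for sorting, dates are assumed to parse with "%d/%m/%Y", and unmatched IDs get NA rather than "0".

```r
library(data.table)

dat <- data.table(
  ID = rep(c("a", "b"), each = 3),
  Date = c("12/03/2017", "15/04/2017", "20/06/2017",
           "14/05/2017", "15/08/2017", "16/09/2017"),
  Variable = c("d", "d", "c", "c", "c", "c")
)

# Parse once so that ordering is chronological rather than alphabetical
dat[, Date_parsed := as.Date(Date, format = "%d/%m/%Y")]
setorder(dat, ID, Date_parsed)

dat[, MatchDate := {
  first_d <- match("d", Variable)   # row of the first "d" within the ID, NA if none
  later_c <- if (is.na(first_d)) integer(0) else
    which(Variable == "c" & seq_len(.N) > first_d)   # "c" rows after that "d"
  if (length(later_c) > 0L) Date[later_c[1L]] else NA_character_
}, by = ID]

dat[, Check := as.integer(!is.na(MatchDate))]
dat
```

Because the columns are added by reference with :=, no join back onto the original table is needed.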

Data:

library(data.table)
dat <- data.table(ID=rep(c('a','b'), each=3),
    Date=c("12/03/2017","15/04/2017","20/06/2017","14/05/2017","15/08/2017","16/09/2017"),
    Variable=c('d','d','c','c','c','c'))

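Answer 2 (score: 1)

One solution can be obtained using fill from the tidyr package. The approach is as follows: first populate Check and c_Date for the rows where Variable is c. Then use the fill function on the Check and c_Date columns to fill the rows above. This step puts the desired values into the rows that have the d value. Finally, simply replace the values of Check and c_Date for the rows where Variable is c.

Note: the OP suggested that Check for the rows where Variable is c can be either 0 or 1. My solution treats it as 0.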

# Data
df <- read.table(text = "ID     Date  Variable
a  12/03/2017 d
a  15/04/2017 d    
a  20/06/2017 c
b  14/05/2017 c
b  15/08/2017 c
b  16/09/2017 c", header = TRUE, stringsAsFactors = FALSE)


df$Date <- as.POSIXct(df$Date, format = "%d/%m/%Y")

library(dplyr)
library(tidyr)

df %>% group_by(ID) %>%
  arrange(ID, Date) %>%
  mutate(Check = ifelse(Variable == "c", 1L, NA),
         c_Date = ifelse(Variable == "c", as.character(Date), NA) ) %>%
  fill(Check, .direction = "up") %>%
  fill(c_Date, .direction = "up") %>%
  mutate(Check = ifelse(Variable == "c", 0L, Check),
         c_Date = ifelse(Variable == "c", NA, c_Date) )


# Result
#      ID    Date                Variable Check c_Date    
#      <chr> <dttm>              <chr>    <int> <chr>     
#    1 a     2017-03-12 00:00:00 d            1 2017-06-20
#    2 a     2017-04-15 00:00:00 d            1 2017-06-20
#    3 a     2017-06-20 00:00:00 c            0 <NA>      
#    4 b     2017-05-14 00:00:00 c            0 <NA>      
#    5 b     2017-08-15 00:00:00 c            0 <NA>      
#    6 b     2017-09-16 00:00:00 c            0 <NA>
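To see what the fill step does in isolation, here is a small sketch on a made-up column (the toy data frame is not from the post). It also shows a case the sample data does not contain: a "d" with no later "c" is left as NA, so it may need to be replaced with 0 afterwards.

```r
library(tidyr)

# Toy data: Check is set only on the "c" row, as in the first mutate() step
toy <- data.frame(
  Variable = c("d", "d", "c", "d"),
  Check    = c(NA, NA, 1L, NA)
)

# Fill upwards: each NA takes the next non-missing value below it
filled <- fill(toy, Check, .direction = "up")
filled$Check
```

The two leading "d" rows pick up the 1 from the "c" row below them, while the trailing "d" stays NA because nothing follows it.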