Extracting dates from strings in R

Time: 2016-06-18 16:39:15

Tags: regex r date data.table

I have a character vector like the one shown below, and I want to extract the dates from it.

Before extracting the dates, I want to create a Date_Flag. I used the following code, but it gives a different output than expected:

check_values <- c("deficit based on wage statement 7/14/ to 7/17/2015",
                "Deficit Due: $1205.73 -$879.63= $326.10 x 70%=$228.2",
                "Deficit Due for 12 wks pd - 7/14/15 thru 10/5/15;",
                "Deficit due to wage statement: 4/22/15 thru 5/12/15",
                "depos transcript 7/10/15 for 7/8/15 depos",
                "difference owed for 4/25/15-5/22/15",
                "tpd 4:30:2015 - 5:22:2015",
                "Medical TREATMENT DATES:  6/30/2015 -  6/30/2015",
                "4/25/15-5/22/15",
                "Medical")

library(data.table)  # provides as.data.table()
check_values <- as.data.table(check_values)
names(check_values) <- "check_memo"

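After creating Date_Flag, I want to extract the dates (both parts). Can someone tell me what is wrong with the regular expression above?

Thanks

1 Answer:

Answer 0: (score: 3)

We can use str_count to create 'Date_Flag'. If an element of 'check_memo' contains exactly 2 complete dates, we get TRUE, otherwise FALSE.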

library(data.table)
library(stringr)
pat <- "[0-9]{1,2}[/:][0-9]{1,2}[/:][0-9]{2,4}"
check_values[,Date_Flag := str_count(check_memo, pat)==2]
check_values
#                                             check_memo Date_Flag
#1:   deficit based on wage statement 7/14/ to 7/17/2015     FALSE
#2: Deficit Due: $1205.73 -$879.63= $326.10 x 70%=$228.2     FALSE
#3:    Deficit Due for 12 wks pd - 7/14/15 thru 10/5/15;      TRUE
#4:  Deficit due to wage statement: 4/22/15 thru 5/12/15      TRUE
#5:            depos transcript 7/10/15 for 7/8/15 depos      TRUE
#6:                  difference owed for 4/25/15-5/22/15      TRUE
#7:                            tpd 4:30:2015 - 5:22:2015      TRUE
#8:     Medical TREATMENT DATES:  6/30/2015 -  6/30/2015      TRUE
#9:                                      4/25/15-5/22/15      TRUE
#10:                                             Medical     FALSE
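
To see why the first row gets FALSE, the counting step can be reproduced without stringr. The following is a minimal sketch, assuming base R's gregexpr() and regmatches() as a stand-in for str_count(); the helper name count_dates is hypothetical and not part of the original answer.

```r
# Same pattern as in the answer: 1-2 digits, "/" or ":", 1-2 digits,
# "/" or ":", then 2-4 digits for the year.
pat <- "[0-9]{1,2}[/:][0-9]{1,2}[/:][0-9]{2,4}"

# Hypothetical helper: number of non-overlapping matches per element (base R only).
count_dates <- function(x, pattern = pat) {
  lengths(regmatches(x, gregexpr(pattern, x)))
}

memo <- c("deficit based on wage statement 7/14/ to 7/17/2015",
          "tpd 4:30:2015 - 5:22:2015",
          "Medical")

n <- count_dates(memo)
n        # 1 2 0: "7/14/" has no year, so only one full date is found
n == 2   # FALSE TRUE FALSE
```

Because ":" is accepted as a separator, the pattern would also match clock times such as 12:30:45, so the flag is only as reliable as the memo text allows.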

If we need to extract the dates, use the same pattern with str_extract_all:

check_values[(Date_Flag),  paste0("Date", 1:2) := 
                  transpose(str_extract_all(check_memo, pat))]

check_values
#                                              check_memo Date_Flag     Date1     Date2
# 1:   deficit based on wage statement 7/14/ to 7/17/2015     FALSE        NA        NA
# 2: Deficit Due: $1205.73 -$879.63= $326.10 x 70%=$228.2     FALSE        NA        NA
# 3:    Deficit Due for 12 wks pd - 7/14/15 thru 10/5/15;      TRUE   7/14/15   10/5/15
# 4:  Deficit due to wage statement: 4/22/15 thru 5/12/15      TRUE   4/22/15   5/12/15
# 5:            depos transcript 7/10/15 for 7/8/15 depos      TRUE   7/10/15    7/8/15
# 6:                  difference owed for 4/25/15-5/22/15      TRUE   4/25/15   5/22/15
# 7:                            tpd 4:30:2015 - 5:22:2015      TRUE 4:30:2015 5:22:2015
# 8:     Medical TREATMENT DATES:  6/30/2015 -  6/30/2015      TRUE 6/30/2015 6/30/2015
# 9:                                      4/25/15-5/22/15      TRUE   4/25/15   5/22/15
#10:                                              Medical     FALSE        NA        NA
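
The extracted values are still character strings with mixed separators and year lengths. As a possible follow-up, here is a minimal sketch of converting them to Date objects in base R. It assumes month/day/year order, and the helper name to_date is hypothetical; neither is stated in the original answer. Three-digit years, which the pattern also allows, are not handled.

```r
# Hypothetical helper: convert strings such as "7/14/15" or "4:30:2015" to Date.
# Assumes month/day/year order; NA input stays NA.
to_date <- function(x) {
  x <- gsub(":", "/", x, fixed = TRUE)       # normalise the separator
  four_digit <- grepl("/[0-9]{4}$", x)       # TRUE when the year has 4 digits
  out <- rep(as.Date(NA), length(x))
  out[four_digit]  <- as.Date(x[four_digit],  format = "%m/%d/%Y")
  out[!four_digit] <- as.Date(x[!four_digit], format = "%m/%d/%y")
  out
}

res <- to_date(c("7/14/15", "4:30:2015", "10/5/15", NA))
format(res)
```

With the table from the answer, this could be applied as check_values[, Date1 := to_date(Date1)], and likewise for Date2.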