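I am trying to extract only the dates from a body of unstructured text.
The problem is that the dates can appear in any of the following formats.
Sample text: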
x <- "There is a date which is Jan 2, 2017. Here is another date example 02/01/2017. This is third example date type [01/02/17]. This is fourth example date Jan. 16 and finally one more example is 01-02-2017"
What I am trying is one of the other options (taken from the example in this answer):
gsub(".*[(]|[)].*", "", string)
Is there a more general approach?
Answer 0 (score: 3)
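First, without knowing the date format, you cannot tell which part of 02/03/2002 is the day and which is the month. If the year can also be two digits, as in dd/mm/yy, yy/mm/dd or mm/yy/dd, you cannot tell which part is the day, which is the month and which is the year.
With all that in mind, and given that the strings may come from third parties whose format you cannot be sure of, no solution can guarantee to identify the day, month and year for you.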
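To make the ambiguity concrete, here is a minimal base-R sketch that parses the same string under two assumed conventions. The variable names are only illustrative; both calls succeed, and they disagree.

```r
# The same string parsed under two assumed conventions.
s <- "02/03/2002"

day_first   <- as.Date(s, format = "%d/%m/%Y")  # read as 2 March 2002
month_first <- as.Date(s, format = "%m/%d/%Y")  # read as 3 February 2002

format(day_first)    # "2002-03-02"
format(month_first)  # "2002-02-03"
```

Neither result is wrong as far as R is concerned, which is why the format has to come from knowledge about the data source rather than from the string itself.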
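It is possible, however, to recognise all the patterns you mentioned. The solution below gives you three capture groups: groups 1, 2 and 3 hold the three parts of the date for every format you listed. You will then have to analyse the parts, or guess, to decide which one is the day, which is the month and which is the year. A regular expression cannot cover that.
Given all of the above, you can try the following regex: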
((?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z ]*\.?)|(?:\d{1,2}))[\/ ,-](\d{1,2})(?:[\/ ,-]\s*(\d{4}|\d{2}))?
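As a quick check of what the three groups hold, here is a small base-R sketch on a single made-up sentence. It assumes R's `perl = TRUE` engine, doubles the backslashes for an R string, and adds `(?i)` for case-insensitive matching; the input text is not from the question.

```r
# Same pattern as above, escaped for an R string, with (?i) for case-insensitivity.
patt <- "(?i)((?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z ]*\\.?)|(?:\\d{1,2}))[\\/ ,-](\\d{1,2})(?:[\\/ ,-]\\s*(\\d{4}|\\d{2}))?"

# Made-up input containing one date.
txt <- "The meeting is on January 2, 2017 at noon."

# regexec() returns the first match plus its capture groups.
groups <- regmatches(txt, regexec(patt, txt, perl = TRUE))[[1]]

groups[1]  # whole match: "January 2, 2017"
groups[2]  # group 1, month name or first number: "January"
groups[3]  # group 2, second number: "2"
groups[4]  # group 3, optional year: "2017"
```

Group 1 is either a month name (optionally followed by a period) or a one- or two-digit number, group 2 is a one- or two-digit number, and group 3 is an optional four- or two-digit year.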
Sample source (run here):
library(stringr)
str<-"Jan. 16 bla bla bla Jan 16 2017 bla bla bla January 2, 2017 bla bla bla 02/01/2017 bla bla bla 01/02/2017 bla bla bla 01-02-17 bla bla bla jan. 16 There is a date which is Jan 2, 2017. Here is another date example 02/01/2017. This is third example date type [01/02/17]. This is fourth example date Jan. 16 and finally one more example is 01-02-2017"
patt <- "(?i)((?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z ]*\\.?)|(?:\\d{1,2}))[\\/ ,-](\\d{1,2})(?:[\\/ ,-]\\s*(\\d{4}|\\d{2}))?"
result<-str_match_all(str,patt)
result
Sample output:
      [,1]              [,2]      [,3] [,4]
 [1,] "Jan. 16"         "Jan."    "16" ""
 [2,] "Jan 16 2017"     "Jan"     "16" "2017"
 [3,] "January 2, 2017" "January" "2"  "2017"
 [4,] "02/01/2017"      "02"      "01" "2017"
 [5,] "01/02/2017"      "01"      "02" "2017"
 [6,] "01-02-17"        "01"      "02" "17"
 [7,] "jan. 16"         "jan."    "16" ""
 [8,] "Jan 2, 2017"     "Jan"     "2"  "2017"
 [9,] "02/01/2017"      "02"      "01" "2017"
[10,] "01/02/17"        "01"      "02" "17"
[11,] "Jan. 16"         "Jan."    "16" ""
[12,] "01-02-2017"      "01"      "02" "2017"
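Turning these parts into actual dates still requires a guess about the convention. The sketch below shows one possible heuristic in base R. The helper `parts_to_date` and its `default_year` parameter are hypothetical, and the rules it applies (numeric dates are month-first, two-digit years mean 20xx, a missing year falls back to a default) are assumptions you would need to adapt to your data.

```r
# Hypothetical helper: turn one row of capture groups into a Date.
# Assumptions (not part of the regex solution): numeric dates are month-first,
# two-digit years mean 20xx, and a missing year falls back to default_year.
parts_to_date <- function(first, second, year, default_year = 2017L) {
  # A month name is unambiguous: compare its first three letters with month.abb.
  m <- match(tolower(substr(first, 1, 3)), tolower(month.abb))
  if (is.na(m)) m <- as.integer(first)  # numeric first part: assumed to be the month
  d <- as.integer(second)
  y <- if (is.na(year) || year == "") default_year else as.integer(year)
  if (y < 100L) y <- y + 2000L
  # If the assumed month cannot be a month but the assumed day can, swap them.
  if (m > 12L && d <= 12L) {
    tmp <- m
    m <- d
    d <- tmp
  }
  # Returns NA when the parts do not form a valid calendar date.
  as.Date(sprintf("%04d-%02d-%02d", y, m, d), format = "%Y-%m-%d")
}

parts_to_date("Jan.", "16", "")    # 2017-01-16 (year filled in from default_year)
parts_to_date("01", "02", "17")    # 2017-01-02 (month-first assumption)
parts_to_date("16", "01", "2017")  # 2017-01-16 (16 cannot be a month, so swapped)
```

Rows such as "02/01/2017" remain genuinely ambiguous. The helper simply applies the month-first assumption to them, so treat its output as a guess unless you know the source's convention.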