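I keep running into the problem of having to parse dates out of relatively unstructured text documents, where the date is embedded in the text and its position and format vary from case to case. An example text is: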
"Name of the city, name of the country, July 1st, 2015 - The group announces that it has completed the project initiated in November 2011. It has launched 12 other initiatives. The average revenue per initiative is USD 100."
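I would like to extract the date string "July 1st, 2015" from the text (step 1) and convert it to a format such as 2015-07-01 UTC (step 2). Step 2 can be done, for example, with parse_date_time from the lubridate package, which is useful because it handles many of the date formats that occur:
Case 1: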
library(lubridate)
parse_date_time("July 1st, 2015", "b d Y", locale = "C")
[1] "2015-07-01 UTC"
In some cases, parse_date_time also works on a larger string that contains the date. For example:
Case 2:
parse_date_time("Name of the city, name of the country, July 1st, 2015 - The group announces that it has completed the project initiated in November", "b d Y", locale = "C")
[1] "2015-07-01 UTC"
However, as far as I can tell, step 2 does not work directly on the full example text:
Case 3:
parse_date_time("Name of the city, name of the country, July 1st, 2015 - The group announces that it has completed the project initiated in November 2011. It has launched 12 other initiatives. The average revenue per initiative is USD 100.", "b d Y", locale = "C")
[1] NA
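Apparently, some of the additional information in the text makes it difficult to parse the date directly from the full text. I can think of an approach in which a regular expression performs step 1, extracting a reduced string that contains the date and on which parse_date_time works (similar to case 1 or case 2). However, using regular expressions for dates always feels a bit dirty, because a regular expression does not know whether it is extracting a valid date.
Is there a way to perform step 2 directly on unstructured text (i.e. without a regex-based workaround), as in the example above (case 3)?
Any input is much appreciated!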
Answer 0 (score: 0)
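Using this website, we can build a regular expression:
( [J, F, M, A, S, O, N, D])\w+ [1-31][th, st]\w+, [0-2100]\w+
But it does not work in R as written: bracket expressions such as [1-31] and [0-2100] match a single character rather than a range of numbers, and backslashes must be doubled inside an R string.
Once corrected, it does work: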
> x = "Name of the city, name of the country, July 1st, 2015 - The group announces that it has completed the project initiated in November 2011. It has launched 12 other initiatives. The average revenue per initiative is USD 100."
> m = regexpr(' [JFMASOND]\\w+ ([1-9]|[12][0-9]|3[0-1])(th|rd|nd|st), [12]\\d{3}', x)
> if (m > 0) substr(x, m, m + attr(m, 'match.length') - 1)
[1] " July 1st, 2015"
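To address the concern that a regular expression cannot tell whether it has found a valid date, the two steps can be chained: extract a candidate with the regex, then let a date parser accept or reject it. The following is only a minimal sketch under some assumptions: extract_date is a hypothetical helper name, it handles only the "Month Dth, YYYY" layout with full English month names, and it uses base R's as.Date for the validation step instead of lubridate. The extracted string could just as well be passed to parse_date_time as in case 1.

```r
# Hypothetical helper (not part of lubridate or base R): find the first
# "Month D(st|nd|rd|th), YYYY" date in free text and return it as a Date.
extract_date <- function(text) {
  # month.name is base R's built-in vector of English month names
  months  <- paste(month.name, collapse = "|")
  pattern <- sprintf("(%s) ([0-9]{1,2})(st|nd|rd|th)?, ([0-9]{4})", months)

  # regexec returns the first match together with its capture groups
  parts <- regmatches(text, regexec(pattern, text))[[1]]
  if (length(parts) == 0) return(as.Date(NA))  # nothing date-like found

  # parts: full match, month name, day, ordinal suffix, year
  iso <- sprintf("%s-%02d-%02d",
                 parts[5],
                 match(parts[2], month.name),
                 as.integer(parts[3]))

  # Step 2: as.Date yields NA for impossible dates such as February 30th
  as.Date(iso, format = "%Y-%m-%d")
}

x <- "Name of the city, name of the country, July 1st, 2015 - The group announces that it has completed the project initiated in November 2011. It has launched 12 other initiatives. The average revenue per initiative is USD 100."
print(extract_date(x))
# [1] "2015-07-01"
```

Because the month names are matched literally, "November 2011" in the example is skipped (it has no day), and a string that looks like a date but is not one, such as "February 30th, 2015", comes back as NA rather than as a wrong date.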