Parsing dates from text in R

Time: 2015-12-04 12:58:11

Tags: regex r date parsing lubridate

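I keep running into the problem of parsing dates from relatively unstructured text documents, where the date is embedded in the text and its position and format vary from case to case. A sample text: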

"Name of the city, name of the country, July 1st, 2015 - The group announces that it has completed the project initiated in November 2011. It has launched 12 other initiatives. The average revenue per initiative is USD 100."

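I would like to extract the date string "July 1st, 2015" from the text (step 1) and convert it to a format such as 2015-07-01 UTC (step 2). Step 2 can be done, for example, with parse_date_time from the lubridate package (which is useful because it handles several applicable date formats):

Case 1: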

library(lubridate)
parse_date_time("July 1st, 2015", "b d Y", locale = "C")
[1] "2015-07-01 UTC"

In some cases, parse_date_time also works on a larger string that contains the date. For example:

Case 2:

parse_date_time("Name of the city, name of the country, July 1st, 2015 - The group announces that it has completed the project initiated in November", "b d Y", locale = "C")
[1] "2015-07-01 UTC"

However, as far as I can tell, step 2 does not work directly on the full sample text:

Case 3:

parse_date_time("Name of the city, name of the country, July 1st, 2015 - The group announces that it has completed the project initiated in November 2011. It has launched 12 other initiatives. The average revenue per initiative is USD 100.", "b d Y", locale = "C")
[1] NA

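Apparently, some of the additional information in the text makes it difficult to parse the date directly from the full text. I can think of an approach in which a regular expression performs step 1, extracting a reduced string that contains the date and on which parse_date_time works (similar to Case 1 or Case 2). However, using regular expressions for dates always feels a bit dirty, because a regular expression does not know whether it is extracting a valid date.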

Is there a way to perform step 2 directly on unstructured text (i.e. without a regex-based workaround), as in the example above (Case 3)?

Any input is much appreciated!

1 answer:

Answer 0: (score: 0)

  

Using this website, we can build some regex code: (( [J, F, M, A, S, O, N, D])\w+ [1-31][th, st]\w+, [0-2100]\w+) but it doesn't work in R... :(

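It does work once corrected. A bracket expression such as [1-31] or [th, st] matches a single character rather than a numeric range or a whole suffix, so the day and the suffix have to be written as alternations, and each backslash has to be doubled inside an R string.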

> x = "Name of the city, name of the country, July 1st, 2015 - The group announces that it has completed the project initiated in November 2011. It has launched 12 other initiatives. The average revenue per initiative is USD 100."
> m = regexpr(' [JFMASOND]\\w+ ([1-9]|[12][0-9]|3[0-1])(th|rd|nd|st), [12]\\d{3}', x)
> if (m > 0) substr(x, m, m + attr(m, 'match.length') - 1)
[1] " July 1st, 2015"
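
The question's concern that a regex cannot tell whether it has extracted a valid date can be addressed by validating each match afterwards. The following is a minimal sketch in base R, not a definitive solution: extract_dates() is a hypothetical helper name, and it assumes dates written as an English month name, a day with an optional ordinal suffix, a comma and a four-digit year.

```r
# Hypothetical helper (not part of base R or lubridate): find every
# "Month Dth, YYYY" string in a text and return the matches as Date values.
# Calendar-invalid matches such as "February 30th, 2015" come back as NA.
extract_dates <- function(text) {
  # month.name is the built-in vector of English month names
  months  <- paste(month.name, collapse = "|")
  pattern <- paste0("\\b(", months, ") ([0-9]{1,2})(st|nd|rd|th)?, ([0-9]{4})\\b")

  # Step 1: pull out all substrings that look like a date
  hits <- regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
  if (length(hits) == 0) return(as.Date(character(0)))

  # Step 2: split each match into its components via the capture groups
  month <- match(sub(pattern, "\\1", hits, perl = TRUE), month.name)
  day   <- as.integer(sub(pattern, "\\2", hits, perl = TRUE))
  year  <- as.integer(sub(pattern, "\\4", hits, perl = TRUE))

  # ISOdate() yields NA for impossible dates, which acts as the validity check
  as.Date(ISOdate(year, month, day))
}

x <- "Name of the city, name of the country, July 1st, 2015 - The group announces that it has completed the project initiated in November 2011. It has launched 12 other initiatives. The average revenue per initiative is USD 100."
extract_dates(x)
```

Because the pattern requires a day and a comma, "November 2011" is skipped and only "July 1st, 2015" is returned as a date. The matched substring could equally be handed to parse_date_time as in Case 1 if lubridate's handling of multiple formats is needed.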