Question

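I keep running into the problem of parsing dates out of relatively unstructured text documents, where the date is embedded in the text and its position and format vary from case to case. An example text is: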

"Name of the city, name of the country, July 1st, 2015 - The group announces that it has completed the project initiated in November 2011. It has launched 12 other initiatives. The average revenue per initiative is USD 100."

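I would like to extract the date string "July 1st, 2015" from the text (step 1) and convert it to another format, e.g. 2015-07-01 UTC (step 2). Step 2 can be done, for example, with parse_date_time from the lubridate package (which is useful because it handles many different date formats):

Case 1: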

library(lubridate)
parse_date_time("July 1st, 2015", "b d Y", locale="C")
[1] "2015-07-01 UTC"

In some cases, parse_date_time also works on larger strings that contain a date. For example:

Case 2:

parse_date_time("Name of the city, name of the country, July 1st, 2015 - The group announces that it has completed the project initiated in November", "b d Y", locale="C")
[1] "2015-07-01 UTC"

However, as far as I can tell, step 2 does not work directly on the complete example text:

Case 3:

parse_date_time("Name of the city, name of the country, July 1st, 2015 - The group announces that it has completed the project initiated in November 2011. It has launched 12 other initiatives. The average revenue per initiative is USD 100.", "b d Y", locale="C")
[1] NA

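Apparently, some of the additional information in the text makes it difficult to parse the date directly from the full text. I can think of an approach in which a regular expression performs step 1, extracting a reduced string that contains the date and on which parse_date_time works (similar to case 1 or case 2). However, using regular expressions for dates always feels a bit dirty to me, because a regular expression does not know whether it has extracted a valid date.

Is there a way to perform step 2 directly on unstructured text (i.e. without a regex-based workaround), as in the example above (case 3)?

Any input is much appreciated!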

Answer 1

Using this website, we can construct a regular expression: ( [J, F, M, A, S, O, N, D])\w+ [1-31][th, st]\w+, [0-2100]\w+ but it doesn't work in R... :(
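One likely reason the pattern misbehaves, independent of R, is that a bracket expression such as [1-31] matches a single character (here 1, 2, or 3) and not the numbers 1 to 31; likewise [th, st] matches one of the characters t, h, s, comma, or space. A small base-R sketch of the difference (the test vector days is made up for illustration):

```r
# Hypothetical test values, not from the original post
days <- c("1", "3", "7", "31")

# [1-31] is the character range 1-3 plus the character 1: one character only
class_match <- grepl("^[1-31]$", days)
class_match
# TRUE TRUE FALSE FALSE

# An alternation is needed to express "a day number from 1 to 31"
alt_match <- grepl("^([1-9]|[12][0-9]|3[01])$", days)
alt_match
# TRUE TRUE TRUE TRUE
```

Note also that inside an R string literal the backslash has to be doubled (\\w), since a bare \w is not a valid string escape.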

Once corrected, it does work.

> x = "Name of the city, name of the country, July 1st, 2015 - The group announces that it has completed the project initiated in November 2011. It has launched 12 other initiatives. The average revenue per initiative is USD 100."
> m = regexpr(' [JFMASOND]\\w+ ([1-9]|[12][0-9]|3[0-1])(th|rd|nd|st), [12]\\d{3}', x)
> if (m > 0) substr(x, m, m + attr(m, 'match.length') - 1)
[1] " July 1st, 2015"
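To address the concern that a regex by itself cannot tell whether it matched a real date, the two steps can be chained: extract every candidate with a pattern, then let a date parser reject the impossible ones. The following is only a minimal base-R sketch, assuming English month names and the "Month Dth, YYYY" layout of the example; extract_dates is a hypothetical helper name, not a function from lubridate or from the original answer.

```r
# Hypothetical helper: regex extraction (step 1) plus validation (step 2)
extract_dates <- function(text) {
  # Full English month names, so "November 2011" (no day) is not matched
  months <- paste(month.name, collapse = "|")
  pattern <- paste0("(", months, ") ([12][0-9]|3[01]|[1-9])(st|nd|rd|th), [12][0-9]{3}")

  # Step 1: pull out every substring that looks like a date
  candidates <- regmatches(text, gregexpr(pattern, text))[[1]]

  # Drop the ordinal suffix and split: "July 1st, 2015" -> "July" "1" "2015"
  parts <- strsplit(sub("(st|nd|rd|th),", ",", candidates), "[ ,]+")

  # Build ISO strings by hand so the result does not depend on the locale
  iso <- vapply(parts, function(p) {
    sprintf("%s-%02d-%02d", p[3], match(p[1], month.name), as.integer(p[2]))
  }, character(1))

  # Step 2: as.Date() yields NA for impossible dates such as February 31st
  dates <- as.Date(iso, format = "%Y-%m-%d")
  dates[!is.na(dates)]
}

x <- "Name of the city, name of the country, July 1st, 2015 - The group announces that it has completed the project initiated in November 2011. It has launched 12 other initiatives. The average revenue per initiative is USD 100."
format(extract_dates(x))
# "2015-07-01"
```

If other date layouts are expected, the extracted candidates could instead be handed to parse_date_time as in case 1 of the question, at the cost of depending on lubridate.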

Parsing dates from text in R
