Reformatting scraped dates in R

Time: 2014-01-27 12:39:03

Tags: regex r date

I have stripped out the HTML, and now I have rows like this:

                               rows
1: for the Year Ended 31 March 2013

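I want to extract only the expression "31 March 2013". The text surrounding the expression may vary. The expression should then be converted to a date format, preferably 31-3-2013.

How can I solve this?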

3 answers:

Answer 0: (score: 3)

If the string contains no other numbers, you can use the following approach:

string <- "for the Year Ended 31 March 2013"

format(as.Date(sub(".*?(\\d+ \\w+ \\d+).*", "\\1", string), 
               "%d %B %Y"), "%d-%m-%Y")
# [1] "31-03-2013"

Here, sub extracts the relevant substring, as.Date converts it into a Date object, and format rewrites the date elements in day-month-year order.
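To make the three steps easier to follow, here is the same one-liner unrolled into separate calls. This is only a sketch: the intermediate names date_text and date_value are made up for illustration, and the %B parsing assumes that LC_TIME uses English month names.

```r
string <- "for the Year Ended 31 March 2013"

# Step 1: sub() replaces the whole string with the captured "day month year" group
date_text <- sub(".*?(\\d+ \\w+ \\d+).*", "\\1", string)
date_text
# [1] "31 March 2013"

# Step 2: as.Date() parses the text (assumes English month names in LC_TIME)
date_value <- as.Date(date_text, format = "%d %B %Y")
class(date_value)
# [1] "Date"

# Step 3: format() writes the Date back out as a day-month-year string
format(date_value, "%d-%m-%Y")
# [1] "31-03-2013"
```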


It also works with different surrounding text and with single-digit days:

string <- c("for the Year Ended 31 March 2013",
            "1 January 2013 the Year Began",
            "for the Year Ended 31 March 2013 and not now")
format(as.Date(sub(".*?(\\d+ \\w+ \\d+).*", "\\1", string),
       "%d %b %Y"), "%d-%m-%Y")
# [1] "31-03-2013" "01-01-2013" "31-03-2013"
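The question asked for 31-3-2013, without a leading zero on the month, while %d and %m always pad to two digits. One possible way to get the unpadded form, as a sketch rather than the answer's own method, is to convert the day and month to integers before pasting them together. The names d and unpadded are illustrative, and English month names in LC_TIME are assumed.

```r
string <- c("for the Year Ended 31 March 2013",
            "1 January 2013 the Year Began")
d <- as.Date(sub(".*?(\\d+ \\w+ \\d+).*", "\\1", string), "%d %B %Y")

# as.integer() drops the leading zeros that %d and %m always emit
unpadded <- paste(as.integer(format(d, "%d")),
                  as.integer(format(d, "%m")),
                  format(d, "%Y"), sep = "-")
unpadded
# [1] "31-3-2013" "1-1-2013"
```

The result is a character vector, so keep the Date object d around if you still need to sort or do date arithmetic.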

Answer 1: (score: 2)

Another option:

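library(stringr)
library(lubridate)
# xx is not defined in the answer; assumed here to hold the scraped text
xx <- "for the Year Ended 31 March 2013"
# The pattern needs a two-digit day and a four-digit year at the very end of the string
dmy(str_extract(xx, '[0-9]{2}.*[0-9]{4}$'))
# [1] "2013-03-31 UTC"   (newer lubridate versions return a Date, printed without "UTC")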
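Note that this pattern is stricter than the one in the previous answer: it requires a two-digit day and a year at the very end of the string. The sketch below uses base R only, so it runs without stringr or lubridate. The vector xx and the names pattern and looser are assumptions for illustration, since the answer does not define its input.

```r
xx <- c("for the Year Ended 31 March 2013",
        "1 January 2013 the Year Began",
        "for the Year Ended 31 March 2013 and not now")

# The answer's pattern only matches when the date ends the string
pattern <- "[0-9]{2}.*[0-9]{4}$"
grepl(pattern, xx)
# [1]  TRUE FALSE FALSE

# A looser pattern (an assumption, not part of the answer) allows one-digit days
# and trailing text
looser <- "[0-9]{1,2} [A-Za-z]+ [0-9]{4}"
found <- regmatches(xx, regexpr(looser, xx))
found
# [1] "31 March 2013"  "1 January 2013" "31 March 2013"
```

The looser pattern should also work inside str_extract, whose result can then be passed to dmy as in the answer.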

Answer 2: (score: 1)

rows <- c("for the Year Ended 31 March 2013 ... 31 March 2013 ...",
          "for the Year Ended 1 December 2011")
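# [A-Za-z] instead of [A-z]: the range A-z can also match punctuation between Z and a
m <- gregexpr("[0-9]+ [A-Za-z]+ [0-9]{4}", rows)
# %B parses month names in the current locale; uncomment if LC_TIME is not English
# ("english" is a Windows locale name, other systems use names such as "en_US.UTF-8")
# Sys.setlocale("LC_TIME", "english")
(res <- lapply(regmatches(rows, m), as.Date, "%d %B %Y"))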
# [[1]]
# [1] "2013-03-31" "2013-03-31"
# 
# [[2]]
# [1] "2011-12-01"
lapply(res, format, "%d-%m-%Y") # %d = day, %m = month number, %Y = four-digit year
# [[1]]
# [1] "31-03-2013" "31-03-2013"
# 
# [[2]]
# [1] "01-12-2011"
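Because %B depends on the LC_TIME locale, the parsing step above can return NA on a non-English system. One way to avoid touching the locale, shown here as a sketch and not as part of the answer, is to look the month up in R's built-in month.name vector and build an ISO date string. The names matched, parts, iso and dates are illustrative, and the sketch takes only the first date in each row.

```r
rows <- c("for the Year Ended 31 March 2013",
          "for the Year Ended 1 December 2011")

# First "day month year" match in each row
matched <- regmatches(rows, regexpr("[0-9]+ [A-Za-z]+ [0-9]{4}", rows))
parts <- strsplit(matched, " ")

# Build "YYYY-MM-DD" by looking the month up in month.name (always English)
iso <- vapply(parts, function(p) {
  sprintf("%s-%02d-%02d", p[3], match(p[2], month.name), as.integer(p[1]))
}, character(1))

dates <- as.Date(iso, format = "%Y-%m-%d")
format(dates, "%d-%m-%Y")
# [1] "31-03-2013" "01-12-2011"
```

This only recognises full English month names with the exact capitalisation used in month.name; abbreviated names would need a lookup in month.abb instead.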