I have stripped out the HTML, and now I have rows like this:
rows
1: for the Year Ended 31 March 2013
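I want to extract only the expression "31 March 2013". The text around the expression may vary. The expression should then be converted to a date format, preferably 31-3-2013.
How can I solve this?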
Answer 0 (score: 3)
If the string contains no other numbers, you can use the following approach:
string <- "for the Year Ended 31 March 2013"
format(as.Date(sub(".*?(\\d+ \\w+ \\d+).*", "\\1", string),
"%d %B %Y"), "%d-%m-%Y")
# [1] "31-03-2013"
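Here, sub extracts the relevant substring, as.Date parses it into an object of class Date, and format writes that date back out as text with the elements in the desired order.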
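The three steps can also be written out one at a time, which makes it easier to inspect each intermediate value. This is only a sketch of the same approach; the variable names date_text, date_value and formatted are introduced here for illustration, and the parsing step assumes an English LC_TIME locale so that "March" is recognised as a month name.

```r
string <- "for the Year Ended 31 March 2013"

# Step 1: keep only the "day month year" part captured by the parentheses
date_text <- sub(".*?(\\d+ \\w+ \\d+).*", "\\1", string)

# Step 2: parse the text into a Date (month names depend on the LC_TIME locale)
date_value <- as.Date(date_text, format = "%d %B %Y")

# Step 3: write the Date back out as day-month-year text
formatted <- format(date_value, "%d-%m-%Y")
```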
It also works with different surrounding text and with single-digit days:
string <- c("for the Year Ended 31 March 2013",
"1 January 2013 the Year Began",
"for the Year Ended 31 March 2013 and not now")
format(as.Date(sub(".*?(\\d+ \\w+ \\d+).*", "\\1", string),
"%d %b %Y"), "%d-%m-%Y")
# [1] "31-03-2013" "01-01-2013" "31-03-2013"
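The question asks for 31-3-2013, without leading zeros, whereas "%d-%m-%Y" always pads the day and month to two digits. One possible way to drop the padding is to convert the day and month to integers before pasting the pieces together. The helper name format_dmy_plain is hypothetical and not part of the answer above; ISO dates are used as input so the sketch does not depend on the locale.

```r
# Hypothetical helper: format a Date as day-month-year without zero padding
format_dmy_plain <- function(d) {
  paste(as.integer(format(d, "%d")),  # "31" -> 31, "01" -> 1
        as.integer(format(d, "%m")),  # "03" -> 3
        format(d, "%Y"),
        sep = "-")
}

d <- as.Date(c("2013-03-31", "2013-01-01"))
plain <- format_dmy_plain(d)
plain
# [1] "31-3-2013" "1-1-2013"
```

The result is a character vector, so it is suitable for display but no longer behaves like a Date.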
Answer 1 (score: 2)
Another option:
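library(stringr)
library(lubridate)
xx <- "for the Year Ended 31 March 2013"
# The pattern needs a two-digit day and a four-digit year at the very end of the string
dmy(str_extract(xx, '[0-9]{2}.*[0-9]{4}$'))
# [1] "2013-03-31 UTC"
# (recent lubridate versions return a Date here, printed as "2013-03-31")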
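Because that pattern is anchored with $ and requires two leading digits, it only matches when the date is the last thing in the string. The sketch below uses base R (grepl, regexpr, regmatches) rather than stringr, purely to show which of the earlier example strings the pattern accepts; the looser pattern in the second half is an assumption of this example, not something the answer proposes.

```r
xx <- c("for the Year Ended 31 March 2013",
        "1 January 2013 the Year Began",
        "for the Year Ended 31 March 2013 and not now")

# The pattern from the answer: the four-digit year must end the string
strict <- "[0-9]{2}.*[0-9]{4}$"
matched <- grepl(strict, xx)
matched
# [1]  TRUE FALSE FALSE

# A looser pattern (assumption): 1-2 digit day, month word, 4-digit year, anywhere
loose <- "[0-9]{1,2} [A-Za-z]+ [0-9]{4}"
found <- regmatches(xx, regexpr(loose, xx))
found
# [1] "31 March 2013"  "1 January 2013" "31 March 2013"
```

The extracted strings could then be handed to dmy or as.Date as before.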
Answer 2 (score: 1)
rows <- c("for the Year Ended 31 March 2013 ... 31 March 2013 ...",
"for the Year Ended 1 December 2011")
m <- gregexpr("[0-9]+ [A-Za-z]+ [0-9]{4}", rows)  # [A-z] would also match characters such as "[" and "_"
# Sys.setlocale("LC_TIME", "english")
(res <- lapply(regmatches(rows, m), as.Date, "%d %B %Y"))
# [[1]]
# [1] "2013-03-31" "2013-03-31"
#
# [[2]]
# [1] "2011-12-01"
lapply(res, format.Date, "%d-%m-%Y")  # not "%d-%e-%Y": %e is the space-padded day of month, not the month
# [[1]]
# [1] "31-03-2013" "31-03-2013"
#
# [[2]]
# [1] "01-12-2011"
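The result above is a list with one element per input row. If a single flat vector of dates is more convenient, one option is to combine the list elements with c, which keeps the Date class (unlist would return plain numbers instead). This is a small sketch; res is rebuilt here from ISO dates so it runs on its own and does not depend on the locale.

```r
# Stand-in for the list produced above, built from ISO dates
res <- list(as.Date(c("2013-03-31", "2013-03-31")),
            as.Date("2011-12-01"))

# Combine all list elements into one Date vector
all_dates <- do.call(c, res)

flat <- format(all_dates, "%d-%m-%Y")
flat
# [1] "31-03-2013" "31-03-2013" "01-12-2011"
```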