Regex to keep only the last number before a word in R (may be repeated)

Time: 2018-06-12 04:57:59

Tags: r regex dataframe

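I have a column in the following format that holds plain dates, a number followed by a date, and NA values. I want to keep only the date part and remove the other leading numbers in R, possibly using sub or gsub? Happy to accept an answer that helps me :)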

df <- data.frame(a=c(1:11), datecol=c("11 June 2018", NA, NA, "400 10 June 2017",NA,"5 05 June 2018", NA, NA, NA, NA, "25 15 May 2016"))

df.desired <- data.frame(a=c(1:11), datecol=c("11 June 2018", NA, NA, "10 June 2017",NA,"05 June 2018", NA, NA, NA, NA, "15 May 2016"))

2 Answers:

Answer 0: (score: 4)

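We can use sub to match a pattern of 1 or 2 digits (\\d{1,2}), followed by a space, a word (\\w+) for the month, a space, and finally 4 digits for the 'year'. This is captured as a group, and the replacement uses a backreference (\\1) to the captured group. The leading .*\\s+ consumes any extra number in front of the date; values that are already a bare date do not match and are returned unchanged.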

sub(".*\\s+(\\d{1,2}.*\\w+\\s+\\d{4}$)", "\\1", df$datecol)
#[1] "11 June 2018" NA             NA             "10 June 2017" NA            
#[6] "05 June 2018" NA             NA             NA             NA            
#[11] "15 May 2016" 
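
To make the clean-up stick, the result has to be assigned back to the column. A minimal sketch, assuming the df from the question; the names cleaned and expected are only illustrative, and stringsAsFactors = FALSE is added here so the column is plain character on older R versions:

```r
# Data from the question, kept as character strings rather than factors.
df <- data.frame(
  a = 1:11,
  datecol = c("11 June 2018", NA, NA, "400 10 June 2017", NA,
              "5 05 June 2018", NA, NA, NA, NA, "25 15 May 2016"),
  stringsAsFactors = FALSE
)

# Same pattern as above: drop everything before the captured date.
# NA stays NA, and strings that do not match are left as they are.
cleaned <- sub(".*\\s+(\\d{1,2}.*\\w+\\s+\\d{4}$)", "\\1", df$datecol)

# Overwrite the column with the cleaned values.
df$datecol <- cleaned

# The values the question asks for, for comparison.
expected <- c("11 June 2018", NA, NA, "10 June 2017", NA,
              "05 June 2018", NA, NA, NA, NA, "15 May 2016")
identical(df$datecol, expected)
```

Note that sub replaces only where the pattern matches, so a value in an unexpected format would pass through untouched rather than become NA.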

Answer 1: (score: 2)

You can also use the stringr package:

stringr::str_extract(df$datecol,"\\d{1,2}\\s+[a-zA-Z]+\\s+\\d{4}")

Output

> stringr::str_extract(df$datecol,"\\d{1,2}\\s+[a-zA-Z]+\\s+\\d{4}")
 [1] "11 June 2018" NA             NA             "10 June 2017"
 [5] NA             "05 June 2018" NA             NA            
 [9] NA             NA             "15 May 2016"
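
If stringr is not available, roughly the same extraction can be sketched in base R with regexpr and substring. This is an assumption-laden equivalent, not part of the stringr API; the names x, pat, m, hit and extracted are illustrative only:

```r
# Values from the question's datecol column.
x <- c("11 June 2018", NA, NA, "400 10 June 2017", NA,
       "5 05 June 2018", NA, NA, NA, NA, "25 15 May 2016")

# Same pattern as in the str_extract call.
pat <- "\\d{1,2}\\s+[a-zA-Z]+\\s+\\d{4}"

# regexpr gives the start of the first match (-1 if none, NA for NA input)
# and stores the match lengths in the "match.length" attribute.
m <- regexpr(pat, x)
hit <- !is.na(m) & m > 0

# Start from all NA, then fill in the matched substrings only.
extracted <- rep(NA_character_, length(x))
extracted[hit] <- substring(x[hit], m[hit],
                            m[hit] + attr(m, "match.length")[hit] - 1L)
extracted
```

Unlike the sub approach, extraction yields NA for any string that contains no date-like substring, which may or may not be what you want.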