Replace dates in a character vector with a specific format

Asked: 2017-04-06 13:41:14

Tags: r

I have the following character vector:

"On the evening of 2017-04-23, I was too tired"
"to complete my homework that was due on 24.04.2017."

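I need to find all the dates and replace them with the format Monthname D, YYYY.

I know the general format should be %B %d, %Y and that I probably have to use the sub() function, but I am not sure how to combine the two.

When I try something like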

sub("[0-9]{2}.[0-9]{2}.[0-9]{4}","%B %d, %Y",x)

I get the following result:

"On the evening of 2001-01-15, I was too tired to complete my homework that was due on %B %d, %Y."

Can someone help me figure out how to put these together?

With the help of Stack Overflow users, my new code is as follows:

streamlineDates <- function(x)
{
#set pattern to dates in form of YYYY-MM-DD or DD.MM.YYYY
pattern <- "\\d{2,4}[.-]\\d{2}[.-]\\d{2,4}"

y <- c(x)

val <- unlist(regmatches(y, gregexpr(pattern, y)))

val1 <- as.Date(val,format=c("%Y-%m-%d","%d.%m.%Y"))
val2 <- format(val1,"%B %d, %Y")

y1 <- list()
for (i in 1:length(y)){
  y1[i] <- gsub(pattern,val2[i],y[i])
}
}

However, when I enter only:

x <- "to complete my homework that was due on 24.04.2017."

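...it just returns NA. I have narrowed the problem down to gsub, where for the replacement value, "If NA, all elements in the result corresponding to matches will be set to NA". So when only the last line is entered, the first date is missing and it just returns NA.

How can I make it accept either one or both of the dates?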

2 Answers:

Answer 0 (score: 2)

First approach:

Using a base R solution (no packages):

pattern <- "\\d{2,4}[.-]\\d{2}[.-]\\d{2,4}"
rep <- c("On the evening of 2017-04-23, I was too tired","to complete my homework that was due on 24.04.2017.")


val <- unlist(regmatches(rep, gregexpr(pattern, rep)))

val1 <- as.Date(val,format=c("%Y-%m-%d","%d.%m.%Y"))
val2 <- format(val1,"%B %d, %Y")
val2
rep1 <- list()
for (i in 1:length(rep)){
rep1[i] <- gsub(pattern,val2[i],rep[i])
}

Answer:

do.call("c",rep1)

> do.call("c",rep1)                                                   
[1] "On the evening of April 23, 2017, I was too tired"      
[2] "to complete my homework that was due on April 24, 2017."
> 
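A note on why this first approach is sensitive to its input: `as.Date()` recycles the `format` vector against the matches by position, so the result is only right when the first match happens to be in the first format and the second match in the second. The sketch below is an illustration rather than part of the original answer. It reproduces the NA from the question with a single input string.

```r
# Illustration only: one match, two candidate formats
val <- "24.04.2017"

# The formats are recycled against the input, so the result has length 2:
# element 1 is parsed with "%Y-%m-%d" (fails, NA), element 2 with "%d.%m.%Y".
res <- as.Date(val, format = c("%Y-%m-%d", "%d.%m.%Y"))
print(res)

# A loop that uses the first element as the replacement passes NA to gsub(),
# and gsub() returns NA for every element that contains a match.
out <- gsub("\\d{2,4}[.-]\\d{2}[.-]\\d{2,4}", NA, "due on 24.04.2017.")
print(out)
```

Choosing the format from the shape of each match, instead of from its position, avoids this.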

Second approach:

Using the stringr library:

library(stringr)
rep <- c("On the evening of 2017-04-23, I was too tired","to complete my homework that was due on 24.04.2017.")
val <- str_extract(rep,"\\d{2,4}[.-]\\d{2}[.-]\\d{2,4}")
val1 <- as.Date(val,format=c("%Y-%m-%d","%d.%m.%Y"))
val2 <- format(val1,"%B %d, %Y")
rep1 <- str_replace_all(rep,"\\d{2,4}[.-]\\d{2}[.-]\\d{2,4}",val2)
rep1

Answer:

> rep1
[1] "On the evening of April 23, 2017, I was too tired"      
[2] "to complete my homework that was due on April 24, 2017."
> 

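Edit: after the OP changed the question slightly, here is a more general solution. It assumes that the month is always in the middle and that the separators are limited to dashes (-) and dots (.):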

pattern <- "\\d{2,4}[.-]\\d{2}[.-]\\d{2,4}"
rep <- c("On the evening of 2017-04-23, I was too tired","to complete my homework that was due on 24.04.2017.")


val <- unlist(regmatches(rep, gregexpr(pattern, rep)))

year <- regmatches(val, gregexpr("\\d{4}", val))

month <- regmatches(val, gregexpr("(?<=[.-])\\d{1,2}(?=[.-])", val,perl=T))

date <- regmatches(val, gregexpr("(?<=[.-])\\d{2}$|^\\d{2}(?=[.-])", val,perl=T))
#Extracting year month and date , assuming month always falls in middle string

date1 <- paste0(year,"-",month,"-",date)
date1 <- as.Date(date1,"%Y-%m-%d")
val2 <- format(date1,"%B %d, %Y")

rep1 <- list()
for (i in 1:length(rep)){
  rep1[i] <- gsub(pattern,val2[i],rep[i])
}


do.call("c",rep1) 
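The loop above still assumes exactly one date per element. As a hedged alternative, here is a base R sketch that replaces each match in place with the `regmatches<-` assignment form, so elements with zero, one, or several dates are all handled. The names `date_pattern`, `reformat_one`, and `streamline_dates` are hypothetical and not from the original post, and the pattern is narrowed to the two formats in the question. Note that `%B` follows the current locale.

```r
# Hypothetical names; a sketch under the two date formats from the question.
date_pattern <- "\\d{4}-\\d{2}-\\d{2}|\\d{2}\\.\\d{2}\\.\\d{4}"

reformat_one <- function(s) {
  # Choose the format from the shape of the match, not from its position
  fmt <- if (grepl("^\\d{4}-", s)) "%Y-%m-%d" else "%d.%m.%Y"
  d <- as.Date(s, format = fmt)
  if (is.na(d)) return(s)  # leave unparseable matches (e.g. month 13) untouched
  format(d, "%B %d, %Y")
}

streamline_dates <- function(x) {
  m <- gregexpr(date_pattern, x)
  # regmatches(x, m) is a list with one character vector of matches per element
  regmatches(x, m) <- lapply(
    regmatches(x, m),
    function(v) vapply(v, reformat_one, character(1), USE.NAMES = FALSE)
  )
  x
}

x <- c("On the evening of 2017-04-23, I was too tired",
       "to complete my homework that was due on 24.04.2017.")
print(streamline_dates(x))
print(streamline_dates(x[2]))
```

Because no replacement is ever looked up by index, passing only the second string no longer produces NA.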

Answer 1 (score: 1)

First, you need to specify all the date formats. Then convert to dates and use format to get the desired output, i.e.

#Note that I don't use any delimiter in the formatting simply because 
#I will use gsub to replace all except the numbers with '' from the string
v1 <- c('%Y%m%d', '%d%m%Y')

format(as.Date(gsub('\\D+', '', x), format = v1), "%B %d, %Y")
#[1] "April 23, 2017" "April 24, 2017"

You can then use str_replace_all from the stringr package with a (rather ugly) regular expression, i.e.

stringr::str_replace_all(x, '\\d+-\\d+-\\d+|\\d+\\.\\d+\\.\\d+', 
                         format(as.Date(gsub('\\D+', '', x), format = v1), "%B %d, %Y"))

#[1] "On the evening of April 23, 2017, I was too tired"      
#[2] "to complete my homework that was due on April 24, 2017."
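One caveat worth illustrating: `gsub('\\D+', '', x)` removes every non-digit in the whole string, so any other number in the sentence ends up glued to the date. The sketch below uses a made-up input string (an assumption, not from the question) to show the effect, along with one possible way around it: extract the date first with `regexpr()` and `regmatches()`, then strip the separators.

```r
# Made-up input with an extra number before the date
x2 <- "Room 12: homework was due on 24.04.2017."

# Stripping all non-digits merges "12" with the date digits
stripped_all <- gsub("\\D+", "", x2)
print(stripped_all)

# Extracting the date first keeps only the digits that belong to it
date_only <- regmatches(x2, regexpr("\\d{2}\\.\\d{2}\\.\\d{4}", x2))
stripped_date <- gsub("\\D+", "", date_only)
print(stripped_date)
```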