Extract regex matches from a character vector in base R

Time: 2015-05-19 10:56:22

Tags: regex r

Say I have a character vector test:

test = c("2014-03-02","2012-09-08","2010-12-11")

I would like the result to be the years as numeric values, like so:

c(2014,2012,2010)

How can I do this in a simple way? Currently the following works, but it is not very pretty:

test    = c("2014-03-02","2012-09-08","2010-12-11")
tmp     = strsplit(test,split="-")
myYears = as.numeric(unlist(lapply(tmp, function(x) x[[1]])))

I am sure this can be done differently with a regular expression, using "\d{4}" in some way?

5 Answers:

Answer 0: (score: 3)

You can try:

as.numeric(substr(test,1,4))

Or:

as.numeric(gsub("^([0-9]{4}).+$","\\1",test))

Another option:

as.numeric(strftime(test,format="%Y"))

Or:

as.POSIXlt(test)$year+1900
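
As a quick sanity check, the four alternatives above can be run side by side on the sample vector. This is only a sketch: the variable names are made up for illustration, and the two date-based variants assume every element is a valid date in YYYY-MM-DD form.

```r
test <- c("2014-03-02", "2012-09-08", "2010-12-11")

# String-based: take the first four characters
by_substr <- as.numeric(substr(test, 1, 4))

# Regex-based: keep only the captured four leading digits
by_gsub <- as.numeric(gsub("^([0-9]{4}).+$", "\\1", test))

# Date-based: parse as a date-time, then format or read the year field
by_strftime <- as.numeric(strftime(test, format = "%Y"))
by_posixlt  <- as.POSIXlt(test)$year + 1900  # $year is stored as years since 1900

# All four should agree on well-formed input
same <- all(by_gsub == by_substr,
            by_strftime == by_substr,
            by_posixlt == by_substr)
print(same)
```

The string-based variants do not validate the dates, so they would also accept something like "2014-99-99", whereas the date-based variants depend on the input being parseable.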

Answer 1: (score: 2)

You can use the sub function. Just replace everything from the first - up to the last character with an empty string.

> test = c("2014-03-02","2012-09-08","2010-12-11")
> sub("-.*", "", test)
[1] "2014" "2012" "2010"
> as.numeric(sub("-.*", "", test))
[1] 2014 2012 2010
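
To make the behaviour of the pattern explicit: "-.*" matches the first hyphen and everything after it, so whatever precedes that hyphen is kept. A small sketch, where the inputs other than the first are invented for illustration:

```r
# Inputs with two hyphens, no hyphen, and one hyphen
x <- c("2014-03-02", "2012", "2010-12")

# sub() replaces the first match only; ".*" is greedy, so the single
# match runs from the first hyphen to the end of the string
res <- sub("-.*", "", x)
print(res)

# A string without a hyphen has no match and is returned unchanged
years <- as.numeric(res)
print(years)
```

Because the pattern keys on the hyphen rather than on digits, it would also return non-year text unchanged if an element contained no hyphen at all.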


Answer 2: (score: 0)

Why not do the following, which does everything you did in a single line:

as.numeric(sapply(strsplit(test, "-"), '[', 1))

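This 1. splits each element of the vector on -, 2. selects the first item of each piece and simplifies the result to a vector, and 3. converts it to numeric.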
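
The three steps can be unpacked as follows. This is a sketch with illustrative variable names; the vapply variant at the end is a suggested alternative, not part of the original answer.

```r
test <- c("2014-03-02", "2012-09-08", "2010-12-11")

# Step 1: split on "-" to get a list with one character vector per date
parts <- strsplit(test, "-")

# Step 2: apply the subsetting function `[` with index 1 to each list
# element; sapply simplifies the result to a character vector
first <- sapply(parts, `[`, 1)

# Step 3: convert the character vector to numeric
years <- as.numeric(first)
print(years)

# Alternative: vapply enforces that each result is a single string
years_strict <- as.numeric(vapply(parts, `[`, FUN.VALUE = character(1), 1))
```

vapply fails loudly if any element does not yield exactly one string, which can be preferable to sapply's silent simplification in non-interactive code.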

Answer 3: (score: 0)

test = c("2014-03-02","2012-09-08","2010-12-11")

years <- as.numeric(regmatches(test , regexpr("^\\d+" , test)))

Explanation

test = c("2014-03-02","2012-09-08","2010-12-11")

# regexpr() returns two things:
# 1. the index of the first match of the regular expression in each element
# 2. the length (in characters) of that match
indices <- regexpr("^\\d+" , test)
# [1] 1 1 1
#attr(,"match.length")
#[1] 4 4 4

# 1. in this case the first match starts at the first character of each date,
# so the result is 1 for every element
# 2. the regular expression matches 4 digits from the beginning of the string,
# so the match length is 4


# the indices variable now holds the position of the first matched
# character and the number of characters matched from that position;
# pass it to regmatches(), which uses this information to extract
# only the matched part of the input
matches <- regmatches(test , indices)
# [1] "2014" "2012" "2010"

# Look again at the indices variable: regmatches() takes a substring of
# each input element starting at the match index. How many characters
# does it take? Only 4, based on the match lengths stored in
# indices

Hope this clarifies the code.
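
One thing to keep in mind with this approach: when regmatches() is given the result of regexpr(), elements without a match are dropped, so the output can be shorter than the input. The sketch below uses the "\\d{4}" pattern from the question and an invented non-matching element to show this, along with one possible way to keep the result aligned with the input.

```r
# The second element is invented for illustration and contains no year
x <- c("2014-03-02", "n/a", "2010-12-11")

m <- regexpr("^\\d{4}", x)     # -1 marks elements without a match
dropped <- regmatches(x, m)    # non-matching elements are omitted
print(dropped)

# Keep the result aligned with the input: start from NA and fill in
# only the positions that actually matched
hit <- as.integer(m) != -1L
years <- rep(NA_real_, length(x))
years[hit] <- as.numeric(regmatches(x, m))
print(years)
```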

Answer 4: (score: 0)

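Even though you asked for a base R solution, I just wanted to let you know that there is a very simple way to extract the years directly as numeric values, using the function year from the lubridate package: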

test = c("2014-03-02","2012-09-08","2010-12-11")

library(lubridate)    
year_test <- year(test)

year_test
#[1] 2014 2012 2010
is.numeric(year_test)
# [1] TRUE