Cleaning up dates (years in particular) with regular expressions

Time: 2015-07-06 22:04:35

Tags: regex r

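My database contains an unvalidated year field. Most entries are 4-digit years, but about 10% of them are "whatever". This has sent me down the regular-expression rabbit hole to little avail. Even if I can't extract 100% of the years, I would be happy with better results than I get now.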

#what a mess
yearEntries <- c("79, 80, 99","07-26-08","07-26-2008","'96  ","Early 70's","93/95","late 70's","15","late 60s","Late 70's",NA,"2013","1992-1993")
#does a good job with any string containing a 4-digit year
as.numeric(sub('\\D*(\\d{4}).*', '\\1', yearEntries))
#does a good job with any string containing a 2-digit year, nought else
as.numeric(sub('\\D*(\\d{2}).*', '\\1', yearEntries))

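The desired output is the first readable year in each entry, so "1992-1993" would become 1992 and "70's" would become 1970.

How can I improve the parsing accuracy? Thanks!

EDIT: Based on garyh's answer, this gets me closer: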

sub("\\D*((?<!\\d)\\d{2}(?!\\-|\\d)|\\d{4}).*","\\1",yearEntries,perl=TRUE)
# [1] "79"        "07-2608"   "07-262008" "96"        "70"        "93"        "70"        "15"        "60"       "70"        NA          "2013"      "1992"

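Note, however, that while the dates containing dashes work in garyh's regex101.com demo, they do not work in R, which keeps the month and day values and the first dash.

I also realized that I had not included an example date with slashes instead of dashes. An extra term in the regex should handle that, but again R does not produce the same (correct) result as regex101.com: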

sub("\\D*((?<!\\d)\\d{2}(?!\\-|\\/|\\d)|\\d{4}).*","\\1","07/09/13",perl=TRUE)
# [1] "07/0913"

These negative lookbehinds and lookaheads are very powerful, but they stretch my feeble brain.
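A likely explanation for the difference, offered as a sketch rather than a definitive diagnosis: the regex101.com demo highlights matches, whereas sub() replaces only the part of the string that the whole pattern matched and leaves everything else in place. Because the pattern is not anchored, its first match in "07-26-08" starts at "-08" (\\D* consumes the dash), so only that part is rewritten and "07-26" survives, giving "07-2608". Anchoring both ends, with a lazy prefix so the leftmost acceptable year wins, makes the match cover the whole string. The names `pattern`, `dates` and `first_year` and the `[-/\\d]` character class are illustrative choices, not part of the original post.

```r
# Same lookarounds as in the edit above, but anchored with ^ and $ so that
# sub() replaces the entire string with the captured year.
# Assumption: the first 2-digit or 4-digit group that passes the lookarounds
# is the year we want.
pattern <- "^.*?((?<!\\d)\\d{2}(?![-/\\d])|\\d{4}).*$"

dates <- c("07-26-08", "07-26-2008", "07/09/13", "1992-1993")

# perl = TRUE is required for lookbehind/lookahead and the lazy quantifier
first_year <- sub(pattern, "\\1", dates, perl = TRUE)
first_year
```

With these four inputs the result should be "08", "2008", "13" and "1992". Strings that contain no acceptable digit group are returned unchanged by sub(), so they would still need separate handling.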

4 answers:

Answer 0 (score: 2)

I'm not sure what flavor of regex R uses, but this seems to get all the years in the strings:
/((?<!\d)\d{2}(?!\-|\d)|\d{4})/g

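It matches any 4 digits, or any 2 digits provided they are not followed by a dash (-) or a digit and are not preceded by another digit.

See the demo on regex101.com.

Answer 1 (score: 1)

You'll need some elbow grease, and to do something like this: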

library(lubridate)

yearEntries <- c("79, 80, 99","07-26-08","07-26-2008","'96  ","Early 70's","93/95","late 70's","15","late 60s","Late 70's",NA,"2013","1992-1993")

x <- yearEntries
x <- gsub("(late|early)", "", x, ignore.case=TRUE)
x <- gsub("[']*[s]*", "", x)
x <- gsub(",.*$", "", x)
x <- gsub(" ", "", x)
x <- ifelse(nchar(x)==9 | nchar(x)<8, gsub("[-/]+[[:digit:]]+$", "", x), x)
x <- ifelse(nchar(x)==4, gsub("^[[:digit:]]{2}", "", x), x)
y <- format(parse_date_time(x, "%m-%d-%y!"), "%y")

yearEntries <-ifelse(!is.na(y), y, x)

yearEntries
##  [1] "79" "08" "08" "96" "70" "93" "70" "15" "60" "70" NA   "13" "92"

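We have no idea which year you need from the range entries, but this should get you started.

Answer 2 (score: 0)

I found a very simple way to get a good result (though I won't claim it is bulletproof). It grabs the last year in each entry, which is fine too.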

yearEntries <- c("79, 80, 99","07/26/08","07-26-2008","'96  ","Early 70's","93/95","15",NA,"2013","1992-1993","ongoing")
# assume last two digits present in any string represent a 2-digit year 
a<-sub(".*(\\d{2}).*$","\\1",yearEntries)
#  [1] "99"      "08"      "08"      "96"      "70"      "95"      "15"      NA        "13"      "93"      "ongoing"
# change to numeric, strip NAs and add 2000
b<-na.omit(as.numeric(a))+2000
# [1] 2099 2008 2008 2096 2070 2095 2015 2013 2093
# assume any greater than present is last century
b[b>2015]<-b[b>2015]-100
#  [1] 1999 2008 2008 1996 1970 1995 2015 2013 1993

...and Bob's your uncle!
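One caveat with the steps above is that na.omit() drops entries, so the result no longer lines up with the original vector. A minimal variant that keeps one value per input, assuming the same 2015 cutoff, might look like the following. The names `two`, `pivot` and `yr` are hypothetical.

```r
yearEntries <- c("79, 80, 99", "07/26/08", "07-26-2008", "'96  ", "Early 70's",
                 "93/95", "15", NA, "2013", "1992-1993", "ongoing")

# Take the last two-digit run in each string, as in the answer above
two <- sub(".*(\\d{2}).*$", "\\1", yearEntries)

# sub() returns strings without a match unchanged (e.g. "ongoing"),
# so blank out anything that is not exactly two digits
two[!grepl("^\\d{2}$", two)] <- NA

pivot <- 2015  # assumption: years after this belong to the previous century
yr <- as.numeric(two) + 2000
yr <- ifelse(!is.na(yr) & yr > pivot, yr - 100, yr)
yr
```

The result has the same length as yearEntries, with NA wherever no two-digit run was found.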

Answer 3 (score: 0)

@garyh's regex works just fine if you use the gregexpr / regmatches combination to extract the pattern instead of sub. The call below returns every match for each entry, from which only the first can then be kept:

regmatches(yearEntries, 
           gregexpr("(?<!\\d)\\d{2}(?!-|\\/|\\d)|\\d{4}",yearEntries,perl=TRUE))
[[1]]
[1] "79" "80" "99"

[[2]]
[1] "08"

[[3]]
[1] "2008"

[[4]]
[1] "96"

[[5]]
[1] "70"

[[6]]
[1] "95"

[[7]]
[1] "70"

[[8]]
[1] "15"

[[9]]
[1] "60"

[[10]]
[1] "70"

[[11]]
character(0)

[[12]]
[1] "2013"

[[13]]
[1] "1992" "1993"
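Keeping only the first match per entry can be sketched as follows. This is one possible way to do it, not necessarily the code from the original answer. The names `pattern`, `matches` and `first_match` are hypothetical.

```r
yearEntries <- c("79, 80, 99", "07-26-08", "07-26-2008", "'96  ", "Early 70's",
                 "93/95", "late 70's", "15", "late 60s", "Late 70's", NA,
                 "2013", "1992-1993")

pattern <- "(?<!\\d)\\d{2}(?!-|\\/|\\d)|\\d{4}"

# All matches per entry, as a list of character vectors
matches <- regmatches(yearEntries,
                      gregexpr(pattern, yearEntries, perl = TRUE))

# First match of each entry, or NA when there is none
first_match <- vapply(matches,
                      function(m) if (length(m) > 0) m[1] else NA_character_,
                      character(1))
first_match
```

Note that for "93/95" the first match is "95", because the lookahead rejects a two-digit group that is followed by a slash.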