I was given a dataset with an ambiguous date format, for example:
d_raw <- c("1102001 23:00", "1112001 0:00")
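I want to parse these dates into POSIXlt objects in R. The source of the file assures me that the file is in chronological order, that the date format is month, then day, then year, and that there are no gaps in the time series.
Is there a way to parse this date format, using the ordering to resolve the ambiguity? For example, the vector above should parse to c("2001-01-10 23:00:00", "2001-01-11 00:00:00") rather than c("2001-01-10 23:00:00", "2001-11-01 00:00:00").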
Answer 0: (score: 3)
How about this (using a regular expression):
d_raw <- c("192001 16:00", "1102001 23:00", "1112001 0:00")
re <- "^(.+?)([1-9]|[1-3][0-9])(\\d{4}) (\\d{1,2}):(\\d{2})$"
m <- regexec(re, d_raw)
parts <- regmatches(d_raw, m)
lapply(parts, function(x) {
  # drop the full match; keep month, day, year, hour, minute as numbers
  x <- as.numeric(x[-1])
  ISOdate(x[3], x[1], x[2], x[4], x[5])
})
# [[1]]
# [1] "2001-01-09 16:00:00 GMT"
#
# [[2]]
# [1] "2001-01-10 23:00:00 GMT"
#
# [[3]]
# [1] "2001-01-11 GMT"
If you have more test cases, run them through it just to make sure the regular expression works as intended.
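The regular expression above appears to prefer the shortest month that matches, so by itself it does not use the ordering of the file. One way to bring the ordering in is sketched below: list every valid month/day split of each string, then keep the earliest candidate that is still later than the previous timestamp. The helper names `candidate_dates` and `parse_ordered` and the `tz = "UTC"` default are assumptions made for illustration, not part of the original answer, and the greedy choice can still go wrong when the very first element is ambiguous.

```r
# Hypothetical helper: every valid month/day reading of an "mdyyyy" string.
candidate_dates <- function(s) {
  n <- nchar(s)
  year <- substr(s, n - 3, n)
  md <- substr(s, 1, n - 4)            # month and day digits, 2 to 4 characters
  iso <- character(0)
  for (k in 1:2) {                     # the month is one or two digits long
    if (nchar(md) <= k) next
    month <- substr(md, 1, k)
    day <- substr(md, k + 1, nchar(md))
    # unpadded fields never start with "0", and a day has at most two digits
    if (nchar(day) > 2 || startsWith(month, "0") || startsWith(day, "0")) next
    iso <- c(iso, paste(year, month, day, sep = "-"))
  }
  d <- as.Date(iso, format = "%Y-%m-%d")  # impossible dates become NA
  d[!is.na(d)]
}

# Hypothetical resolver: greedily pick the earliest candidate that keeps the
# series strictly increasing. Assumes "mdyyyy H:MM" input in file order.
parse_ordered <- function(d_raw, tz = "UTC") {
  res <- numeric(length(d_raw))
  prev <- -Inf
  for (i in seq_along(d_raw)) {
    fields <- strsplit(d_raw[i], " ", fixed = TRUE)[[1]]
    dates <- candidate_dates(fields[1])
    if (length(dates) == 0) stop("no valid date reading for element ", i)
    stamps <- as.numeric(as.POSIXct(paste(format(dates), fields[2]),
                                    format = "%Y-%m-%d %H:%M", tz = tz))
    ok <- sort(stamps[stamps > prev])
    if (length(ok) == 0) stop("no reading keeps the series increasing at element ", i)
    res[i] <- prev <- ok[1]
  }
  as.POSIXlt(as.POSIXct(res, origin = "1970-01-01", tz = tz))
}

parse_ordered(c("1102001 23:00", "1112001 0:00"))
```

Because the file is said to have no gaps, a stricter variant could require each step to equal the known sampling interval instead of merely being positive.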
Answer 1: (score: 2)
I pity you for having such a horrible data vendor, so I decided to try to solve this for you.
# make up some horrid data
d_bad <- as.POSIXlt(seq(as.Date("2014-01-01"), as.Date("2014-12-31"), by=1))
d_raw <- paste0(d_bad$mon+1, d_bad$mday, d_bad$year+1900)
d_new <- d_raw
# not ambiguous when nchar is 6
d_new <- ifelse(nchar(d_new)==6,
  paste0("0", substr(d_new,1,1), "0", substr(d_new,2,nchar(d_new))), d_new)
# now not ambiguous when nchar is 7 and it doesn't begin with a "1"
d_new <- ifelse(nchar(d_new)==7 & substr(d_new,1,1) != "1",
  paste0("0",d_new), d_new)
# now guess a leading zero and parse
d_new <- ifelse(nchar(d_new)==7, paste0("0",d_new), d_new)
d_try <- as.Date(d_new, "%m%d%Y")
# now only days in October, November, and December might be wrong
bad <- cumsum(c(1L,as.integer(diff(d_try)))-1L) < 0L
# put the leading zero in the day, but remember "bad" rows have an
# extra leading zero, so make sure to skip it
d_try2 <- ifelse(bad,
  paste0(substr(d_new,2,3),"0", substr(d_new,4,nchar(d_new))), d_new)
# convert to Date, POSIXlt, whatever and do a happy dance
d_YAY <- as.Date(d_try2, "%m%d%Y")
# show only the first rows; the full data frame has one row per day of 2014
head(data.frame(d_raw, d_new, d_try, bad, d_try2, d_YAY))
#    d_raw    d_new      d_try   bad   d_try2      d_YAY
# 1 112014 01012014 2014-01-01 FALSE 01012014 2014-01-01
# 2 122014 01022014 2014-01-02 FALSE 01022014 2014-01-02
# 3 132014 01032014 2014-01-03 FALSE 01032014 2014-01-03
# 4 142014 01042014 2014-01-04 FALSE 01042014 2014-01-04
# 5 152014 01052014 2014-01-05 FALSE 01052014 2014-01-05
# 6 162014 01062014 2014-01-06 FALSE 01062014 2014-01-06
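I only did this with Date to keep the example dataset small. Doing it for POSIXlt is very similar, except that you need to change the as.Date calls to as.POSIXlt and adjust the format accordingly.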
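As a minimal sketch of that adjustment, assuming the date part has already been padded to eight digits as above and is followed by a space and an "H:MM" time (the sample strings, the variable names, and the `tz = "UTC"` choice are assumptions for illustration):

```r
# Assumed input: eight date digits (mmddYYYY), a space, then the time.
d_padded <- c("01102001 23:00", "01112001 0:00")

# Same idea as as.Date(x, "%m%d%Y"), with the time fields added to the format.
d_lt <- as.POSIXlt(d_padded, format = "%m%d%Y %H:%M", tz = "UTC")

# With no gaps in the series, consecutive stamps differ by a constant step,
# so a negative difference flags a row whose month and day were split wrongly.
step_hours <- as.numeric(diff(as.POSIXct(d_lt)), units = "hours")
bad_step <- step_hours < 0
```

The `bad` check from the Date version carries over in the same way once the expected step is expressed in the time unit of the data rather than in days.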