Parsing ambiguous timestamps

Time: 2014-08-19 14:59:33

Tags: r datetime

I have been given a dataset with an ambiguous date format, for example:

d_raw <- c("1102001 23:00", "1112001 0:00")

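I would like to parse these dates into POSIXlt objects in R. The source of the file assures me that the file is in chronological order, that the date format is always month, then day, then year, and that there are no gaps in the time series.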

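Is there a way to parse this date format, using the ordering to resolve the ambiguity? For example, the vector above should parse as c("2001-01-10 23:00:00", "2001-01-11 00:00:00") rather than c("2001-01-10 23:00:00", "2001-11-01 00:00:00").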

2 answers:

Answer 0 (score: 3)

How about this (using a regular expression)?

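d_raw <- c("192001 16:00", "1102001 23:00", "1112001 0:00")

# groups: month (lazy, so as short as possible), day (1-39), 4-digit year, hour, minute
re <- "^(.+?)([1-9]|[1-3][0-9])(\\d{4}) (\\d{1,2}):(\\d{2})$"
m <- regexec(re, d_raw)
parts <- regmatches(d_raw, m)
lapply(parts, function(x) {
    # drop the full match, keep the five captured groups as numbers
    x <- as.numeric(x[-1])
    # ISOdate(year, month, day, hour, min) returns a POSIXct in GMT
    ISOdate(x[3], x[1], x[2], x[4], x[5])
})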

# [[1]]
# [1] "2001-01-09 16:00:00 GMT"
# 
# [[2]]
# [1] "2001-01-10 23:00:00 GMT"
# 
# [[3]]
# [1] "2001-01-11 GMT"

It would help if you had more test cases, just to make sure the regular expression works properly.
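As a rough illustration of what such extra test cases might look like, the sketch below runs the same regular expression over a few more inputs. The helper parse_md and the additional test strings are made up for this example and are not part of the original answer. Because the month group is lazy, strings that could be read either way (such as "1112001") come out as January dates, which matches the sample in the question but would not recover a genuine November 1.

```r
# Same regular expression as in the answer above.
re <- "^(.+?)([1-9]|[1-3][0-9])(\\d{4}) (\\d{1,2}):(\\d{2})$"

# Hypothetical helper (not from the original answer): returns the
# captured month and day as a "month-day" string for each input.
parse_md <- function(x) {
  parts <- regmatches(x, regexec(re, x))
  vapply(parts, function(p) paste(p[2], p[3], sep = "-"), character(1))
}

# Made-up test strings in the question's layout.
tests <- c("192001 16:00",   # only readable as Jan 9
           "1012001 0:00",   # only readable as Oct 1
           "10102001 0:00",  # only readable as Oct 10
           "1112001 0:00",   # Jan 11 or Nov 1
           "1212001 0:00")   # Jan 21 or Dec 1

res <- parse_md(tests)
print(res)
```

If the data really contain the first nine days of November or December, the regular expression alone cannot tell them apart from January 11-19 or January 21-29, so the ordering information would still be needed for those rows.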

Answer 1 (score: 2)

I pity you for having such a horrible data vendor, so I decided to try to solve this for you.

# make up some horrid data
d_bad <- as.POSIXlt(seq(as.Date("2014-01-01"), as.Date("2014-12-31"), by=1))
d_raw <- paste0(d_bad$mon+1, d_bad$mday, d_bad$year+1900)

d_new <- d_raw
# not ambiguous when nchar is 6
d_new <- ifelse(nchar(d_new)==6,
  paste0("0", substr(d_new,1,1), "0", substr(d_new,2,nchar(d_new))), d_new)
# now not ambiguous when nchar is 7 and it doesn't begin with a "1"
d_new <- ifelse(nchar(d_new)==7 & substr(d_new,1,1) != "1",
  paste0("0",d_new), d_new)
# now guess a leading zero and parse
d_new <- ifelse(nchar(d_new)==7, paste0("0",d_new), d_new)
d_try <- as.Date(d_new, "%m%d%Y")

# now only days in October, November, and December might be wrong
bad <- cumsum(c(1L,as.integer(diff(d_try)))-1L) < 0L
# put the leading zero in the day, but remember "bad" rows have an
# extra leading zero, so make sure to skip it
d_try2 <- ifelse(bad,
  paste0(substr(d_new,2,3),"0", substr(d_new,4,nchar(d_new))), d_new)
# convert to Date, POSIXlt, whatever and do a happy dance
d_YAY <- as.Date(d_try2, "%m%d%Y")

data.frame(d_raw, d_new, d_try, bad, d_try2, d_YAY)
#        d_raw    d_new      d_try   bad   d_try2      d_YAY
# 1     112014 01012014 2014-01-01 FALSE 01012014 2014-01-01
# 2     122014 01022014 2014-01-02 FALSE 01022014 2014-01-02
# 3     132014 01032014 2014-01-03 FALSE 01032014 2014-01-03
# 4     142014 01042014 2014-01-04 FALSE 01042014 2014-01-04
# 5     152014 01052014 2014-01-05 FALSE 01052014 2014-01-05
# 6     162014 01062014 2014-01-06 FALSE 01062014 2014-01-06

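I only did this with Date to keep the example dataset small. Doing it for POSIXlt would be very similar, except that you would need to change the as.Date calls to as.POSIXlt and adjust the format accordingly.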
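A minimal sketch of how that adaptation might look is given here, under some assumptions that are not in the original answer: the timestamps are hourly with no gaps, the time zone is UTC, and the first row parses correctly. The variable names t_true, t_part, step, d_fix and d_out are made up for illustration, and the two "prepend a zero to 7-character dates" steps are merged because they have the same effect. The gap check compares each step against the expected one-hour spacing instead of one day.

```r
# Made-up hourly data in the question's "mdY H:M" layout (assumptions:
# regular one-hour spacing, no gaps, UTC so daylight saving cannot interfere).
t_true <- seq(as.POSIXct("2014-09-29 00:00", tz = "UTC"),
              as.POSIXct("2014-11-12 23:00", tz = "UTC"), by = "hour")
lt <- as.POSIXlt(t_true)
d_raw <- paste0(lt$mon + 1, lt$mday, lt$year + 1900, " ", lt$hour, ":00")

# Split into the ambiguous date digits and the unambiguous time of day.
d_new  <- sub(" .*$", "", d_raw)
t_part <- sub("^.* ", "", d_raw)

# Same zero-padding idea as in the answer: 6 digits are unambiguous,
# 7 digits get a guessed leading zero on the month.
d_new <- ifelse(nchar(d_new) == 6,
  paste0("0", substr(d_new, 1, 1), "0", substr(d_new, 2, 6)), d_new)
d_new <- ifelse(nchar(d_new) == 7, paste0("0", d_new), d_new)

fmt   <- "%m%d%Y %H:%M"
d_try <- as.POSIXct(paste(d_new, t_part), format = fmt, tz = "UTC")

# Rows are "bad" while the parsed series lags behind the expected
# one-hour-per-row progression (first days of Oct, Nov, Dec).
step <- 3600
gaps <- as.numeric(diff(d_try), units = "secs")
bad  <- cumsum(c(step, gaps) - step) < 0

# Move the zero from the month to the day for the bad rows, then parse.
d_fix <- ifelse(bad,
  paste0(substr(d_new, 2, 3), "0", substr(d_new, 4, 8)), d_new)
d_out <- as.POSIXlt(paste(d_fix, t_part), format = fmt, tz = "UTC")

# Compare against the timestamps the fake data were generated from.
print(all(as.POSIXct(d_out) == t_true))
```

For a different sampling interval, step would need to be set to that interval in seconds; with irregular spacing this check would not apply as written.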