Question

I have a large df in which the dates were accidentally coerced into the wrong format.

Data:

id <- c(1:12)  
date <- c("2014-01-03","2001-08-14","2001-08-14","2014-06-02","2006-06-14", "2006-06-14",
          "2014-08-08","2014-08-08","2008-04-14","2009-12-13","2010-09-14","2012-09-14")
df <- data.frame(id,date)

Structure:

    id  date
1   1   2014-01-03
2   2   2001-08-14
3   3   2001-08-14
4   4   2014-06-02
5   5   2006-06-14
6   6   2006-06-14
7   7   2014-08-08
8   8   2014-08-08
9   9   2008-04-14
10  10  2009-12-13
11  11  2010-09-14
12  12  2012-09-14

The dataset only contains the years 2014 and 2013. The dates 2001-08-14 and 2006-06-14 are most likely 2014-08-01 and 2014-06-06, respectively.

Desired output:

    id  date
1   1   2014-01-03
2   2   2014-08-01
3   3   2014-08-01
4   4   2014-06-02
5   5   2014-06-06
6   6   2014-06-06
7   7   2014-08-08
8   8   2014-08-08
9   9   2014-04-08
10  10  2013-12-09
11  11  2014-09-10
12  12  2014-09-12

How can I sort out this mess?

Answer 1

The lubridate package has a convenient function, year, that is very useful here.

library(lubridate)

# Convert date to proper date class variable
df$date <- as.Date(df$date)

# Isolate problematic indices; when year is not in 2013 or 2014,
# we'll go to and from character representation. We'll trim
# the "20" in front of the "false year" and then specify the 
# proper format to read the character back into a Date class.

# year() returns a number, so compare against numeric years.
tmp.indices <- which(!year(df$date) %in% 2013:2014)
df$date[tmp.indices] <- as.Date(substring(as.character(df$date[tmp.indices]),
                                first = 3), format = "%d-%m-%y")

Result:

   id       date
1   1 2014-01-03
2   2 2014-08-01
3   3 2014-08-01
4   4 2014-06-02
5   5 2014-06-06
6   6 2014-06-06
7   7 2014-08-08
8   8 2014-08-08
9   9 2014-04-08
10 10 2013-12-09
11 11 2014-09-10
12 12 2014-09-12
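
To see why trimming the leading "20" and re-reading the rest as day-month-year works, here is a minimal sketch. It assumes (the question does not confirm this) that the original values were day-month-year strings with a two-digit year that got parsed as year-month-day; the variable names are only illustrative.

```r
# Assumption: the source values were "dd-mm-yy" strings.
original <- c("01-08-14", "06-06-14")

# Reading them as year-month-day reproduces the wrong dates from the question.
misread <- as.Date(original, format = "%y-%m-%d")
print(misread)

# Reading the same strings as day-month-year gives the intended dates.
intended <- as.Date(original, format = "%d-%m-%y")
print(intended)
```

Under that assumption, the last two digits of the false year are really the day, and the day field is really the two-digit year.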

Answer 2

We can convert the 'date' column to 'Date' class and extract the year to build a logical index ('indx') that flags rows whose year is not 2013 or 2014.

df$date <- as.Date(df$date)
indx <- !format(df$date, '%Y') %in% 2013:2014

Using lubridate, drop the first two characters and then convert back to 'Date' class with dmy.

library(lubridate)
df$date[indx] <- dmy(sub('^..', '', df$date[indx]))
df
#   id       date
#1   1 2014-01-03
#2   2 2014-08-01
#3   3 2014-08-01
#4   4 2014-06-02
#5   5 2014-06-06
#6   6 2014-06-06
#7   7 2014-08-08
#8   8 2014-08-08
#9   9 2014-04-08
#10 10 2013-12-09
#11 11 2014-09-10
#12 12 2014-09-12
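
The same idea can be wrapped in a small base R helper, without lubridate. This is only a sketch: the function name fix_misread_dates and its valid_years argument are hypothetical, and it assumes every out-of-range date is a day-month-year value that was read as year-month-day.

```r
# Hypothetical helper: re-read dates whose year falls outside valid_years,
# treating the printed "yy-mm-dd" text as "dd-mm-yy" instead.
fix_misread_dates <- function(dates, valid_years = 2013:2014) {
  dates <- as.Date(dates)
  bad <- !is.na(dates) & !(as.integer(format(dates, "%Y")) %in% valid_years)

  # Write the flagged dates with a two-digit year, then parse as day-month-year.
  reparsed <- as.Date(format(dates[bad], "%y-%m-%d"), format = "%d-%m-%y")

  # Keep the original value wherever the re-parse fails (gives NA).
  ok <- !is.na(reparsed)
  dates[which(bad)[ok]] <- reparsed[ok]
  dates
}

date <- c("2014-01-03", "2001-08-14", "2001-08-14", "2014-06-02",
          "2006-06-14", "2006-06-14", "2014-08-08", "2014-08-08",
          "2008-04-14", "2009-12-13", "2010-09-14", "2012-09-14")
df <- data.frame(id = 1:12, date = date)
df$date <- fix_misread_dates(df$date)
print(df)
```

Note that the helper does not check whether the re-parsed year lands in valid_years, so it is worth inspecting the result before overwriting the original column.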

Fixing dates that were coerced into the wrong format