I have a large df in which the dates were accidentally coerced into the wrong format.
Data:
id <- c(1:12)
date <- c("2014-01-03","2001-08-14","2001-08-14","2014-06-02","2006-06-14", "2006-06-14",
"2014-08-08","2014-08-08","2008-04-14","2009-12-13","2010-09-14","2012-09-14")
df <- data.frame(id,date)
Structure:
id date
1 1 2014-01-03
2 2 2001-08-14
3 3 2001-08-14
4 4 2014-06-02
5 5 2006-06-14
6 6 2006-06-14
7 7 2014-08-08
8 8 2014-08-08
9 9 2008-04-14
10 10 2009-12-13
11 11 2010-09-14
12 12 2012-09-14
The dataset contains only the years 2014 and 2013. The dates 2001-08-14 and 2006-06-14 are most likely 2014-08-01 and 2014-06-06, respectively.
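One plausible way such values arise, offered as an assumption rather than something stated in the question: strings written as day-month-year with a two-digit year were parsed as year-month-day. A minimal base R sketch of that effect (the `raw` values are illustrative):

```r
# Assumption: the original strings were dd-mm-yy, e.g. 1 August 2014.
raw <- c("01-08-14", "06-06-14")

# Parsing them as yy-mm-dd puts the day into the year field.
wrong <- as.Date(raw, format = "%y-%m-%d")

# Parsing them as dd-mm-yy recovers the intended dates.
right <- as.Date(raw, format = "%d-%m-%y")

format(wrong)  # "2001-08-14" "2006-06-14"
format(right)  # "2014-08-01" "2014-06-06"
```

If this is what happened, the fix is to drop the leading "20" and re-read the remaining characters as day-month-year.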
Desired output:
id date
1 1 2014-01-03
2 2 2014-08-01
3 3 2014-08-01
4 4 2014-06-02
5 5 2014-06-06
6 6 2014-06-06
7 7 2014-08-08
8 8 2014-08-08
9 9 2014-04-08
10 10 2013-12-09
11 11 2014-09-10
12 12 2014-09-12
How can I reconcile this mess?
Answer 0: (score: 3)
The lubridate package has a convenient function, year, which is very useful here.
library(lubridate)
# Convert date to proper date class variable
df$date <- as.Date(df$date)
# Isolate problematic indices; when year is not in 2013 or 2014,
# we'll go to and from character representation. We'll trim
# the "20" in front of the "false year" and then specify the
# proper format to read the character back into a Date class.
tmp.indices <- which(!year(df$date) %in% 2013:2014)
df$date[tmp.indices] <- as.Date(substring(as.character(df$date[tmp.indices]),
first = 3), format = "%d-%m-%y")
Result:
id date
1 1 2014-01-03
2 2 2014-08-01
3 3 2014-08-01
4 4 2014-06-02
5 5 2014-06-06
6 6 2014-06-06
7 7 2014-08-08
8 8 2014-08-08
9 9 2014-04-08
10 10 2013-12-09
11 11 2014-09-10
12 12 2014-09-12
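The same idea can be sketched in base R only, wrapped in a function so the set of valid years is a parameter. The name `fix_swapped_dates` and its `valid_years` argument are hypothetical, introduced here for illustration, and the input is a subset of the question's dates:

```r
# fix_swapped_dates() is a hypothetical helper, not part of any package.
fix_swapped_dates <- function(x, valid_years = 2013:2014) {
  x <- as.Date(x)
  yr <- as.integer(format(x, "%Y"))
  bad <- !is.na(x) & !(yr %in% valid_years)
  # Drop the leading century digits ("2001-08-14" -> "01-08-14"),
  # then re-read the remainder as day-month-year.
  x[bad] <- as.Date(substring(format(x[bad]), 3), format = "%d-%m-%y")
  x
}

date <- c("2014-01-03", "2001-08-14", "2006-06-14", "2009-12-13")
fixed <- fix_swapped_dates(date)
format(fixed)  # "2014-01-03" "2014-08-01" "2014-06-06" "2013-12-09"

# Sanity check: every repaired year should now be 2013 or 2014.
stopifnot(all(format(fixed, "%Y") %in% c("2013", "2014")))
```

The final check is worth keeping on real data: a row that still fails it was probably not a simple day/year swap.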
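Answer 1: (score: 2)
We can convert the 'date' column to the 'Date' class and extract the year to build a logical index ('indx') marking the rows whose year is not 2013 or 2014.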
df$date <- as.Date(df$date)
indx <- !format(df$date, '%Y') %in% 2013:2014
Using lubridate, remove the first two characters and then convert to the 'Date' class with dmy.
library(lubridate)
df$date[indx] <- dmy(sub('^..', '', df$date[indx]))
df
# id date
#1 1 2014-01-03
#2 2 2014-08-01
#3 3 2014-08-01
#4 4 2014-06-02
#5 5 2014-06-06
#6 6 2014-06-06
#7 7 2014-08-08
#8 8 2014-08-08
#9 9 2014-04-08
#10 10 2013-12-09
#11 11 2014-09-10
#12 12 2014-09-12
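One caveat with any year-based filter like the ones above: a swapped date whose day is 13 or 14 lands on a year that already looks valid, so it is not flagged. None of the rows in the question's sample data are affected, but on a larger df it may be worth listing such rows for manual review. A minimal sketch with made-up values (`x` below is illustrative, not from the question):

```r
# Hypothetical example values, not taken from the question's data.
x <- as.Date(c("2014-01-03", "2013-05-14", "2014-08-08", "2014-05-13"))

yr  <- as.integer(format(x, "%Y"))
day <- as.integer(format(x, "%d"))

# A stored date is ambiguous when its year is plausible AND its day
# could itself be a two-digit year (13 or 14).
ambiguous <- yr %in% 2013:2014 & day %in% c(13, 14)
which(ambiguous)  # 2 4
```

For those rows the data alone cannot say which reading is right, so some other column or source would be needed to decide.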