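A few months ago I wrote an R script, part of which converts character dates to Date format.
I originally ran into the problem of NAs being introduced when converting character to Date.
It was suggested that this happens because the day element of the date is expected to be two characters, e.g. June 12th 2018, and that NAs were introduced only where the day element consists of one character, e.g. June 2nd 2018.
The solution offered, as.Date(df$date, format='%B %d %Y'), worked well.
Until now.
Not only am I getting NA values, I also get this error:
Error: Duplicate identifiers for rows (12, 14), (13, 16)
I don't know what this means. Can someone explain?
This is the original data frame: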
    time.per.day                   Top.0.type                  Count
1   July 27th 2018, 00:00:00.000   conversation-archived       2
2   July 27th 2018, 00:00:00.000   conversation-archived       1
3   July 28th 2018, 00:00:00.000   conversation-archived       4
4   July 28th 2018, 00:00:00.000   conversation-archived       1
5   July 29th 2018, 00:00:00.000   conversation-archived       2
6   July 29th 2018, 00:00:00.000   conversation-archived       2
7   July 29th 2018, 00:00:00.000   conversation-auto-archived  2
8   July 30th 2018, 00:00:00.000   conversation-archived       3
9   July 30th 2018, 00:00:00.000   conversation-archived       2
10  July 30th 2018, 00:00:00.000   conversation-auto-archived  1
11  July 31st 2018, 00:00:00.000   conversation-archived       1
12  August 1st 2018, 00:00:00.000  conversation-archived       1
13  August 1st 2018, 00:00:00.000  conversation-auto-archived  1
14  August 2nd 2018, 00:00:00.000  conversation-archived       4
15  August 2nd 2018, 00:00:00.000  conversation-archived       1
16  August 2nd 2018, 00:00:00.000  conversation-auto-archived  2
This is the dput of the original data:
df <- structure(list(time.per.day = c("July 27th 2018, 00:00:00.000",
"July 27th 2018, 00:00:00.000", "July 28th 2018, 00:00:00.000",
"July 28th 2018, 00:00:00.000", "July 29th 2018, 00:00:00.000",
"July 29th 2018, 00:00:00.000", "July 29th 2018, 00:00:00.000",
"July 30th 2018, 00:00:00.000", "July 30th 2018, 00:00:00.000",
"July 30th 2018, 00:00:00.000", "July 31st 2018, 00:00:00.000",
"August 1st 2018, 00:00:00.000", "August 1st 2018, 00:00:00.000",
"August 2nd 2018, 00:00:00.000", "August 2nd 2018, 00:00:00.000",
"August 2nd 2018, 00:00:00.000"), Top.0.type = c("conversation-archived",
"conversation-archived", "conversation-archived", "conversation-archived",
"conversation-archived", "conversation-archived", "conversation-auto-archived",
"conversation-archived", "conversation-archived", "conversation-auto-archived",
"conversation-archived", "conversation-archived", "conversation-auto-archived",
"conversation-archived", "conversation-archived", "conversation-auto-archived"
), Count = c(2L, 1L, 4L, 1L, 2L, 2L, 2L, 3L, 2L, 1L, 1L, 1L,
1L, 4L, 1L, 2L)), class = "data.frame", row.names = c(NA, -16L
))
I rename the columns with colnames(df) <- c("date", "type", "retailer_code", "count") and manipulate the data in some ways. The problem now shows up when I call as.Date(df$date, format='%B %d %Y') after some other housekeeping:
# Remove time and identifiers from date column
df$date <- gsub(", 00:00:00.000", "", df$date)
df$date <- gsub("st", "", df$date)
df$date <- gsub("nd", "", df$date)
df$date <- gsub("rd", "", df$date)
df$date <- gsub("th", "", df$date)
This is the resulting data frame:
    date        type       count
1   2018-07-27  Completed  2
2   2018-07-27  Completed  1
3   2018-07-28  Completed  4
4   2018-07-28  Completed  1
5   2018-07-29  Completed  2
6   2018-07-29  Completed  2
7   2018-07-29  Missed     2
8   2018-07-30  Completed  3
9   2018-07-30  Completed  2
10  2018-07-30  Missed     1
11  2018-07-31  Completed  1
12  <NA>        Completed  1
13  <NA>        Missed     1
14  <NA>        Completed  4
15  <NA>        Completed  1
16  <NA>        Missed     2
This is the dput of the resulting data frame:
df <- structure(list(date = structure(c(17739, 17739, 17740, 17740,
17741, 17741, 17741, 17742, 17742, 17742, 17743, NA, NA, NA,
NA, NA), class = "Date"), type = c("Completed", "Completed",
"Completed", "Completed", "Completed", "Completed", "Missed",
"Completed", "Completed", "Missed", "Completed", "Completed",
"Missed", "Completed", "Completed", "Missed"), count = c(2L,
1L, 4L, 1L, 2L, 2L, 2L, 3L, 2L, 1L, 1L, 1L, 1L, 4L, 1L, 2L)), class = "data.frame", row.names = c(NA,
-16L))
Why is this a problem now?
It has been brought to my attention that df$date <- gsub("st", "", df$date) is turning August into Augu, which results in the NA values.
I changed it to df$date <- gsub("1st", "", df$date), but this now causes a different problem in the resulting data frame (rows 12-16 inclusive).
How can I fix this?
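Answer 0 (score: 2)
Initially it was
df$date <- gsub("st", "", df$date)
that caused the problem, because it matches the "st" in "August" as well as the one in "1st". To get around this, we only need to replace "1st" with "1":
df$date <- gsub("1st", "1", df$date)
and then convert to Date:
as.Date(df$date, "%B %d %Y")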
#[1] "2018-07-27" "2018-07-27" "2018-07-28" "2018-07-28" "2018-07-29" "2018-07-29"
#[7] "2018-07-29" "2018-07-30" "2018-07-30" "2018-07-30" "2018-07-31" "2018-08-01"
#[13] "2018-08-01" "2018-08-02" "2018-08-02" "2018-08-02"
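To make the difference between the three patterns concrete, here is a minimal sketch. The vector x is an illustrative sample modelled on the question's data (with the time part already removed), not the actual df$date column, and the variable names are made up for this example.

```r
# Illustrative inputs (assumption: time part already stripped).
x <- c("July 31st 2018", "August 1st 2018", "August 2nd 2018")

# Original pattern: also removes the "st" inside "August".
strip_st <- gsub("st", "", x)

# The question's second attempt: deletes the digit together with the suffix.
strip_1st <- gsub("1st", "", x)

# The answer's fix: keeps the digit and drops only the suffix.
keep_digit <- gsub("1st", "1", x)

print(strip_st)    # "July 31 2018"  "Augu 1 2018"    "Augu 2nd 2018"
print(strip_1st)   # "July 3 2018"   "August  2018"   "August 2nd 2018"
print(keep_digit)  # "July 31 2018"  "August 1 2018"  "August 2nd 2018"
```

With these inputs, the first variant produces "Augu", which is not a month name that %B can match. The second variant leaves "August  2018" with no day at all, and it also turns "31st" into "3", which would parse without any NA but give the wrong date.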
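Ideally, hardcoding replacement values like this is not a good idea, because it can cause exactly this kind of problem. Instead of four separate gsub calls, we can remove the suffix only when a number is followed by an ordinal indicator.
So after
df$date <- sub(", 00:00:00.000", "", df$date)
we can directly do
df$date <- sub("(\\d+)(st|nd|rd|th)\\b", "\\1", df$date)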
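One further point that may be worth keeping in mind: %B in as.Date matches month names in the current locale, so the conversion above relies on an English locale. The sketch below is one possible locale-independent variant, not the answer's method. The helper name parse_ordinal_date is hypothetical, and it assumes every input looks like "Month DDth YYYY" optionally followed by a time part, with English month names.

```r
# Hypothetical helper (not part of the original answer).
# Assumes inputs like "August 1st 2018, 00:00:00.000" with English month names.
parse_ordinal_date <- function(x) {
  # Month name, day digits, ordinal suffix, four-digit year; the rest is ignored.
  pattern <- "^([A-Za-z]+) (\\d+)(st|nd|rd|th) (\\d{4}).*$"
  matched <- grepl(pattern, x)

  # month.name is a base R constant holding the English month names,
  # so this lookup does not depend on the session locale.
  month <- match(sub(pattern, "\\1", x), month.name)
  day   <- sub(pattern, "\\2", x)
  year  <- sub(pattern, "\\4", x)

  # Build a numeric year-month-day string; inputs that do not match become NA.
  iso <- ifelse(matched, paste(year, month, day, sep = "-"), NA_character_)
  as.Date(iso, format = "%Y-%m-%d")
}

dates <- parse_ordinal_date(c("July 31st 2018, 00:00:00.000",
                              "August 1st 2018, 00:00:00.000"))
print(dates)

# Fail early instead of carrying NA dates into later steps.
stopifnot(!anyNA(dates))
```

Checking for NA right after the conversion is a design choice rather than a requirement. It surfaces a parsing problem at the point where it happens instead of in a later step.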