Question

A few months ago I wrote an R script, part of which converts character dates to Date format.

I originally ran into a problem where NAs were introduced when I converted the characters to Date format.

Someone suggested this happened because the conversion expects the day element of the date to be two characters, for example June 12th 2018, and that NAs only appeared when the day element contained one character, for example June 2nd 2018.

The solution provided (as.Date(df$date, format='%B %d %Y')) worked well.

Until now.

Not only am I getting NA values, I am also receiving this error: Error: Duplicate identifiers for rows (12, 14), (13, 16).

I don't know what this means. Can someone explain?

This is the original data frame:

                    time.per.day                 Top.0.type Count
1   July 27th 2018, 00:00:00.000      conversation-archived     2
2   July 27th 2018, 00:00:00.000      conversation-archived     1
3   July 28th 2018, 00:00:00.000      conversation-archived     4
4   July 28th 2018, 00:00:00.000      conversation-archived     1
5   July 29th 2018, 00:00:00.000      conversation-archived     2
6   July 29th 2018, 00:00:00.000      conversation-archived     2
7   July 29th 2018, 00:00:00.000 conversation-auto-archived     2
8   July 30th 2018, 00:00:00.000      conversation-archived     3
9   July 30th 2018, 00:00:00.000      conversation-archived     2
10  July 30th 2018, 00:00:00.000 conversation-auto-archived     1
11  July 31st 2018, 00:00:00.000      conversation-archived     1
12 August 1st 2018, 00:00:00.000      conversation-archived     1
13 August 1st 2018, 00:00:00.000 conversation-auto-archived     1
14 August 2nd 2018, 00:00:00.000      conversation-archived     4
15 August 2nd 2018, 00:00:00.000      conversation-archived     1
16 August 2nd 2018, 00:00:00.000 conversation-auto-archived     2

This is the dput of the original data:

df <- structure(list(time.per.day = c("July 27th 2018, 00:00:00.000", "July 27th 2018, 00:00:00.000", "July 28th 2018, 00:00:00.000", "July 28th 2018, 00:00:00.000", "July 29th 2018, 00:00:00.000", "July 29th 2018, 00:00:00.000", "July 29th 2018, 00:00:00.000", "July 30th 2018, 00:00:00.000", "July 30th 2018, 00:00:00.000", "July 30th 2018, 00:00:00.000", "July 31st 2018, 00:00:00.000", "August 1st 2018, 00:00:00.000", "August 1st 2018, 00:00:00.000", "August 2nd 2018, 00:00:00.000", "August 2nd 2018, 00:00:00.000", "August 2nd 2018, 00:00:00.000"), Top.0.type = c("conversation-archived", "conversation-archived", "conversation-archived", "conversation-archived", "conversation-archived", "conversation-archived", "conversation-auto-archived", "conversation-archived", "conversation-archived", "conversation-auto-archived", "conversation-archived", "conversation-archived", "conversation-auto-archived", "conversation-archived", "conversation-archived", "conversation-auto-archived" ), Count = c(2L, 1L, 4L, 1L, 2L, 2L, 2L, 3L, 2L, 1L, 1L, 1L, 1L, 4L, 1L, 2L)), class = "data.frame", row.names = c(NA, -16L ))

I rename the columns (colnames(df) <- c("date", "type", "retailer_code", "count")) and process the data a little, but things now go wrong when I use as.Date(df$date, format='%B %d %Y') after some other cleanup:

# Remove time and identifiers from date column
df$date <- gsub(", 00:00:00.000", "", df$date)
df$date <- gsub("st", "", df$date)
df$date <- gsub("nd", "", df$date)
df$date <- gsub("rd", "", df$date)
df$date <- gsub("th", "", df$date)

This is the resulting data frame:

         date      type count
1  2018-07-27 Completed     2
2  2018-07-27 Completed     1
3  2018-07-28 Completed     4
4  2018-07-28 Completed     1
5  2018-07-29 Completed     2
6  2018-07-29 Completed     2
7  2018-07-29    Missed     2
8  2018-07-30 Completed     3
9  2018-07-30 Completed     2
10 2018-07-30    Missed     1
11 2018-07-31 Completed     1
12       <NA> Completed     1
13       <NA>    Missed     1
14       <NA> Completed     4
15       <NA> Completed     1
16       <NA>    Missed     2

This is the dput of the resulting data frame:

df <- structure(list(date = structure(c(17739, 17739, 17740, 17740, 17741, 17741, 17741, 17742, 17742, 17742, 17743, NA, NA, NA, NA, NA), class = "Date"), type = c("Completed", "Completed", "Completed", "Completed", "Completed", "Completed", "Missed", "Completed", "Completed", "Missed", "Completed", "Completed", "Missed", "Completed", "Completed", "Missed"), count = c(2L, 1L, 4L, 1L, 2L, 2L, 2L, 3L, 2L, 1L, 1L, 1L, 1L, 4L, 1L, 2L)), class = "data.frame", row.names = c(NA, -16L))

Why is it going wrong now?

It was brought to my attention that df$date <- gsub("st", "", df$date) is turning August into Augu, which is what produces the NA values.

I changed it to df$date <- gsub("1st", "", df$date), but this now causes other problems in rows 12-16 (inclusive) of the resulting data frame.

How can I fix this?

Answer 1

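Initially,

df$date <- gsub("st", "", df$date)

caused the problem because it matches the "st" in "August" as well as the one in "1st". To get around this, we only need to replace "1st" together with its digit, turning "1st" into "1":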

df$date <- gsub("1st", "1", df$date)

and then convert to a date:

as.Date(df$date, "%B %d %Y")

#[1]  "2018-07-27" "2018-07-27" "2018-07-28" "2018-07-28" "2018-07-29" "2018-07-29"
#[7]  "2018-07-29" "2018-07-30" "2018-07-30" "2018-07-30" "2018-07-31" "2018-08-01"
#[13] "2018-08-01" "2018-08-02" "2018-08-02" "2018-08-02"
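To see why the original pattern fails, here is a minimal sketch on three sample strings modeled on the question's data with the time part already removed. The variable names (x, too_greedy, drops_day, keeps_day) are illustrative and not part of the original script.

```r
# Sample strings modeled on the question's data (time part already stripped)
x <- c("July 31st 2018", "August 1st 2018", "August 2nd 2018")

# The original cleanup: removes every "st", including the one inside "August"
too_greedy <- gsub("st", "", x)

# The asker's second attempt: removes the day digit along with the suffix
drops_day <- gsub("1st", "", x)

# The fix from this answer: keep the digit, drop only the suffix
keeps_day <- gsub("1st", "1", x)

print(too_greedy)  # "July 31 2018" "Augu 1 2018" "Augu 2nd 2018"
print(drops_day)   # "July 3 2018" "August  2018" "August 2nd 2018"
print(keeps_day)   # "July 31 2018" "August 1 2018" "August 2nd 2018"
```

The "2nd" suffix is untouched here because the sketch only exercises the "st" step; in the full script the remaining gsub calls for "nd", "rd" and "th" still run afterwards.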

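Ideally, hard-coding replacement values like this is not a good idea, because it can cause exactly this kind of problem. Instead of four separate sub calls, we can strip the suffix only when it follows a number.

So after

df$date <- sub(", 00:00:00.000", "", df$date)

we can directly do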

df$date <- sub("(\\d+)(st|nd|rd|th)\\b", "\\1", df$date)
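Put together, the two steps and the conversion might look like the sketch below. It is self-contained and works on a small made-up vector (raw) in the same format as the question's data rather than on df itself. The names raw, no_time, clean and dates are assumptions for illustration. Because %B matches month names in the current locale, the sketch also assumes English month names and sets LC_TIME to "C" to make that explicit.

```r
# Assumption: month names are English, so force the "C" time locale for %B
invisible(Sys.setlocale("LC_TIME", "C"))

# Made-up sample in the same format as the question's time.per.day column
raw <- c("July 27th 2018, 00:00:00.000",
         "July 31st 2018, 00:00:00.000",
         "August 1st 2018, 00:00:00.000",
         "August 2nd 2018, 00:00:00.000",
         "August 3rd 2018, 00:00:00.000")

# Step 1: drop the constant time part (fixed = TRUE treats it as a literal string)
no_time <- sub(", 00:00:00.000", "", raw, fixed = TRUE)

# Step 2: remove an ordinal suffix only when it directly follows digits
clean <- sub("(\\d+)(st|nd|rd|th)\\b", "\\1", no_time, perl = TRUE)

# Step 3: parse "Month day year" into Date
dates <- as.Date(clean, format = "%B %d %Y")

print(clean)
print(dates)
stopifnot(!anyNA(dates))  # fail loudly if any string did not parse
```

Anchoring the suffix to the preceding digits is what keeps "August" intact, and the final stopifnot check would surface a new unparsable format immediately instead of letting NA values flow into later steps.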

