Why is my date conversion solution no longer working despite no changes?

Time: 2018-08-03 08:37:15

Tags: r date date-conversion

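A few months ago I wrote an R script, part of which converts character dates to Date format.

I originally ran into a problem where NAs were introduced when I converted the character values to Date format.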

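It was suggested that this happened because the conversion expects the day element of the date to be two characters, e.g. June 12th 2018, and that NAs were introduced only when the day element contained one character, e.g. June 2nd 2018.

The solution provided (as.Date(df$date, format='%B %d %Y')) worked well.

Until now.

Not only do I get NA values, I also receive this error:

Error: Duplicate identifiers for rows (12, 14), (13, 16)

I don't know what this means. Can someone explain?

Here is the original data frame: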

                    time.per.day                 Top.0.type Count
1   July 27th 2018, 00:00:00.000      conversation-archived     2
2   July 27th 2018, 00:00:00.000      conversation-archived     1
3   July 28th 2018, 00:00:00.000      conversation-archived     4
4   July 28th 2018, 00:00:00.000      conversation-archived     1
5   July 29th 2018, 00:00:00.000      conversation-archived     2
6   July 29th 2018, 00:00:00.000      conversation-archived     2
7   July 29th 2018, 00:00:00.000 conversation-auto-archived     2
8   July 30th 2018, 00:00:00.000      conversation-archived     3
9   July 30th 2018, 00:00:00.000      conversation-archived     2
10  July 30th 2018, 00:00:00.000 conversation-auto-archived     1
11  July 31st 2018, 00:00:00.000      conversation-archived     1
12 August 1st 2018, 00:00:00.000      conversation-archived     1
13 August 1st 2018, 00:00:00.000 conversation-auto-archived     1
14 August 2nd 2018, 00:00:00.000      conversation-archived     4
15 August 2nd 2018, 00:00:00.000      conversation-archived     1
16 August 2nd 2018, 00:00:00.000 conversation-auto-archived     2

Here is the original data as dput output:

df <- structure(list(time.per.day = c("July 27th 2018, 00:00:00.000",
"July 27th 2018, 00:00:00.000", "July 28th 2018, 00:00:00.000",
"July 28th 2018, 00:00:00.000", "July 29th 2018, 00:00:00.000",
"July 29th 2018, 00:00:00.000", "July 29th 2018, 00:00:00.000",
"July 30th 2018, 00:00:00.000", "July 30th 2018, 00:00:00.000",
"July 30th 2018, 00:00:00.000", "July 31st 2018, 00:00:00.000",
"August 1st 2018, 00:00:00.000", "August 1st 2018, 00:00:00.000",
"August 2nd 2018, 00:00:00.000", "August 2nd 2018, 00:00:00.000",
"August 2nd 2018, 00:00:00.000"), Top.0.type = c("conversation-archived",
"conversation-archived", "conversation-archived", "conversation-archived",
"conversation-archived", "conversation-archived", "conversation-auto-archived",
"conversation-archived", "conversation-archived", "conversation-auto-archived",
"conversation-archived", "conversation-archived", "conversation-auto-archived",
"conversation-archived", "conversation-archived", "conversation-auto-archived"
), Count = c(2L, 1L, 4L, 1L, 2L, 2L, 2L, 3L, 2L, 1L, 1L, 1L, 1L, 4L, 1L, 2L
)), class = "data.frame", row.names = c(NA, -16L))

I rename the columns (colnames(df) <- c("date", "type", "retailer_code", "count")) and process the data somewhat. Before calling as.Date(df$date, format='%B %d %Y'), I also do some other housekeeping:

# Remove time and identifiers from date column
df$date <- gsub(", 00:00:00.000", "", df$date)
df$date <- gsub("st", "", df$date)
df$date <- gsub("nd", "", df$date)
df$date <- gsub("rd", "", df$date)
df$date <- gsub("th", "", df$date)

Here is the resulting data frame:

         date      type count
1  2018-07-27 Completed     2
2  2018-07-27 Completed     1
3  2018-07-28 Completed     4
4  2018-07-28 Completed     1
5  2018-07-29 Completed     2
6  2018-07-29 Completed     2
7  2018-07-29    Missed     2
8  2018-07-30 Completed     3
9  2018-07-30 Completed     2
10 2018-07-30    Missed     1
11 2018-07-31 Completed     1
12       <NA> Completed     1
13       <NA>    Missed     1
14       <NA> Completed     4
15       <NA> Completed     1
16       <NA>    Missed     2

Here is the dput of the resulting data frame:

df <- structure(list(date = structure(c(17739, 17739, 17740, 17740,
17741, 17741, 17741, 17742, 17742, 17742, 17743, NA, NA, NA, NA, NA
), class = "Date"), type = c("Completed", "Completed", "Completed",
"Completed", "Completed", "Completed", "Missed", "Completed", "Completed",
"Missed", "Completed", "Completed", "Missed", "Completed", "Completed",
"Missed"), count = c(2L, 1L, 4L, 1L, 2L, 2L, 2L, 3L, 2L, 1L, 1L, 1L, 1L,
4L, 1L, 2L)), class = "data.frame", row.names = c(NA, -16L))

Why has this now gone wrong?


It was brought to my attention that df$date <- gsub("st", "", df$date) was converting August to Augu, which is what caused the NA values.

I changed it to df$date <- gsub("1st", "", df$date), but this now causes other problems in rows 12-16 (inclusive) of the resulting data frame.

How can I resolve this?

1 answer:

Answer 0: (score: 2)

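Initially, it was

df$date <- gsub("st", "", df$date)

that caused the problem, because it matches the "st" in "August" as well as the "st" in "1st". To overcome this, we only need to replace "1st" with "1", so that the day digit is kept:

df$date <- gsub("1st", "1", df$date)

and then convert to Date:

as.Date(df$date, "%B %d %Y")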

#[1]  "2018-07-27" "2018-07-27" "2018-07-28" "2018-07-28" "2018-07-29" "2018-07-29"
#[7]  "2018-07-29" "2018-07-30" "2018-07-30" "2018-07-30" "2018-07-31" "2018-08-01"
#[13] "2018-08-01" "2018-08-02" "2018-08-02" "2018-08-02"
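
The difference between the three substitutions can be checked on two of the values from the question. This is only a minimal sketch: the vector x and the variable names broken, dropped and fixed are made up for illustration, and it assumes month names are parsed in an English (or C) locale, which is why it sets LC_TIME first.

```r
# Assumption: %B must match English month names, so force the C locale.
invisible(Sys.setlocale("LC_TIME", "C"))

# Two sample values taken from the question's data (time part already removed).
x <- c("July 31st 2018", "August 1st 2018")

# Removing every "st" also removes the "st" inside "August".
broken <- gsub("st", "", x)
print(broken)
# "July 31 2018" "Augu 1 2018"

# "Augu" is not a month name, so the second element becomes NA.
print(as.Date(broken, format = "%B %d %Y"))

# Deleting "1st" outright also deletes the day digit 1.
dropped <- gsub("1st", "", x)
print(dropped)
# "July 3 2018" "August  2018"

# Replacing "1st" with "1" keeps the digit and leaves "August" intact.
fixed <- gsub("1st", "1", x)
print(as.Date(fixed, format = "%B %d %Y"))
# "2018-07-31" "2018-08-01"
```

Note that the dropped variant turns July 31st into July 3, which as.Date would parse without complaint, so that kind of mistake can go unnoticed.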

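Ideally, hardcoding the values to replace is not a good idea, because it leads to exactly this kind of problem. Instead of 4 separate gsub calls, we can remove the ordinal suffix only when it follows a number.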

So after

df$date <- sub(", 00:00:00.000", "", df$date)

we can directly do

df$date <- sub("(\\d+)(st|nd|rd|th)\\b", "\\1", df$date)
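
Putting the two sub calls and the conversion together might look like the sketch below. The helper name parse_ordinal_date and the vector raw are hypothetical, chosen only for illustration; the input values are copied from the question. It again assumes an English (or C) locale for month names, and it uses fixed = TRUE for the time suffix so that the dots are matched literally.

```r
# Assumption: %B must match English month names, so force the C locale.
invisible(Sys.setlocale("LC_TIME", "C"))

# Hypothetical helper: strip the time part and the ordinal suffix, then parse.
parse_ordinal_date <- function(x) {
  # Drop the constant time suffix; fixed = TRUE treats the pattern literally.
  x <- sub(", 00:00:00.000", "", x, fixed = TRUE)
  # Remove st/nd/rd/th only when it directly follows digits, keeping the digits.
  x <- sub("(\\d+)(st|nd|rd|th)\\b", "\\1", x)
  as.Date(x, format = "%B %d %Y")
}

# Sample values copied from the question's time.per.day column.
raw <- c("July 27th 2018, 00:00:00.000",
         "July 31st 2018, 00:00:00.000",
         "August 1st 2018, 00:00:00.000",
         "August 2nd 2018, 00:00:00.000")

parsed <- parse_ordinal_date(raw)
print(parsed)
# "2018-07-27" "2018-07-31" "2018-08-01" "2018-08-02"
```

Because the pattern requires at least one digit before the suffix, the "st" in "August" is never touched.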