Data frame with mixed date formats

Time: 2020-06-25 16:03:34

Tags: r data-science data-munging data-wrangling

I want to convert all of the mixed date formats to a single format, for example d-m-y.

Here is the data frame:

x <- data.frame("Name" = c("A","B","C","D","E"), "Birthdate" = c("36085.0","2001-sep-12","Feb-18-2005","05/27/84", "2020-6-25"))

I tried the following code, but it returns NAs:

newdateformat <- as.Date(x$Birthdate,
  format = "%m%d%y", origin = "2020-6-25")

newdateformat

Then I tried parsing with lubridate, but that also gives NA for some values, meaning the parse failed:

require(lubridate)
parse_date_time(my_data$Birthdate, orders = c("ymd", "mdy"))

[1] NA               NA               "2001-09-12 UTC" NA
[5] "2005-02-18 UTC"

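I also found that the first date in the data frame is in the format "36085.0". I did find the code below, but I still don't understand what the number means or what "origin" means: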

dates <- c(30829, 38540)
betterDates <- as.Date(dates,
  origin = "1899-12-30")

P.S. I'm very new to R, so I'd appreciate it if you could keep the explanation simple. Thank you.

1 Answer:

Answer 0: (score: 0)

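You should parse each format separately. For each format, use a regular expression to select the relevant rows and convert only those rows, then move on to the next format. I'll give the answer using data.table rather than data.frame, because I've forgotten how to use data.frame.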
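For readers who would rather stay with the data.frame from the question, the same one-format-at-a-time idea can be sketched with logical indexing. This is a minimal sketch, not the answer's own code: it assumes lubridate is installed, reuses the column name Birthdate1 from the data.table code below, keeps that code's 1904-01-01 origin assumption for the numeric value, and introduces a hypothetical Birthdate_dmy column for the d-m-y text the question asks for.

```r
library(lubridate)

x <- data.frame(
  Name = c("A", "B", "C", "D", "E"),
  Birthdate = c("36085.0", "2001-sep-12", "Feb-18-2005", "05/27/84", "2020-6-25"),
  stringsAsFactors = FALSE
)

# start with an all-NA Date column and fill it one format at a time
x$Birthdate1 <- as.Date(NA)

# year-first strings such as "2001-sep-12" and "2020-6-25"
is_ymd <- grepl("^[0-9]{4}-", x$Birthdate)
x$Birthdate1[is_ymd] <- ymd(x$Birthdate[is_ymd])

# purely numeric strings such as "36085.0" (the origin is an assumption)
is_serial <- grepl("^[0-9.]+$", x$Birthdate)
x$Birthdate1[is_serial] <- as.Date(as.numeric(x$Birthdate[is_serial]),
                                   origin = "1904-01-01")

# whatever is still missing is assumed to be month-day-year
is_rest <- is.na(x$Birthdate1)
x$Birthdate1[is_rest] <- mdy(x$Birthdate[is_rest])

# hypothetical extra column: the parsed dates written out as d-m-y text
x$Birthdate_dmy <- format(x$Birthdate1, "%d-%m-%Y")
```

The data.table code that follows performs the same three steps with `:=`.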

library(lubridate)
library(data.table)
x = data.table("Name" = c("A","B","C","D","E"),
  "Birthdate" = c("36085.0","2001-sep-12","Feb-18-2005","05/27/84", "2020-6-25"))
# or use setDT(x) to convert an existing data.frame to a data.table

# handle dates like "2001-sep-12" and "2020-6-25"
# this regex matches strings beginning with four numbers and then a dash
x[grepl('^[0-9]{4}-',Birthdate),Birthdate1:=ymd(Birthdate)]

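# handle dates like "36085.0": a spreadsheet serial number, i.e. days counted from an origin date
# this assumes Excel's 1904 date system; for the 1900 system use origin='1899-12-30' instead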
# see https://docs.microsoft.com/en-us/office/troubleshoot/excel/1900-and-1904-date-system
# this regex matches strings that only have numeric characters and .
x[grepl('^[0-9\\.]+$',Birthdate),Birthdate1:=as.Date(as.numeric(Birthdate),origin='1904-01-01')]

# assume the rest are like "Feb-18-2005" and "05/27/84" and handle those
x[is.na(Birthdate1),Birthdate1:=mdy(Birthdate)]

# result

> x
   Name   Birthdate Birthdate1
1:    A     36085.0 2002-10-18
2:    B 2001-sep-12 2001-09-12
3:    C Feb-18-2005 2005-02-18
4:    D    05/27/84 1984-05-27
5:    E   2020-6-25 2020-06-25
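
On the question of what the number and "origin" mean: a value like 36085 is a count of days, and origin is the date that counts as day zero. Spreadsheet exports usually use one of two origins, so the same number can map to two different dates. The sketch below compares them. The variable names are illustrative, and which origin is right for this data is an assumption that has to be checked against the spreadsheet the value came from.

```r
# the numeric-looking birthdate from the question, converted to a number
serial <- as.numeric("36085.0")

# Excel's 1900 date system (the Windows default); this is the origin
# used in the snippet quoted in the question
date_1900 <- as.Date(serial, origin = "1899-12-30")
date_1900  # 1998-10-17

# Excel's 1904 date system (older Mac workbooks)
date_1904 <- as.Date(serial, origin = "1904-01-01")
date_1904  # 2002-10-18

# the two systems are always this many days apart
as.numeric(date_1904 - date_1900)  # 1462
```

If a birthdate in 2002 looks wrong for person A, switching the origin to "1899-12-30" gives 1998 instead.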