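Lubridate fails to correctly parse dates in weekday/month/day/year format

Asked: 2018-10-20 11:22:16

Tags: r regex lubridate stringr

Question

I downloaded a dataset from a website, and its date column is formatted as follows: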

x <- c("Fri, Mar 1, 2019", "Sat, Mar 2, 2019", "Sun, Mar 3, 2019", "Mon, Mar 4, 2019", "Tue, Mar 5, 2019", "Wed, Mar 6, 2019", "Thu, Mar 7, 2019", "Fri, Mar 8, 2019", "Sat, Mar 9, 2019", "Sun, Mar 10, 2019", "Mon, Mar 11, 2019", "Tue, Mar 12, 2019", "Wed, Mar 13, 2019", "Thu, Mar 14, 2019", "Fri, Mar 15, 2019", "Sat, Mar 16, 2019", "Sun, Mar 17, 2019", "Mon, Mar 18, 2019", "Tue, Mar 19, 2019", "Wed, Mar 20, 2019", "Thu, Mar 21, 2019", "Fri, Mar 22, 2019", "Sat, Mar 23, 2019", "Sun, Mar 24, 2019", "Mon, Mar 25, 2019",  "Tue, Mar 26, 2019", "Wed, Mar 27, 2019", "Thu, Mar 28, 2019", "Fri, Mar 29, 2019", "Sat, Mar 30, 2019", "Sun, Mar 31, 2019")

It contains the dates from March 1 to March 31. I tried to convert it to Date format, so I used the mdy function from lubridate:

library("lubridate")
mdy(x)

which produces the following vector:

 [1] "2019-03-01" "2019-03-02" "2019-03-20" "2019-04-20" "2019-05-20" "2019-03-06"
 [7] "2019-03-07" "2019-03-08" "2019-03-09" "2019-10-20" "2019-11-20" "2019-12-20"
[13] "2019-03-13" "2019-03-14" "2019-03-15" "2019-03-16" "2019-03-17" "2019-03-18"
[19] "2019-03-19" "2019-03-20" "2019-03-21" "2019-03-22" "2019-03-23" "2019-03-24"
[25] "2019-03-25" "2019-03-26" "2019-03-27" "2019-03-28" "2019-03-29" "2019-03-30"
[31] "2019-03-31"

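As you can see, most of the dates are correct, but it fails for the 4th, 5th, 10th, 11th and 12th of the month, where the day is parsed as if it were the month. I have tried several solutions, but none has worked so far.

Some attempted solutions that did not work

Using a regular expression to remove the weekday from the character vector:

I thought one way to solve this would be to remove the weekday part of each string, so I tried to delete everything before the comma, but I could not quite get it right: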

library(stringr)
# x is the character vector defined above; this keeps everything from the first comma onward
y <- str_extract(x, ",.*$")
y
 [1] ", Mar 1, 2019"  ", Mar 2, 2019"  ", Mar 3, 2019"  ", Mar 4, 2019" 
 [5] ", Mar 5, 2019"  ", Mar 6, 2019"  ", Mar 7, 2019"  ", Mar 8, 2019" 
 [9] ", Mar 9, 2019"  ", Mar 10, 2019" ", Mar 11, 2019" ", Mar 12, 2019"
[13] ", Mar 13, 2019" ", Mar 14, 2019" ", Mar 15, 2019" ", Mar 16, 2019"
[17] ", Mar 17, 2019" ", Mar 18, 2019" ", Mar 19, 2019" ", Mar 20, 2019"
[21] ", Mar 21, 2019" ", Mar 22, 2019" ", Mar 23, 2019" ", Mar 24, 2019"
[25] ", Mar 25, 2019" ", Mar 26, 2019" ", Mar 27, 2019" ", Mar 28, 2019"
[29] ", Mar 29, 2019" ", Mar 30, 2019" ", Mar 31, 2019"

But now, when I use mdy, the first 12 days are all wrong:

mdy(y)

 [1] "2019-01-20" "2019-02-20" "2019-03-20" "2019-04-20" "2019-05-20" "2019-06-20"
 [7] "2019-07-20" "2019-08-20" "2019-09-20" "2019-10-20" "2019-11-20" "2019-12-20"
[13] "2019-03-13" "2019-03-14" "2019-03-15" "2019-03-16" "2019-03-17" "2019-03-18"
[19] "2019-03-19" "2019-03-20" "2019-03-21" "2019-03-22" "2019-03-23" "2019-03-24"
[25] "2019-03-25" "2019-03-26" "2019-03-27" "2019-03-28" "2019-03-29" "2019-03-30"
[31] "2019-03-31"
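
For reference, here is a minimal sketch of how the weekday prefix could be stripped cleanly and the remainder parsed without relying on locale-dependent month names. It is not the approach taken in this thread; it assumes the month abbreviations are always English (so they can be matched against base R's month.abb), and the variable names are purely illustrative.

```r
x <- c("Fri, Mar 1, 2019", "Mon, Mar 4, 2019", "Sun, Mar 10, 2019")

# Remove everything up to and including the first comma and the spaces after it
no_weekday <- sub("^[^,]*, *", "", x)

# Split e.g. "Mar 4, 2019" into month abbreviation, day and year
pieces <- strsplit(gsub(",", "", no_weekday), " +")
month  <- match(vapply(pieces, `[`, character(1), 1), month.abb)  # assumes English abbreviations
day    <- as.integer(vapply(pieces, `[`, character(1), 2))
year   <- as.integer(vapply(pieces, `[`, character(1), 3))

# Build purely numeric ISO 8601 strings, which as.Date() parses
# without consulting any month or weekday names
dates <- as.Date(sprintf("%04d-%02d-%02d", year, month, day))
dates
```

Because the month is looked up in month.abb rather than parsed by name, the result should not depend on LC_TIME.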

Any ideas on how to solve this?

SessionInfo

I added the SessionInfo as requested:

R version 3.4.4 (2018-03-15) 
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=es_CL.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=es_CL.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=es_CL.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=es_CL.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] stringr_1.3.1   dplyr_0.7.6     rvest_0.3.2     xml2_1.2.0      XML_3.98-1.16  
[6] lubridate_1.7.4

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.18     rstudioapi_0.7   knitr_1.20       bindr_0.1.1     
 [5] magrittr_1.5     tidyselect_0.2.4 R6_2.2.2         rlang_0.2.2     
 [9] httr_1.3.1       tools_3.4.4      pacman_0.4.6     selectr_0.4-1   
[13] htmltools_0.3.6  yaml_2.2.0       rprojroot_1.3-2  digest_0.6.17   
[17] assertthat_0.2.0 tibble_1.4.2     crayon_1.3.4     bindrcpp_0.2.2  
[21] purrr_0.2.5      curl_3.2         glue_1.3.0       evaluate_0.11   
[25] rmarkdown_1.10   stringi_1.2.4    pillar_1.3.0     compiler_3.4.4  
[29] backports_1.1.2  pkgconfig_2.0.2 

1 answer:

Answer 0 (score: 2)

As @duckmayr suspected, this was a locale issue. As shown in my sessionInfo, my locale was set as follows:

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=es_CL.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=es_CL.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=es_CL.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=es_CL.UTF-8 LC_IDENTIFICATION=C  
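
A quick way to see which weekday and month names the current session uses is sketched below. This is only a diagnostic aid with illustrative variable names; the exact strings printed depend on the operating system's locale data, but under es_CL.UTF-8 they would be expected to be Spanish abbreviations rather than "Mon" and "Mar".

```r
# Locale currently used for date and time names
time_locale <- Sys.getlocale("LC_TIME")
time_locale

# Abbreviated weekday and month names as this session spells them
# for Monday, 4 March 2019
names_here <- format(as.Date("2019-03-04"), "%a %b")
names_here
```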

Everything was fixed once I changed LC_TIME to en_US.UTF-8, which I did with:

Sys.setlocale("LC_TIME", 'en_US.UTF-8')

After that, mdy works correctly. I hope this helps anyone who runs into a similar problem in the future.
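
As a possible alternative to changing the session-wide setting, lubridate's parsers accept a locale argument (which defaults to the current LC_TIME), so the English names can be requested for a single call. The sketch below assumes the en_US.UTF-8 locale is installed on the system; the parsed variable name is illustrative.

```r
library(lubridate)

x <- c("Mon, Mar 4, 2019", "Sun, Mar 10, 2019")

# Match month and weekday names in English for this call only;
# the session's LC_TIME is left as it was.
# Assumption: the "en_US.UTF-8" locale is available on this machine.
parsed <- mdy(x, locale = "en_US.UTF-8")
parsed
```

This keeps other locale-dependent behaviour in the session, such as how dates are printed with format(), unchanged.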