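I have a dataset of 130 football matches, going back to 1893.
When I import the dataset from Excel, dates from 1900 onward come through in R (RStudio) unchanged. Dates before 1900, however, become NA.
How can I fix this so that all dates come through from Excel in the correct format?
Alternatively, how can I replace the NAs with the correct (18XX-MM-DD) dates?
This is the data as shown in Excel: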
Home_Team, Away_Team, SUFC, SWFC, H, A, Score, Season, Date
Sheffield United, Sheffield Wednesday, 1, 1, 1, 1, 1–1, 1893/94, 1893-10-16
Sheffield United, Sheffield Wednesday, 1, 0, 1, 0, 1–0, 1894/95, 1895-01-12
Sheffield United, Sheffield Wednesday, 1, 1, 1, 1, 1–1, 1895/96, 1895-12-26
Sheffield United, Sheffield Wednesday, 2, 0, 2, 0, 2–0, 1896/97, 1896-12-26
Sheffield United, Sheffield Wednesday, 1, 1, 1, 1, 1–1, 1897/98, 1897-12-27
Sheffield United, Sheffield Wednesday, 2, 1, 2, 1, 2–1, 1898/99, 1898-12-26
Sheffield United, Sheffield Wednesday, 1, 0, 1, 0, 1–0, 1900/01, 1900-12-15
Sheffield United, Sheffield Wednesday, 3, 0, 3, 0, 3–0, 1901/02, 1902-03-01
Sheffield United, Sheffield Wednesday, 2, 3, 2, 3, 2–3, 1902/03, 1902-09-01
Sheffield United, Sheffield Wednesday, 1, 1, 1, 1, 1–1, 1903/04, 1903-12-12
Sheffield United, Sheffield Wednesday, 4, 2, 4, 2, 4–2, 1904/05, 1905-04-08
Sheffield United, Sheffield Wednesday, 0, 2, 0, 2, 0–2, 1905/06, 1905-10-21
Here is the R code I am using:
library(tidyverse)
library(readxl)
library(magrittr)
library(dplyr)
library(ggplot2)
library(tidyr)
Sheff_derby_R <- read_excel("sheffield_derby/Sheff_derby_R.xlsx",
    col_types = c("text", "text", "text",
        "text", "text", "text", "text",
        "text",
        "date", "text", "numeric", "numeric",
        "numeric", "numeric", "numeric",
        "numeric", "numeric", "numeric",
        "numeric", "numeric", "numeric",
        "numeric", "numeric", "numeric",
        "numeric", "numeric", "numeric",
        "numeric", "numeric", "numeric",
        "numeric", "numeric", "numeric",
        "numeric", "numeric", "numeric",
        "numeric", "numeric", "numeric",
        "numeric", "numeric", "numeric",
        "numeric", "numeric", "numeric",
        "numeric", "numeric", "numeric",
        "numeric", "numeric", "numeric",
        "numeric", "text"))
View(Sheff_derby_R)
In R, the 18xx dates (the last column shown, not the 18xx/xx Season column) are replaced with NA. Here is the head, the first 12 rows:
Home_Team Away_Team SUFC SWFC H A Score Season Date
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dttm>
1 Sheffiel~ Sheffiel~ 1 1 1 1 1 – 1 1893/~ NA
2 Sheffiel~ Sheffiel~ 1 0 1 0 1 – 0 1894/~ NA
3 Sheffiel~ Sheffiel~ 1 1 1 1 1 – 1 1895/~ NA
4 Sheffiel~ Sheffiel~ 2 0 2 0 2 – 0 1896/~ NA
5 Sheffiel~ Sheffiel~ 1 1 1 1 1 – 1 1897/~ NA
6 Sheffiel~ Sheffiel~ 2 1 2 1 2 – 1 1898/~ NA
7 Sheffiel~ Sheffiel~ 1 0 1 0 1 – 0 1900/~ 1900-12-15 00:00:00
8 Sheffiel~ Sheffiel~ 3 0 3 0 3 – 0 1901/~ 1902-03-01 00:00:00
9 Sheffiel~ Sheffiel~ 2 3 2 3 2 – 3 1902/~ 1902-09-01 00:00:00
10 Sheffiel~ Sheffiel~ 1 1 1 1 1 – 1 1903/~ 1903-12-12 00:00:00
11 Sheffiel~ Sheffiel~ 4 2 4 2 4 – 2 1904/~ 1905-04-08 00:00:00
12 Sheffiel~ Sheffiel~ 0 2 0 2 0 – 2 1905/~ 1905-10-21 00:00:00
Answer 0 (score: 1)
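Regardless of what is going wrong, here is a possible (temporary) workaround.
First, read the column in as "text" to see one reason R is choking a bit. (I have simplified the read_excel arguments here, since in this case "text" is the default for Date. In your case, just change "date" to "text" in col_types.)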
library(readxl)
x <- read_excel("Sheff_derby_SO.xlsx")
x$Date
# [1] "1893-10-16" "1895-01-12" "1895-12-26" "1896-12-26" "1897-12-27"
# [6] "1898-12-26" "350" "791" "975" "1442"
# [11] "1925"
The dates from 1900 onward come through as integers. They all happen to be based on the same date origin, so we can do the following:
wrong <- !grepl("-", x$Date)
as.Date("1900-01-01") + as.integer(x$Date[wrong]) - 2L
# [1] "1900-12-15" "1902-03-01" "1902-09-01" "1903-12-12" "1905-04-08"
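This matches what Excel tells me the dates should be.
N.B.: I was hoping for a simple offset from 1900-01-01, but it needs the extra - 2L to line up. That suggests something else may be going on, so please verify against all of your data (in case it does not hold for all of it) that this hack also works for the other values.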
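One plausible explanation for the two-day shift, offered as background rather than as something the workaround above confirms: in Excel's 1900 date system, serial 1 is 1900-01-01 (which accounts for one day), and Excel counts a non-existent 1900-02-29 (which accounts for the second day for anything from March 1900 onward). Under that assumption, the expression above is the same as using the conventional origin "1899-12-30". The sketch below checks this on the serial numbers shown in the output earlier; the variable names are illustrative only.

```r
# Serial numbers copied from the x$Date output shown above
serials <- c(350L, 791L, 975L, 1442L, 1925L)

# The workaround as written: 1900-01-01 plus the serial, minus 2 days
via_offset <- as.Date("1900-01-01") + serials - 2L

# Equivalent form, assuming Excel's 1900 date system:
# origin 1899-12-30 absorbs both one-day shifts
via_origin <- as.Date(serials, origin = "1899-12-30")

# Both approaches should agree on every value
stopifnot(all(via_offset == via_origin))
format(via_origin)
```

Note that this origin is only expected to be right for serials of 61 and above (1900-03-01 and later). Serials 1 to 59 would come out one day early, which does not matter for this dataset because its earliest numeric date is in December 1900.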
Putting the corrected values back into the dataset is straightforward:
sav <- as.Date("1900-01-01") + as.integer(x$Date[wrong]) - 2L
x$Date <- as.Date(x$Date) # 'wrong' ones will be NA
x$Date[wrong] <- sav
x$Date
# [1] "1893-10-16" "1895-01-12" "1895-12-26" "1896-12-26" "1897-12-27"
# [6] "1898-12-26" "1900-12-15" "1902-03-01" "1902-09-01" "1903-12-12"
# [11] "1905-04-08"
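The replacement steps above could also be wrapped in a small function so they can be re-run on the full 130-row column. This is a minimal sketch, not part of readxl: fix_excel_dates is a hypothetical helper name, and the default origin "1899-12-30" assumes Excel's 1900 date system. Values that are neither ISO date strings nor whole numbers are left as NA so they can be inspected by hand.

```r
# Hypothetical helper: convert a character vector that mixes ISO date
# strings ("1893-10-16") with Excel serial numbers ("350") into Dates.
fix_excel_dates <- function(x, origin = "1899-12-30") {
  out <- rep(as.Date(NA), length(x))

  # Classify each element; anything unrecognised stays NA
  is_serial <- !is.na(x) & grepl("^[0-9]+$", x)
  is_iso    <- !is.na(x) & grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", x)

  # Serial numbers: count days from the assumed Excel origin
  out[is_serial] <- as.Date(as.integer(x[is_serial]), origin = origin)
  # Text dates (the pre-1900 rows): parse directly
  out[is_iso] <- as.Date(x[is_iso], format = "%Y-%m-%d")

  out
}

# Usage with the values from the x$Date output shown above
raw <- c("1893-10-16", "1895-01-12", "1895-12-26", "1896-12-26",
         "1897-12-27", "1898-12-26", "350", "791", "975", "1442", "1925")
fixed <- fix_excel_dates(raw)
format(fixed)
```

With a data frame read as in the answer, the call would be x$Date <- fix_excel_dates(x$Date). Checking sum(is.na(x$Date)) afterwards is a quick way to spot rows the helper did not recognise.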