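I have been using a fair amount of code that flags when certain requirements are not met, or tells me which entries are duplicated, but I have not figured out how to write a check for when a count requirement is not met.
I am working with a fairly ordinary data frame that contains dates. Normally each day should have 24 entries, one per hour, but in some cases there are more or fewer. I need something that tells me which entry number/date does not meet the 24-entry criterion. Does anyone have suggestions on how to approach this?
I have attached an example of the code I have used so far (for the other checks).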
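td_1 <- read.csv("testdata_1.csv", header=TRUE)
# The timestamps are day-first (dd/mm/yyyy), so state the format explicitly
td_1$OB_DATE <- as.Date(td_1$OB_DATE, format = "%d/%m/%Y")
# Every calendar day between the first and last observation
valueMissing <- seq(min(td_1$OB_DATE), max(td_1$OB_DATE), by = 1)
# Days with no observations at all
valueMissing[!valueMissing %in% td_1$OB_DATE]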
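# Index of the first duplicated value (0 if there are none)
countDup <- anyDuplicated(td_1$OB_DATE)
# Rows whose OB_DATE repeats an earlier row
valueDup <- td_1[duplicated(td_1$OB_DATE),]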
Below is a sample of the data (note that the real data set has more than 500,000 rows; this is only a small sample):
OB_DATE AIR_TEMPERATURE
09/05/1973 00:00 10
09/05/1973 01:00 10.2
09/05/1973 02:00 10
09/05/1973 03:00 10
09/05/1973 04:00 9.9
09/05/1973 05:00 9.9
09/05/1973 06:00 10.2
09/05/1973 07:00 10.8
09/05/1973 08:00 12.2
09/05/1973 09:00 11.9
09/05/1973 10:00 12.7
09/05/1973 11:00 12.8
09/05/1973 12:00 13.4
09/05/1973 13:00 13.9
09/05/1973 14:00 14.6
09/05/1973 15:00 13.5
09/05/1973 16:00 13.5
09/05/1973 17:00 12.8
09/05/1973 18:00 12.2
09/05/1973 19:00 11.9
09/05/1973 20:00 11
09/05/1973 21:00 10.3
09/05/1973 22:00 10.2
09/05/1973 23:00 10
10/05/1973 00:00 10
10/05/1973 01:00 9.8
10/05/1973 02:00 9.6
10/05/1973 03:00 9.7
10/05/1973 04:00 9.5
10/05/1973 05:00 8.5
10/05/1973 06:00 7.5
10/05/1973 07:00 7.8
10/05/1973 08:00 8.8
10/05/1973 09:00 9.6
10/05/1973 10:00 10
10/05/1973 11:00 11
10/05/1973 12:00 8
10/05/1973 13:00 10.3
10/05/1973 14:00 12.2
10/05/1973 15:00 12.7
10/05/1973 16:00 12.7
10/05/1973 17:00 12.4
10/05/1973 17:00 12.4
10/05/1973 18:00 12
10/05/1973 18:00 12
10/05/1973 19:00 10.9
10/05/1973 20:00 9.4
10/05/1973 21:00 7.2
10/05/1973 22:00 6.7
10/05/1973 23:00 6.8
11/05/1973 00:00 5.7
11/05/1973 01:00 5.2
11/05/1973 02:00 4.7
11/05/1973 03:00 4.3
11/05/1973 04:00 4
11/05/1973 05:00 4.2
11/05/1973 06:00 5
11/05/1973 08:00 8.4
11/05/1973 09:00 9.2
11/05/1973 10:00 10.8
11/05/1973 11:00 11.7
11/05/1973 12:00 11.4
11/05/1973 13:00 12.9
11/05/1973 14:00 13.3
11/05/1973 15:00 13.3
11/05/1973 16:00 13.5
11/05/1973 17:00 13.6
11/05/1973 18:00 12.6
11/05/1973 19:00 11.8
11/05/1973 20:00 10.3
11/05/1973 21:00 9.7
11/05/1973 22:00 8.8
11/05/1973 23:00 7.6
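In this sample, 09/05/1973 has the full 24 entries, but 10/05/1973 has 26 entries and 11/05/1973 has only 23. I need something that alerts me to this, for example by returning the dates 10/05/1973 and 11/05/1973 (similar to the output my missing-values code produces).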
Answer 0 (score: 2)
We can use data.table:
library(data.table)
# new = 1 for rows on days with exactly 24 entries, 0 otherwise (timestamps are day-first)
setDT(df)[, new := as.integer(.N==24), by = .(Date=as.IDate(OB_DATE, "%d/%m/%Y %H:%M"))]
head(df)
# OB_DATE AIR_TEMPERATURE new
#1: 09/05/1973 00:00 10.0 1
#2: 09/05/1973 01:00 10.2 1
#3: 09/05/1973 02:00 10.0 1
#4: 09/05/1973 03:00 10.0 1
#5: 09/05/1973 04:00 9.9 1
#6: 09/05/1973 05:00 9.9 1
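The flag above is attached to every row. To get only the offending dates, which is what the question asks for, one option is to count rows per day and keep the days whose count is not 24. This is a minimal sketch, assuming the timestamps are day-first (`%d/%m/%Y %H:%M`); `toy`, `counts`, and `bad_days` are made-up names, and `toy` is a small invented table standing in for the real data.

```r
library(data.table)

# Hypothetical stand-in for the real data: one complete day, one day with a
# duplicated hour, and one day with a missing hour.
hours <- sprintf("%02d:00", 0:23)
toy <- data.table(
  OB_DATE = c(paste("09/05/1973", hours),
              paste("10/05/1973", c(hours, "17:00")),
              paste("11/05/1973", hours[-8]))  # drops 07:00
)

# Count rows per calendar day, then keep only days that do not have 24 rows.
counts <- toy[, .N, by = .(Date = as.IDate(OB_DATE, format = "%d/%m/%Y %H:%M"))]
bad_days <- counts[N != 24]
print(bad_days)
```

On this toy table, `bad_days` holds the two incomplete days together with their row counts (25 and 23), so the same table also shows whether a day has too many or too few entries.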
Answer 1 (score: 1)
Using dplyr:
library(dplyr)
df %>%
group_by(dates = gsub('\\s+.*', '', OB_DATE)) %>%
summarise(new = n())
# A tibble: 3 × 2
# dates new
# <chr> <int>
#1 1973-09-05 24
#2 1973-10-05 26
#3 1973-11-05 23
Or, similarly, you can flag each row:
df %>%
group_by(dates = gsub('\\s+.*', '', OB_DATE)) %>%
# new = 1 for dates that do not satisfy the 24-entry criterion, 0 otherwise
mutate(new = ifelse(n() == 24, 0, 1)) %>%
ungroup() %>% # drop the grouping so that `dates` can be removed
select(-dates)
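If it also helps to know why a day fails the criterion, the per-day summary can be extended to report which hours are duplicated and which are missing. The following is only a sketch under assumptions: the names `toy`, `report`, `dup_hours`, and `missing_hours` are made up for illustration, the timestamps are assumed to look like `dd/mm/yyyy HH:MM`, and the data is assumed to be strictly hourly on the hour.

```r
library(dplyr)

# Hypothetical stand-in for the real data: one complete day, one day with a
# duplicated hour, and one day with a missing hour.
hours <- sprintf("%02d:00", 0:23)
toy <- data.frame(
  OB_DATE = c(paste("09/05/1973", hours),
              paste("10/05/1973", c(hours, "17:00")),
              paste("11/05/1973", hours[-8])),  # drops 07:00
  stringsAsFactors = FALSE
)

report <- toy %>%
  # Split each timestamp into its date part and its time part.
  mutate(day  = sub("\\s+.*", "", OB_DATE),
         hour = sub(".*\\s+", "", OB_DATE)) %>%
  group_by(day) %>%
  summarise(
    n_obs         = n(),
    # Hours that occur more than once on this day.
    dup_hours     = paste(unique(hour[duplicated(hour)]), collapse = ", "),
    # Expected hours that never occur on this day.
    missing_hours = paste(setdiff(hours, hour), collapse = ", ")
  ) %>%
  filter(n_obs != 24)

print(report)
```

Note that a day with one duplicated hour and one missing hour still has 24 rows, so it passes the count check; filtering on `dup_hours != "" | missing_hours != ""` instead of `n_obs != 24` would catch that case as well.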