Building a loop to compare two data frames and fill in missing values

Time: 2017-11-01 19:05:53

Tags: r loops dataframe match fill

Suppose I have this data:

C=
year month  day  hour minute    rain
2010    01   01    00     00   0.000
2011    01   01    00     00   0.000
2012    01   01    00     00   0.000
2013    01   01    00     00   0.000
2014    01   01    00     00   0.000
2015    01   01    00     15   0.000

and reference data such as:

R=
year month  day  hour minute    rain
2013    01   01    00     00   0.000
2013    01   01    00     05   0.000
2013    01   01    00     10   0.000
2013    01   01    00     15   0.000
2013    01   01    00     20   0.000
2014    01   01    00     00   0.000
2014    01   01    00     05   0.000
2014    01   01    00     10   0.000
2014    01   01    00     15   0.000
2014    01   01    00     20   0.000
2015    01   01    00     00   0.000
2015    01   01    00     05   0.000
2015    01   01    00     10   0.000
2015    01   01    00     15   0.000
2015    01   01    00     20   0.000

I need to complete this data frame M:

M=
year month  day  hour minute    rain
2013    01   01    00     00   0.000
2013    01   01    00     05   0.000
2013    01   01    00     10   0.000
2013    01   01    00     15   0.000
2013    01   01    00     20   0.000
2014    01   01    00     00   0.000
2014    01   01    00     05   0.000
2014    01   01    00     10   0.000
2014    01   01    00     15   0.000
2014    01   01    00     20   0.000
2015    01   01    00     15   0.000
2015    01   01    00     20   0.000

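As you can see, the 2015 data in M starts at "2015 01 01 00 15 0.000". We could use the information in C to build a loop over nrow(C) and look up the start date. The idea is to use a loop to complete this data frame: match against the reference data R to fill in year, month, day, hour and minute, and fill the empty rain column with "NaN". The final output is: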

F=
year month  day  hour minute    rain
2013    01   01    00     00   0.000
2013    01   01    00     05   0.000
2013    01   01    00     10   0.000
2013    01   01    00     15   0.000
2013    01   01    00     20   0.000
2014    01   01    00     00   0.000
2014    01   01    00     05   0.000
2014    01   01    00     10   0.000
2014    01   01    00     15   0.000
2014    01   01    00     20   0.000
2015    01   01    00     00   NaN
2015    01   01    00     05   NaN  
2015    01   01    00     10   NaN
2015    01   01    00     15   0.000
2015    01   01    00     20   0.000

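1 answer:

Answer 0: (score: 3)

To complete the rows missing from the data.table M using the reference data R, a right join can be used, so there is no need for a loop:

library(data.table)
setDT(M)[setDT(R)[, -"rain"], on = .(year, month, day, hour, minute)]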

    year month day hour minute rain
 1: 2013     1   1    0      0    0
 2: 2013     1   1    0      5    0
 3: 2013     1   1    0     10    0
 4: 2013     1   1    0     15    0
 5: 2013     1   1    0     20    0
 6: 2014     1   1    0      0    0
 7: 2014     1   1    0      5    0
 8: 2014     1   1    0     10    0
 9: 2014     1   1    0     15    0
10: 2014     1   1    0     20    0
11: 2015     1   1    0      0   NA
12: 2015     1   1    0      5   NA
13: 2015     1   1    0     10   NA
14: 2015     1   1    0     15    0
15: 2015     1   1    0     20    0
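
The question asks for NaN rather than NA in the filled rows. A minimal sketch of one way to do that conversion after the join, using a cut-down version of the sample data (the names `ref`, `obs` and `res` are illustrative and do not come from the original answer):

```r
library(data.table)

# Cut-down stand-ins for R (reference grid) and M (incomplete observations);
# these names are hypothetical, chosen only for this sketch
ref <- data.table(year = 2015L, month = 1L, day = 1L, hour = 0L,
                  minute = c(0L, 5L, 10L, 15L, 20L), rain = 0)
obs <- ref[minute >= 15L]

# Right join as in the answer: every row of the reference grid is kept,
# and rows without a match in obs get rain = NA
res <- obs[ref[, -"rain"], on = .(year, month, day, hour, minute)]

# Overwrite the NA values with NaN, updating res by reference
res[is.na(rain), rain := NaN]
res
```

Note that `is.na()` cannot tell a row added by the join from a measurement that was already NA in the observations, so both would become NaN here.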

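Edit: controlling the years

The OP has asked here and here how to control the years. Because the code above implies a right join, all rows of R appear in the result set. R therefore needs to be filtered appropriately. This can be done by explicitly specifying a single year:

setDT(M)[setDT(R)[year == 2014L, -"rain"], on = .(year, month, day, hour, minute)]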

   year month day hour minute rain
1: 2014     1   1    0      0    0
2: 2014     1   1    0      5    0
3: 2014     1   1    0     10    0
4: 2014     1   1    0     15    0
5: 2014     1   1    0     20    0

or a range of years:

setDT(M)[setDT(R)[year %in% 2014:2018, -"rain"], on = .(year, month, day, hour, minute)]

    year month day hour minute rain
 1: 2014     1   1    0      0    0
 2: 2014     1   1    0      5    0
 3: 2014     1   1    0     10    0
 4: 2014     1   1    0     15    0
 5: 2014     1   1    0     20    0
 6: 2015     1   1    0      0   NA
 7: 2015     1   1    0      5   NA
 8: 2015     1   1    0     10   NA
 9: 2015     1   1    0     15    0
10: 2015     1   1    0     20    0

or by looking up the years that are included in M:

M[, unique(year)]
[1] 2013 2014 2015
setDT(M)[setDT(R)[year %in% M[, unique(year)], -"rain"], 
         on = .(year, month, day, hour, minute)]
    year month day hour minute rain
 1: 2013     1   1    0      0    0
 2: 2013     1   1    0      5    0
 3: 2013     1   1    0     10    0
 4: 2013     1   1    0     15    0
 5: 2013     1   1    0     20    0
 6: 2014     1   1    0      0    0
 7: 2014     1   1    0      5    0
 8: 2014     1   1    0     10    0
 9: 2014     1   1    0     15    0
10: 2014     1   1    0     20    0
11: 2015     1   1    0      0   NA
12: 2015     1   1    0      5   NA
13: 2015     1   1    0     10   NA
14: 2015     1   1    0     15    0
15: 2015     1   1    0     20    0

Data

R <- structure(list(year = c(2013L, 2013L, 2013L, 2013L, 2013L, 2014L, 
2014L, 2014L, 2014L, 2014L, 2015L, 2015L, 2015L, 2015L, 2015L
), month = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L), day = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L), hour = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L), minute = c(0L, 5L, 10L, 15L, 20L, 
0L, 5L, 10L, 15L, 20L, 0L, 5L, 10L, 15L, 20L), rain = c(0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), .Names = c("year", "month", 
"day", "hour", "minute", "rain"), row.names = c(NA, -15L), class = "data.frame")

M <- structure(list(year = c(2013L, 2013L, 2013L, 2013L, 2013L, 2014L, 
2014L, 2014L, 2014L, 2014L, 2015L, 2015L), month = c(1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), day = c(1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), hour = c(0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), minute = c(0L, 5L, 10L, 15L, 
20L, 0L, 5L, 10L, 15L, 20L, 15L, 20L), rain = c(0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0)), .Names = c("year", "month", "day", "hour", 
"minute", "rain"), row.names = c(NA, -12L), class = "data.frame")
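
For readers who do not use data.table, roughly the same right join can be sketched in base R with `merge()` and `all.y = TRUE`. This is a different technique from the one in the answer, shown on a cut-down version of the sample data; `ref`, `obs`, `keys` and `res` are illustrative names only:

```r
# Cut-down stand-ins for R (reference grid) and M (incomplete observations),
# as plain data frames; the names are hypothetical
ref <- data.frame(year = 2015L, month = 1L, day = 1L, hour = 0L,
                  minute = c(0L, 5L, 10L, 15L, 20L), rain = 0)
obs <- ref[ref$minute >= 15L, ]

# Columns that identify a timestamp in both data frames
keys <- c("year", "month", "day", "hour", "minute")

# all.y = TRUE keeps every row of the reference grid; timestamps that are
# absent from obs end up with rain = NA
res <- merge(obs, ref[keys], by = keys, all.y = TRUE)
res
```

Dropping the rain column from the reference side (`ref[keys]`) plays the same role as `-"rain"` in the data.table version: only the timestamps are taken from the reference, and the rain values come from the observations.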
