Count missing values after the first recorded measurement

Time: 2014-03-21 11:14:09

Tags: r missing-data

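My environmental data has missing values. Measurements for some of the variables started in different years.

Using sapply(df, function(x) sum(is.na(x))) I get the number of missing values in each column. However, I want to count missing values only from the point at which at least one measurement is available. For example, for o3, counting from when the o3 measurements begin, the number of missing values should be only 3. I would also like to extract the first date on which a measurement is available (in the sample data, temp starts on 01-03-1990 and o3 on 09-03-1990). In summary, I want to: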

1.  Extract the first date of available measurement for each column.
2.  Count the number of missing values after at least one measurement is available.

The sample data is as follows:

> dput(df)
structure(list(date = structure(c(7364, 7365, 7366, 7367, 7368, 
7369, 7370, 7371, 7372, 7373, 7374, 7375, 7376, 7377, 7378, 7379, 
7380, 7381, 7382, 7383, 7384), class = "Date"), no2 = c(51.7008334795634, 
33.8999998569489, 29.7854166030884, 29.0558333396912, 28.5108333031336, 
31.9637500842412, 36.1283330917358, 24.6608331998189, 33.2682609558105, 
NA, NA, NA, 53.1133330663045, 54.1575004259745, 43.7712502479553, 
31.0166666905085, 31.9995832443237, 33.3491666316986, NA, NA, 
35.5604347353396), temp = c(1.12583327293396, 0.230416655540466, 
-0.415833324193954, 3.50333333015442, 4.88708353042603, 3.54916667938232, 
2.15291666984558, 6.84916687011719, 3.79416656494141, 1.50416672229767, 
0.736666679382324, 3.33291673660278, -0.466250002384186, 1.47374999523163, 
6.84124994277954, 9.93249988555908, NA, NA, NA, 6.88000011444092, 
6.19999980926514), humidity = c(NA, 75.1428604125977, 64.375, 
NA, 82.125, 61.375, 71.5, 68.25, NA, 74, 82.375, 82.5, 60.875, 
80, 82.625, 88.75, 78.5, 73.125, 68.5, 49.2811088562012, 79.8091659545898
), o3 = c(NA, NA, NA, NA, NA, NA, NA, NA, 63.0712509155273, 69.6487503051758, 
60.903751373291, NA, 72.942497253418, NA, NA, 66.2587509155273, 
78.3262481689453, 101.066246032715, 112.137496948242, 77.0224990844727, 
68.5950012207031)), .Names = c("date", "no2", "temp", "humidity", 
"o3"), row.names = c("60", "61", "62", "63", "64", "65", "66", 
"67", "68", "69", "70", "71", "72", "73", "74", "75", "76", "77", 
"78", "79", "80"), class = "data.frame")

1 Answer:

Answer 0 (score: 2)

Get the row index of the first non-missing value in each column, then look up the corresponding dates:

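# Row index of the first non-NA value in each column (the date column gives 1)
first <- sapply(df, function(x) which(!is.na(x))[1])
# Dates at those rows, in the same order as names(first)
dateOfFirst <- df$date[first]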
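
As a small check of this idea, here is a minimal sketch on made-up data. The data frame `toy`, its columns `a` and `b`, and the variable names are assumptions for illustration only, not the data from the question. It also uses `setNames` so each date stays labelled with its column name.

```r
# Toy data (hypothetical), not the df from the question
toy <- data.frame(
  date = as.Date("1990-03-01") + 0:4,
  a    = c(1.5, NA, 2.0, NA, 3.1),
  b    = c(NA, NA, 4.2, NA, 5.0)
)

# Row index of the first non-missing value in each column
first_toy <- sapply(toy, function(x) which(!is.na(x))[1])

# Dates at those rows, labelled with the column names
date_of_first_toy <- setNames(toy$date[first_toy], names(first_toy))
date_of_first_toy
```

Here `b` has its first value in row 3, so its first date is two days after the start of the series.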

Then the number of NAs after the initial run of NAs is the total number of NAs minus the length of that initial run:

numberOfMissing <- sapply(df, function(x) sum(is.na(x))) - (first-1)
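
An equivalent way to see the same count is to drop the leading values of each column and count the NAs that remain. The sketch below assumes a hypothetical helper named `count_na_after_first` and a made-up vector `o3_like`. It also makes one choice the answer above does not discuss: returning NA for a column with no measurements at all.

```r
# Hypothetical helper: count NAs from the first observed value onward
count_na_after_first <- function(x) {
  i <- which(!is.na(x))[1]           # index of the first non-NA value
  if (is.na(i)) return(NA_integer_)  # no measurement at all in this column
  sum(is.na(x[i:length(x)]))         # NAs from that index to the end
}

# Made-up vector with two leading NAs and three later gaps
o3_like <- c(NA, NA, 63.1, NA, 72.9, NA, NA, 66.3)
count_na_after_first(o3_like)        # 3
count_na_after_first(c(NA, NA, NA))  # NA
```

On the question's data this could be applied per column with something like sapply(df[-1], count_na_after_first), where df[-1] drops the date column.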