I am working with a dataset that is larger than I am used to (286,212 rows, 19 columns), and I am not sure how to approach my problem. The data consist of a value for each day of the year at 782 grid references, covering 15 years. It looks as follows:
**Month Day Grid x2004 x2005 x2006 x2007**
1 1 A10 0.091 0.134 NA 0.066
1 2 A10 0.12 0.10 0.23 0.054
1 3 A10 0.55 NA NA 0.08
1 1 B10 NA 0.134 NA 0.17
1 2 B10 0.14 0.151 NA 0.21
1 3 B10 0.43 0.162 0.24 NA
However, some of the days are missing, and I want to fill each gap with the mean of that day for that specific grid, using the values from the other years. So if Grid A10 for day 1 in 2006 is missing, I want to insert the mean for day 1, grid A10 from 2004, 2005 and 2007, in this case 0.097.
I am trying the following code
ind <- which(is.na(data$x2005))
data$x2005[ind] <- sapply(ind, function(i)
with(data, rowMeans(data[c(data$x2004[i], data$x2006[i], data$x2007[i], data$x2008[i], data$x2009[i],
data$x2010[i], data$x2011[i], data$x2012[i],
data$x2013[i], data$x2014[i], data$x2015[i],
data$x2016[i], data$x2017[i]),], na.rm=TRUE)))
and I plan to do that for all years but it is telling me
"Error in rowMeans(data[c(data$x2006[i], data$x2007[i], data$x2012[i]), :
'x' must be numeric"
However, when I check the class of the year columns, they are all numeric, so I am not sure why 'x' is reported as not numeric. I also don't know whether, once the mean part is sorted out, the code will actually give me the mean specific to each grid and day.
Please Help. Thanks
Answer 0 (score: 0)
Can you adapt this to your code:
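# Fills each NA with the mean of its own column (skips non-numeric columns such as Grid)
for (i in seq_len(ncol(data))) {
  if (is.numeric(data[[i]])) {
    data[is.na(data[, i]), i] <- mean(data[, i], na.rm = TRUE)
  }
}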
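Note that the loop above fills each gap with the mean of the whole column (one year, all grids and days), whereas the question asks for the mean across years within a single row. Since each row is already one Month/Day/Grid combination, a row-wise mean over the year columns may be closer to what is wanted. The error in the question most likely arises because `data[c(...), ]` uses the year values as row indices and so returns whole rows, including the character `Grid` column, which `rowMeans()` cannot average. Below is a minimal sketch of the row-wise approach, not a tested solution for the full dataset. The toy data frame is copied from the sample table, and the assumption that every year column is named `x` followed by four digits is mine.

```r
# Toy data mirroring the question's layout (values copied from the sample table)
data <- data.frame(
  Month = c(1, 1, 1, 1, 1, 1),
  Day   = c(1, 2, 3, 1, 2, 3),
  Grid  = c("A10", "A10", "A10", "B10", "B10", "B10"),
  x2004 = c(0.091, 0.12, 0.55, NA, 0.14, 0.43),
  x2005 = c(0.134, 0.10, NA, 0.134, 0.151, 0.162),
  x2006 = c(NA, 0.23, NA, NA, NA, 0.24),
  x2007 = c(0.066, 0.054, 0.08, 0.17, 0.21, NA)
)

# Assumption: every year column is named "x" followed by four digits
year_cols <- grep("^x[0-9]{4}$", names(data), value = TRUE)

# Work on a numeric matrix holding only the year columns
vals <- as.matrix(data[year_cols])

# One mean per row (i.e. per Month/Day/Grid combination), ignoring NAs
row_means <- rowMeans(vals, na.rm = TRUE)

# Locate every NA as a (row, col) pair and fill it with that row's mean
na_pos <- which(is.na(vals), arr.ind = TRUE)
vals[na_pos] <- row_means[na_pos[, "row"]]

# Write the filled values back into the data frame
data[year_cols] <- vals
```

This handles all years in one pass, so there is no need to repeat the code per year column. If a row has no observed value in any year, `rowMeans()` returns `NaN` for it, so such rows would still need separate handling.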