NAs in rasters and randomForest::predict()

Time: 2014-06-17 18:29:27

Tags: r raster random-forest na

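I'm new here, so please let me know if you need more information.

My goal: I am using Rehfeldt climate data and eBird presence/absence data to produce niche models with Random Forest.

My problem: I want to predict the niche models across all of North America. The Rehfeldt climate rasters have data values for every cell on the continent, but those cells are surrounded by NAs in the "ocean cells". In the linked plot, I colored the NAs dark green. randomForest::predict() does not run if the independent dataset contains NAs. Therefore, I want to crop my climate rasters (or set a working extent?) so that predict() only operates on cells that contain data.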

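Troubleshooting:

  1. I ran the Random Forest model over a smaller extent that does not include the raster's "NA oceans", and the model ran just fine. So I know the NAs are the problem. However, I don't want to predict my niche model for just a rectangular block of North America.

  2. I used flowla's approach (linked) to crop and mask the rasters with a polygon shapefile of North America. I hoped this would remove the NAs, but it does not. Is there something similar I can do that removes the NAs?

  3. I have done some reading, but I can't find a way to tweak the Random Forest code itself so that predict() ignores NAs. A linked post looks relevant, but I'm not sure whether it helps in my case.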

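Data:

My rasters, the input presence/absence text file, and the code for the additional functions are available at the linked location. Use the main code below as a reproducible example.

Code: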

    require(sp)
    require(rgdal)
    require(raster)
    library(maptools)
    library(mapproj)
    library(dismo)
    library(maps)
    library(proj4)
    data(stateMapEnv)
    
    # This source code has all of the functions necessary for running the Random Forest models, as well as the code for the function detecting multi-collinearity
    source("Functions.R")
    
    # Read in Rehfeldt climate rasters
    # these rasters were converted to .img and given WGS 84 projection in ArcGIS
    
    d100 <- raster("d100.img")
    dd0 <- raster("dd0.img")
    dd5 <- raster("dd5.img")
    fday <- raster("fday.img")
    ffp <- raster("ffp.img")
    gsdd5 <- raster("gsdd5.img")
    gsp <- raster("gsp.img")
    map <- raster("map.img")
    mat <- raster("mat_tenths.img")
    mmax <- raster("mmax_tenths.img")
    mmin <- raster("mmin_tenths.img")
    mmindd0 <- raster("mmindd0.img")
    mtcm <- raster("mtcm_tenths.img")
    mtwm <- raster("mtwm_tenths.img")
    sday <- raster("sday.img")
    smrpb <- raster("smrpb.img")
    
    # add separate raster files into one big raster, with each file being a different layer.
    rehfeldt <- addLayer(d100, dd0, dd5, fday, ffp, gsdd5, gsp, map, mat, mmax, mmin, mmindd0, mtcm, mtwm, sday, smrpb)
    
    # plot some rasters to make sure everything worked
    plot(d100)
    plot(rehfeldt)
    
    # read in presence/absence data
    LAZB.INBUtemp <- read.table("LAZB.INBU.txt", header=T, sep = "\t")
    colnames(LAZB.INBUtemp) <- c("Lat", "Long", "LAZB", "INBU")
    LAZB.INBUtemp <- LAZB.INBUtemp[c(2,1,3,4)]
    LAZB.INBU <- LAZB.INBUtemp
    
    latpr <- (LAZB.INBU$Lat)
    lonpr <- (LAZB.INBU$Long)
    sites <- SpatialPoints(cbind(lonpr, latpr))
    LAZB.INBU.spatial <- SpatialPointsDataFrame(sites, LAZB.INBU, match.ID=TRUE)
    
    # The below function extracts raster values for each of the different layers for each of the eBird locations
    pred <- raster::extract(rehfeldt, LAZB.INBU.spatial)
    LAZB.INBU.spatial@data = data.frame(LAZB.INBU.spatial@data, pred)
    LAZB.INBU.spatial@data <- na.omit(LAZB.INBU.spatial@data)
    
    # ITERATIVE TEST FOR MULTI-COLINEARITY
    # Determines which variables show multicolinearity
    cl <- MultiColinear(LAZB.INBU.spatial@data[,7:ncol(LAZB.INBU.spatial@data)], p=0.05)
    xdata <- LAZB.INBU.spatial@data[,7:ncol(LAZB.INBU.spatial@data)]  
    for(l in cl) {
      cl.test <- xdata[,-which(names(xdata)==l)]
      print(paste("REMOVE VARIABLE", l, sep=": "))
      MultiColinear(cl.test, p=0.05)    
    }
    
    # REMOVE MULTI-COLINEAR VARIABLES
    for(l in cl) { LAZB.INBU.spatial@data <- LAZB.INBU.spatial@data[,-which(names(LAZB.INBU.spatial@data)==l)] }
    
    
    ################################################################################################
    
    # FOR LAZB
    # RANDOM FOREST MODEL AND RASTER PREDICTION
    
    require(randomForest)
    
    # NUMBER OF BOOTSTRAP REPLICATES
    b=1001
    
    # CREATE X,Y DATA
    
    # use column 3 for LAZB and 4 for INBU
    ydata <- as.factor(LAZB.INBU.spatial@data[,3])
    xdata <- LAZB.INBU.spatial@data[,7:ncol(LAZB.INBU.spatial@data)]
    
    # PERCENT OF PRESENCE OBSERVATIONS
    ( dim(LAZB.INBU.spatial[LAZB.INBU.spatial$LAZB == 1, ])[1] / dim(LAZB.INBU.spatial)[1] ) * 100
    
    # RUN RANDOM FORESTS MODEL SELECTION FUNCTION
    # This model is using the model improvement ratio to select a final model.
    
    pdf(file = "LAZB Random Forest Model Rehfeldt.pdf")
    ( rf.model <- rf.modelSel(x=xdata, y=ydata, imp.scale="mir", ntree=b) ) 
    dev.off()
    
    # RUN RANDOM FORESTS CLASS BALANCE BASED ON SELECTED VARIABLES
    # This code would help in the case of imbalanced sample
    mdata <- data.frame(y=ydata, xdata[,rf.model$SELVARS])
    rf.BalModel <- rfClassBalance(mdata[,1], mdata[,2:ncol(mdata)], "y", ntree=b)
    
    # CREATE NEW XDATA BASED ON SELECTED MODEL AND RUN FINAL RF MODEL
    sel.vars <- rf.model$PARAMETERS[[3]]
    rf.data <- data.frame(y=ydata, xdata[,sel.vars])  
    write.table(rf.data, "rf.data.txt", sep = ",", row.names = F)
    
    # This is the code given to me; it takes forever to run for my dataset (I haven't tried to let it finish)
    # ( rf.final <- randomForest(y ~ ., data=rf.data, ntree=b, importance=TRUE, norm.votes=TRUE, proximity=TRUE) )
    
    # I use this form because it's a lot faster
    ( rf.final <- randomForest(x = rf.data[2:6], y = rf.data$y, ntree=1000, importance=TRUE, norm.votes=TRUE, proximity=F) )
    
    ################################################################################################         
    # MODEL VALIDATION 
    # PREDICT TO VALIDATION DATA
    
    # Determines the percent correctly classified
    rf.pred <- predict(rf.final, rf.data[,2:ncol(rf.data)], type="response")
    rf.prob <- as.data.frame(predict(rf.final, rf.data[,2:ncol(rf.data)], type="prob"))
    ObsPred <- data.frame(cbind(Observed=as.numeric(as.character(ydata)), 
                                PRED=as.numeric(as.character(rf.pred)), Prob1=rf.prob[,2], 
                                Prob0=rf.prob[,1]) )
    op <- (ObsPred$Observed == ObsPred$PRED)
    ( pcc <- (length(op[op == "TRUE"]) / length(op))*100 )
    
    # PREDICT MODEL PROBABILITIES RASTER
    
    # The first line of code says what directory I'm working in, and then what folder in that directory has the raster files that I'm using to predict the range
    # The second line defines the x variables, which are the predictors in my final Random Forest model
    
    rpath=paste('~/YOURPATH', "example", sep="/")
    xvars <- stack(paste(rpath, paste(rownames(rf.final$importance), "img", sep="."), sep="/"))
    tr <-  blockSize(xvars)
    s <- writeStart(xvars[[1]], filename=paste('~/YOURPATH', "prob_LAZB_Rehfeldt.img", sep="/"), overwrite=TRUE)                                           
    for (i in 1:tr$n) {
      v <- getValuesBlock(xvars, row=tr$row[i], nrows=tr$nrows[i])
      v <- as.data.frame(v)         
      rf.pred <- predict(rf.final, v, type="prob")[,2]           
      writeValues(s, rf.pred, tr$row[i])
    }
    s <- writeStop(s)   
    
    prob_LAZB <- raster("prob_LAZB_Rehfeldt.img")
    
    # Write range prediction raster to .pdf
    pdf(file="LAZB_range_pred.pdf")
    plot(prob_LAZB)
    map("state", add = TRUE)
    dev.off()
    

Thanks!

1 answer:

Answer 0 (score: 0)

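Have you tried setting "na.action" when calling RF? The option is clearly documented in the randomForest R manual. Your call to RF would look like this:

    rf.final <- randomForest(y ~ ., data = rf.data[1:6], ntree = 1000, importance = TRUE, norm.votes = TRUE, proximity = FALSE, na.action = na.omit)

This tells RF to omit the rows where NAs exist, thereby discarding those observations. Note that na.action expects a function such as na.omit, and, as far as the manual indicates, it is an argument of the formula interface, which is why the call above uses y ~ . rather than x = and y =. This is not necessarily the best approach, but it may be handy in your situation.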
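Because the error in the question is raised at prediction time rather than at training time, the same "omit the NA rows" idea can also be applied to the new data. The following is a minimal sketch, not the asker's actual pipeline: the toy data, the model rf, and the helper predict_skip_na() are all hypothetical names introduced for illustration. It predicts only for rows where every predictor is present and leaves NA everywhere else.

```r
library(randomForest)

set.seed(42)

# Toy training data standing in for rf.data (hypothetical values)
train <- data.frame(
  y  = factor(rep(c(0, 1), each = 50)),
  x1 = c(rnorm(50, 0), rnorm(50, 3)),
  x2 = c(rnorm(50, 0), rnorm(50, 3))
)

# x/y interface, mirroring the call used in the question
rf <- randomForest(x = train[, c("x1", "x2")], y = train$y, ntree = 100)

# Toy block of raster cell values; rows 2, 4 and 5 mimic cells with missing data
v <- data.frame(x1 = c(0.1, NA, 3.2, NA, 2.9),
                x2 = c(-0.2, NA, 2.8, 1.0, NA))

# Hypothetical helper: predict where all predictors are present, NA elsewhere
predict_skip_na <- function(model, newdata) {
  ok  <- complete.cases(newdata)
  out <- rep(NA_real_, nrow(newdata))
  if (any(ok)) {
    # Column 2 of the probability matrix is the second factor level ("1")
    out[ok] <- predict(model, newdata[ok, , drop = FALSE], type = "prob")[, 2]
  }
  out
}

p <- predict_skip_na(rf, v)
print(p)
```

In the block loop from the question, the line calling predict(rf.final, v, type="prob")[,2] could be swapped for predict_skip_na(rf.final, v), so that ocean cells are written out as NA. The raster package's own predict() method for Raster objects also has an na.rm argument that may achieve the same effect without a manual loop; that is worth checking against its documentation.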

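Option 2: rfImpute or na.roughfix: these fill in your NAs so that you can go ahead with the prediction. Be careful, because this can produce spurious predictions wherever the NAs were imputed or "fixed".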
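As a rough illustration of what the "fix" does, the sketch below applies na.roughfix() to a small, made-up data frame (the values and the names v, v.fixed and imputed are hypothetical). For numeric columns it fills each NA with the column median, so it helps to record which rows were incomplete before imputing.

```r
library(randomForest)

# Toy block of cell values (hypothetical numbers); NA marks missing predictors
v <- data.frame(x1 = c(1, 2, NA, 4, 100),
                x2 = c(NA, 10, 20, 30, 40))

# Remember which rows were incomplete before anything is filled in
imputed <- !complete.cases(v)

# na.roughfix() replaces numeric NAs with the column median
v.fixed <- na.roughfix(v)

print(v.fixed)   # x1[3] becomes 3 (median of 1, 2, 4, 100); x2[1] becomes 25
print(imputed)   # TRUE for rows 1 and 3
```

The imputed vector can later be used to blank out the predictions for those rows. One caveat to keep in mind: if a column is NA in every row of a block (for example a block that is entirely ocean), there is no median to fill in, so the NAs would presumably remain.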

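Option 3: Following on from Option 2, once you have your predictions, bring the raster into the GIS or image-processing software of your choice and mask out the areas you don't want. In your case, masking out the water bodies would be quite simple.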
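The masking step can also be done in R instead of separate GIS software. The following is a minimal sketch using mask() from the raster package on two small, made-up layers (clim, prob and prob.masked are hypothetical names, and the cell values are invented). It assumes the ocean cells are NA in at least one of the original predictor layers.

```r
library(raster)

# Toy predictor layer: 4 x 4 grid whose left-hand column is "ocean" (NA)
clim <- raster(nrows = 4, ncols = 4, xmn = 0, xmx = 4, ymn = 0, ymx = 4)
vals <- 1:16
vals[c(1, 5, 9, 13)] <- NA   # cells are numbered row by row, so these form column 1
values(clim) <- vals

# Toy probability surface, as if predicted after the NAs had been imputed
prob <- raster(clim)          # same geometry as clim, no values yet
values(prob) <- seq(0, 1, length.out = 16)

# Set every cell of prob to NA wherever the predictor layer is NA
prob.masked <- mask(prob, clim)

print(values(prob.masked))
```

With the objects from the question, this would look something like mask(prob_LAZB, xvars[[1]]), provided the first predictor layer carries NA over the ocean.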