Predicting/estimating values using randomForest in R

Time: 2016-01-18 21:12:30

Tags: r data-modeling random-forest prediction

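I want to predict the values of the Pop_avg field for unsurveyed areas based on the surveyed areas. I am using randomForest, following a suggestion on my previous question.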

My surveyed areas:

> surveyed <- read.csv("summer_surveyed.csv", header = TRUE)
> surveyed_1 <- surveyed[, -c(1,2,3,5,6,7,9,10,11,12,13,15)]
> head(surveyed_1, n=1)
  VEGETATION                                        Pop_avg    Acres_1
1 Acer rubrum-Vaccinium corymbosum-Amelanchier spp.       0   27.68884

The areas I have not yet surveyed:

> unsurveyed <- read.csv("summer_unsurveyed.csv", header = TRUE)
> unsurveyed_1 <- unsurveyed[, -c(2,3,5,6,7,9,10,11,12,13,15)]
> head(unsurveyed_1, n=1)
OBJECTID                                       VEGETATION  Pop_avg   Acres_1
      13 Acer rubrum-Vaccinium corymbosum-Amelanchier spp.       0  4.787381

I then removed the rows of unsurveyed_1 whose vegetation types do not occur in surveyed_1, and dropped the unused factor levels.

> setdiff(unsurveyed_1$VEGETATION, surveyed_1$VEGETATION) 

> unsurveyed_1 <- unsurveyed_1[!unsurveyed_1$VEGETATION == "Typha (angustifolia, latifolia) - (Schoenoplectus spp.) Eastern Herbaceous Vegetation", ]
> unsurveyed_1 <- unsurveyed_1[!unsurveyed_1$VEGETATION == "Acer rubrum- Nyssa sylvatica saturated forest alliance",]
> unsurveyed_1 <- unsurveyed_1[!unsurveyed_1$VEGETATION == "Prunus serotina",]

> unsurveyed_drop <- droplevels(unsurveyed_1)
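The three row filters above can also be collapsed into a single step that does not require typing each vegetation name. A minimal sketch on invented data follows; the data frames `train` and `new` and their values are hypothetical stand-ins for surveyed_1 and unsurveyed_1, not the poster's files.

```r
# Hypothetical stand-ins for surveyed_1 and unsurveyed_1 (values invented)
train <- data.frame(
  VEGETATION = factor(c("A", "B", "A")),
  Pop_avg    = c(0, 1.33, 0.67)
)
new <- data.frame(
  OBJECTID   = 1:4,
  VEGETATION = factor(c("A", "C", "B", "D"))
)

# Vegetation types that appear in `new` but were never seen in training
unseen <- setdiff(as.character(new$VEGETATION), as.character(train$VEGETATION))

# Keep only rows whose type was seen in training, then drop the empty levels
keep     <- as.character(new$VEGETATION) %in% as.character(train$VEGETATION)
new_drop <- droplevels(new[keep, ])
```

Comparing as character avoids surprises when the two factors carry different level sets.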

Next, I ran randomForest, predicted on the unsurveyed data, and added the output to unsurveyed_drop:

> surveyed_pred <- randomForest(Pop_avg ~ 
+ VEGETATION+Acres_1,
+ data = surveyed_1,
+ importance = TRUE)

> summer_results <- predict(surveyed_pred, unsurveyed_drop, type="response",
+ norm.votes=TRUE, predict.all=FALSE, proximity=FALSE, nodes=FALSE)

> summer_all <- cbind(unsurveyed_drop, summer_results)
> head(summer_all, n=1)
OBJECTID                                        VEGETATION Pop_avg   Acres_1 summer_results
      13 Acer rubrum-Vaccinium corymbosum-Amelanchier spp.       0  4.787381       0.120077

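I want to estimate the values of the Pop_avg column in summer_all. I assume I need to use the proportions generated in summer_results, but I am not sure how to do that. Thank you for any help or further suggestions.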

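More information: I am looking to get predicted count data for Pop_avg based on VEGETATION and Acres_1. I am not sure whether or how to use the probabilities in my output summer_results to do that, or whether I need to change my model or try a different method.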
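A note on what summer_results contains: when the response is numeric, randomForest fits a regression forest, and predict() with type="response" returns predicted values on the scale of the response rather than class probabilities. The sketch below checks this on synthetic data; the data frame `toy` and its values are invented for illustration, and the model call is skipped if the randomForest package is not installed.

```r
# Synthetic data; column names mirror the post, values are invented
set.seed(1)
toy <- data.frame(
  VEGETATION = factor(sample(c("A", "B", "C"), 60, replace = TRUE)),
  Acres_1    = runif(60, 1, 50)
)
toy$Pop_avg <- rpois(60, 2) / 3   # numeric response, as in the question

fit_type <- NA_character_
toy_pred <- NULL
if (requireNamespace("randomForest", quietly = TRUE)) {
  fit <- randomForest::randomForest(Pop_avg ~ VEGETATION + Acres_1, data = toy)
  fit_type <- fit$type           # "regression" for a numeric response
  toy_pred <- predict(fit, toy)  # numeric predictions on the Pop_avg scale
}
```

If `fit_type` is "regression", the predictions can be read directly as estimated Pop_avg values, with no conversion from probabilities needed.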

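Edit 2: The reason I do not think the output is correct is that Pop_avg ranges from 0.333 upward (where deer were seen), since it is Population divided by 3, and Population ranges from 1 upward (e.g. 10, 20, ...). When I run the model to predict either one, I get similar numbers, ranging from 0.9xx to 2 or 3.xxx, especially when I run it with Population. That does not seem right.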
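One plausible reading of these numbers, offered as an assumption rather than a diagnosis of the actual data: a regression forest predicts averages of the training responses, so if most surveyed polygons have Population equal to 0, the predictions are pulled toward zero even where deer are present. The invented values below illustrate the effect with plain means.

```r
# Invented Population values: mostly zeros, a few sightings
Population <- c(0, 0, 0, 0, 0, 0, 0, 0, 10, 20)
Pop_avg    <- Population / 3   # relationship stated in the post

# Averaging over every polygon, zeros included
mean_all <- mean(Population)

# Averaging only over polygons where deer were seen
mean_positive <- mean(Population[Population > 0])
```

Here `mean_all` is 3 while `mean_positive` is 15, which resembles the gap between the predicted range and the observed counts described above.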

Data:
summer_surveyed_sample

summer_unsurveyed_sample

1 answer:

Answer 0 (score: 1)

My problem lay in my training model. I found that I needed to train on the subset of my surveyed data with Population > 0 to get more accurate predictions.

> surveyed_1 <- surveyed_1[surveyed_1$Population > 0, ]
> surveyed_drop <- droplevels(surveyed_1)
> surveyed_pred <- randomForest(Population ~ 
                VEGETATION+Acres_1,
                data = surveyed_drop,
                importance = TRUE)
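For completeness, here is a sketch of how the retrained model might be applied and converted back to Pop_avg, using the Population / 3 relationship given in the question. All data here are synthetic, and the names `surveyed_toy`, `unsurveyed_toy`, and `pos` are hypothetical. Note that a forest trained only on rows with Population > 0 cannot predict zero, so its output is best read as expected abundance where deer are present.

```r
# Synthetic surveyed data (invented values; column names mirror the post)
set.seed(42)
surveyed_toy <- data.frame(
  VEGETATION = factor(rep(c("A", "B"), 40)),
  Acres_1    = runif(80, 1, 50),
  Population = rpois(80, 0.8) * 10
)

# Train only on polygons where deer were seen, as in the answer
pos <- droplevels(surveyed_toy[surveyed_toy$Population > 0, ])

# Synthetic unsurveyed polygons, using the same factor levels as training
unsurveyed_toy <- data.frame(
  VEGETATION = factor(c("A", "B", "A"), levels = levels(pos$VEGETATION)),
  Acres_1    = c(5, 20, 40)
)

pred_population <- NULL
pred_pop_avg    <- NULL
if (requireNamespace("randomForest", quietly = TRUE)) {
  fit <- randomForest::randomForest(Population ~ VEGETATION + Acres_1, data = pos)
  pred_population <- predict(fit, unsurveyed_toy)  # predicted Population
  pred_pop_avg    <- pred_population / 3           # back to the Pop_avg scale
}
```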