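Wow,
While trying to build a random forest with caret, my training command hung the process for at least 24 hours (after which I killed it). The idea is to correlate land cover data with local temperature, so I have a set of Excel files containing sensor readings and a CSV describing the land cover at each location. This is just a trial run to make sure all the moving parts work, so the data set is very small: only six locations, each with unchanging land cover. That may be the root of the problem, because the hang started after I wrote the following code to build the data set: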
library(openxlsx)
library(caret)
library(randomForest)
library(plyr)
library(dplyr)
library(RStoolbox)
library(ggplot2)
library(doParallel)
library(caTools)
library(gdata)
# Get land cover data
sensor_environment_folder <- "C://etc"
# file.path() inserts the path separator that paste0() would leave out
land_cover <- read.csv(file.path(sensor_environment_folder, "Logger_Simplified.csv"))
# Get the list of sensor logs and split it into two lists of files
sensor_reading_folder <- "C://etc2"
files <- list.files(sensor_reading_folder)
split <- sample.split(files, SplitRatio= 4/5, group=NULL)
training_files <- files[split]
testing_files <- files[!split]
# Load raw data
# file.path() inserts the path separator between folder and file name
for (file in training_files) {
  if (!exists("training_raw")) {
    training_raw <- read.xlsx(file.path(sensor_reading_folder, file), colNames = TRUE)
  } else {
    temp_data <- read.xlsx(file.path(sensor_reading_folder, file), colNames = TRUE)
    training_raw <- rbind(training_raw, temp_data)
    rm(temp_data)
  }
}
for (file in testing_files) {
  if (!exists("testing_raw")) {
    testing_raw <- read.xlsx(file.path(sensor_reading_folder, file), colNames = TRUE)
  } else {
    temp_data <- read.xlsx(file.path(sensor_reading_folder, file), colNames = TRUE)
    testing_raw <- rbind(testing_raw, temp_data)
    rm(temp_data)
  }
}
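The two loading loops follow the same read-and-stack pattern. Here is a minimal sketch of that pattern, assuming a hypothetical helper `read_all()` and using small temporary CSV files in place of the sensor workbooks so that it runs without the real data. With the real files the reader would presumably be `read.xlsx` rather than `read.csv`.

```r
# Hypothetical helper (not from the original post): read every file in
# `paths` with `reader` and stack the results into one data frame.
read_all <- function(paths, reader) {
  frames <- lapply(paths, reader)
  do.call(rbind, frames)
}

# Stand-in data: two small CSV files in a temporary folder.
demo_folder <- file.path(tempdir(), "sensor_demo")
dir.create(demo_folder, showWarnings = FALSE)
write.csv(data.frame(UtilID = "A", TempF = c(52.9, 52.8)),
          file.path(demo_folder, "a.csv"), row.names = FALSE)
write.csv(data.frame(UtilID = "B", TempF = c(51.0, 51.2, 51.4)),
          file.path(demo_folder, "b.csv"), row.names = FALSE)

# full.names = TRUE returns paths that already include the folder,
# so no separator has to be pasted in by hand.
demo_files <- list.files(demo_folder, pattern = "\\.csv$", full.names = TRUE)
demo_raw <- read_all(demo_files, read.csv)
nrow(demo_raw)
```

Reading everything into a list first and binding once also avoids the `exists()` check, which can pick up a stale `training_raw` left over from an earlier run in the same session.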
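Data preparation consists of joining the temperature readings to the land cover data on the shared UtilID (the sensor name) and the XY coordinates, which I then discard: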
# Prepare the data
model_vars<-c("Point_X", "Point_Y", "UtilID", "DateTime", "TempF")
training_merge <- merge(training_raw[model_vars],land_cover)
training_data <- subset(
  training_merge[complete.cases(training_merge), ],
  select = c(TempF, DateTime, S_10000Pct.N.19.11, S_2000Pct.N.19.11))
testing_merge <- merge(testing_raw[model_vars], land_cover)
testing_data <- subset(
  testing_merge[complete.cases(testing_merge), ],
  select = c(TempF, DateTime, S_10000Pct.N.19.11, S_2000Pct.N.19.11))
rm(training_merge, testing_merge)
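Because `merge()` is called without `by`, it joins on every column name the two data frames share. A toy sketch of that behaviour follows, with made-up values and a hypothetical `Cover` column. Only the key column names mirror the ones used above.

```r
# Made-up readings and land cover rows; only the key names mirror the post.
readings <- data.frame(
  UtilID  = c("A", "A", "B", "C"),
  Point_X = c(1, 1, 2, 3),
  Point_Y = c(10, 10, 20, 30),
  TempF   = c(52.9, 52.8, 51.0, 49.5)
)
cover <- data.frame(
  UtilID  = c("A", "B"),
  Point_X = c(1, 2),
  Point_Y = c(10, 20),
  Cover   = c(18.7, 16.0)   # hypothetical land cover percentage
)

# The join keys merge() uses when `by` is not given.
join_keys <- intersect(names(readings), names(cover))

merged <- merge(readings, cover)

# Sensor "C" has no land cover row, so its reading is dropped (inner join).
dropped <- nrow(readings) - nrow(merged)
```

Comparing `nrow()` before and after the merge is a cheap way to see whether rows were lost or duplicated. Numeric keys such as coordinates only match when they are exactly equal, so a rounding difference between the two sources would silently drop readings.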
Up to this point everything runs fine, and the data set looks the way I expect:
> str(training_data)
'data.frame':	431663 obs. of  4 variables:
 $ TempF             : num  52.9 52.8 52.5 52.7 52.6 ...
 $ DateTime          : num  43387 43387 43387 43387 43387 ...
 $ S_10000Pct.N.19.11: num  18.7 18.7 18.7 18.7 18.7 ...
 $ S_2000Pct.N.19.11 : num  16 16 16 16 16 ...
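With six locations whose land cover does not change, each land cover column can hold at most six distinct values across all 431,663 rows. Counting distinct values per column makes that visible. The sketch below runs on a made-up frame (three sensors, four readings each); only the column names are taken from the output above.

```r
# Made-up stand-in for training_data: 3 sensors, 4 readings each.
toy <- data.frame(
  TempF              = c(52.9, 52.8, 52.5, 52.7, 51.0, 51.2, 51.1, 50.9,
                         49.5, 49.7, 49.6, 49.4),
  DateTime           = rep(c(43387.00, 43387.25, 43387.50, 43387.75), times = 3),
  S_10000Pct.N.19.11 = rep(c(18.7, 22.1, 30.4), each = 4),
  S_2000Pct.N.19.11  = rep(c(16.0, 25.3, 41.8), each = 4)
)

# Number of distinct values in each column.
distinct_counts <- sapply(toy, function(col) length(unique(col)))
distinct_counts
```

On the real data the same `sapply()` call could be run on `training_data` directly.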
But when it reaches the training function, it emits the message "" and hangs:
# Train the algorithm
myControl <- trainControl(method = "repeatedcv",
                          number = 2,
                          repeats = 2,
                          returnResamp = "all",
                          allowParallel = TRUE)
mc <- makeCluster(detectCores())
# Call the function with the cluster; the bare name alone registers nothing
registerDoParallel(mc)
set.seed(1999)
learner <- train(TempF ~ DateTime + S_10000Pct.N.19.11 + S_2000Pct.N.19.11,
                 data = training_data,
                 method = "rf",
                 metric = "Rsquared",
                 preProc = c("center", "scale"),
                 trControl = myControl,
                 verbose = TRUE)
stopCluster(mc)
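A rough count of how much work this call asks for may help separate "hung" from "slow". `train()` fits one model per resample per candidate `mtry`, plus a final model on all rows. The sketch below makes two assumptions: randomForest's default of 500 trees per forest, and three `mtry` candidates. Three is an upper bound, since `train()` defaults to `tuneLength = 3` and there are only three predictors. The helper name `count_forest_fits` is hypothetical.

```r
# Hypothetical helper: count the forests and trees a caret "rf" run grows.
count_forest_fits <- function(folds, repeats, n_mtry, ntree = 500) {
  resample_fits <- folds * repeats * n_mtry  # one forest per resample and candidate
  total_fits <- resample_fits + 1            # plus the final model on all rows
  list(forests = total_fits, trees = total_fits * ntree)
}

# 2-fold CV repeated twice, assuming three mtry candidates.
work <- count_forest_fits(folds = 2, repeats = 2, n_mtry = 3)
work$forests
work$trees
```

Under those assumptions the run grows 13 forests, or 6,500 trees. Each resample forest is trained on roughly half of the 431,663 rows and the final one on all of them, so a very long runtime is plausible even if nothing is actually stuck.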
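A few possible root causes have occurred to me:
Because there is so little data, the algorithm keeps searching indefinitely for a way to make accurate predictions (poor thing). It would be nice to be able to tell it to settle for good enough.
Is there something wrong with the training variable names? (I'm new to R syntax, and putting periods in the middle of a name really does look odd.)
There is something strange about the data set itself, which wouldn't surprise me. I have run into a few problems so far, and I dealt with them by picking off the most obvious culprits.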
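The second hypothesis can be checked in base R. A period is a legal character inside an R name, and `make.names()` returns a name unchanged when it is already syntactically valid. A small sketch using the column names from the `str()` output:

```r
nm <- c("TempF", "DateTime", "S_10000Pct.N.19.11", "S_2000Pct.N.19.11")

# make.names() leaves syntactically valid names unchanged.
names_ok <- make.names(nm) == nm

# The same names parse in a formula without backticks.
f <- TempF ~ DateTime + S_10000Pct.N.19.11 + S_2000Pct.N.19.11
all.vars(f)
```

If every element of `names_ok` is `TRUE`, the names themselves are not the issue.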
I have compared this with similar complaints, but the most similar reports seem to come from missing code that I do have.
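One way to test the first hypothesis is to time a deliberately small forest before committing to the full run. The sketch below is not the original model. It calls `randomForest()` directly on synthetic data, with a hypothetical `Cover` predictor, and caps the work through the `ntree`, `sampsize` and `nodesize` arguments. It skips the fit if the package is not installed.

```r
# Synthetic stand-in data; the real training_data is not available here.
set.seed(1999)
n <- 2000
toy <- data.frame(
  DateTime = runif(n, 43387, 43400),
  Cover    = sample(c(18.7, 22.1, 30.4), n, replace = TRUE)  # hypothetical predictor
)
toy$TempF <- 50 + 0.1 * toy$Cover + rnorm(n)

if (requireNamespace("randomForest", quietly = TRUE)) {
  elapsed <- system.time(
    fit <- randomForest::randomForest(
      TempF ~ DateTime + Cover,
      data     = toy,
      ntree    = 50,    # far fewer than the default 500
      sampsize = 500,   # rows drawn for each tree
      nodesize = 20     # larger terminal nodes give shallower trees
    )
  )
  print(elapsed[["elapsed"]])
}
```

Scaling the row count and tree count up in steps gives a runtime estimate for the full data. With caret, extra arguments such as `ntree` are passed through `train()` to the underlying `randomForest()` call, and a one-row `tuneGrid = data.frame(mtry = 2)` restricts the search to a single candidate.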