train function hangs when using R caret

Time: 2019-07-14 06:44:33

Tags: r machine-learning r-caret

Hi,

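I'm trying to build a random forest with caret, and my train() call hangs for at least 24 hours (after which I killed it). The idea is to correlate land cover data with local temperature, so I have a set of Excel files containing sensor readings and a CSV describing the land cover at each location. This is only a trial run to make sure all the moving parts work, so the dataset is very small: just six locations, each with constant land cover conditions. That may be the root of the problem, because the hang started after I wrote the following code to build the dataset: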

library(openxlsx)
library(caret)
library(randomForest)
library(plyr)
library(dplyr)
library(RStoolbox)
library(ggplot2)
library(doParallel)
library(caTools)
library(gdata)

# Get land cover data
sensor_environment_folder <- "C://etc"
land_cover <- read.csv(paste0(sensor_environment_folder, "Logger_Simplified.csv"))

# Get the list of sensor logs and split it into two lists of files
sensor_reading_folder <- "C://etc2"
files <- list.files(sensor_reading_folder)
split <- sample.split(files, SplitRatio= 4/5, group=NULL)
training_files <- files[split]
testing_files <- files[!split]

# Load raw data

for(file in training_files) {
  if(!exists("training_raw")) {
    training_raw <- read.xlsx(paste0(sensor_reading_folder,file), colNames=TRUE)
  } else {
    temp_data <- read.xlsx(paste0(sensor_reading_folder,file), colNames=TRUE)
    training_raw <- rbind(training_raw,temp_data)
    rm(temp_data)
  }
}

for(file in testing_files) {
  if(!exists("testing_raw")) {
    testing_raw <- read.xlsx(paste0(sensor_reading_folder,file), colNames=TRUE)
  } else {
    temp_data <- read.xlsx(paste0(sensor_reading_folder,file), colNames=TRUE)
    testing_raw <- rbind(testing_raw, temp_data)
    rm(temp_data)
  }
}
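The two loading loops above repeat the same read-and-append pattern, which can be collapsed into one helper. This is only a minimal sketch, assuming every file has identical columns. `read_all` is a hypothetical helper name, and `read.csv` on temporary files stands in for `read.xlsx` so the snippet runs without the original data. Note also that `paste0()` inserts no separator, so `paste0(sensor_reading_folder, file)` only yields a valid path if the folder string ends in a slash; `list.files(..., full.names = TRUE)` sidesteps that.

```r
# Hypothetical helper: read every file in `paths` with `reader` and stack the rows.
read_all <- function(paths, reader = read.csv) {
  frames <- lapply(paths, reader)   # one data frame per file
  do.call(rbind, frames)            # assumes identical columns in every file
}

# Stand-in data (made up): two small CSV files in a temporary folder.
demo_folder <- file.path(tempdir(), "sensor_demo")
dir.create(demo_folder, showWarnings = FALSE)
write.csv(data.frame(UtilID = "A", TempF = c(52.9, 52.8)),
          file.path(demo_folder, "a.csv"), row.names = FALSE)
write.csv(data.frame(UtilID = "B", TempF = 51.0),
          file.path(demo_folder, "b.csv"), row.names = FALSE)

# full.names = TRUE returns complete paths, so no manual paste0() is needed.
demo_files <- list.files(demo_folder, pattern = "\\.csv$", full.names = TRUE)
demo_raw <- read_all(demo_files)
```

With the real data the call would look something like `read_all(training_files, function(p) read.xlsx(p, colNames = TRUE))`, provided the file list was built with `full.names = TRUE`.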

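Data preparation consists of joining the temperature readings to the land cover data on the shared UtilID (the sensor name) and the XY coordinates, which I then drop: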

# Prepare the data

model_vars<-c("Point_X", "Point_Y", "UtilID", "DateTime", "TempF")
training_merge <- merge(training_raw[model_vars],land_cover)
training_data <- subset(
              training_merge[complete.cases(training_merge),],
              select=c(TempF,DateTime, S_10000Pct.N.19.11, S_2000Pct.N.19.11))
testing_merge <- merge(testing_raw[model_vars],land_cover)
testing_data <- subset(
              testing_merge[complete.cases(testing_merge),],
              select=c(TempF,DateTime, S_10000Pct.N.19.11, S_2000Pct.N.19.11))

rm(training_merge, testing_merge)

Up to this point everything runs fine, and the dataset looks the way I expect:

>str(training_data)
'data.frame':   431663 obs. of  4 variables:
 $ TempF             : num  52.9 52.8 52.5 52.7 52.6 ...
 $ DateTime          : num  43387 43387 43387 43387 43387 ...
 $ S_10000Pct.N.19.11: num  18.7 18.7 18.7 18.7 18.7 ...
 $ S_2000Pct.N.19.11 : num  16 16 16 16 16 ...
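The `str()` output is worth setting against the "six locations" description: there are 431,663 rows, but if land cover is constant per location, the two land cover columns can only take a handful of distinct value pairs. A small sketch of how that could be checked, using made-up stand-in data (`demo_data` is hypothetical and far smaller than the real `training_data`):

```r
# Made-up stand-in for training_data: 2 locations, 3 readings each.
demo_data <- data.frame(
  TempF              = c(52.9, 52.8, 52.5, 60.1, 60.3, 60.2),
  DateTime           = c(43387.0, 43387.1, 43387.2, 43387.0, 43387.1, 43387.2),
  S_10000Pct.N.19.11 = c(18.7, 18.7, 18.7, 25.0, 25.0, 25.0),
  S_2000Pct.N.19.11  = c(16, 16, 16, 30, 30, 30)
)

land_cols <- c("S_10000Pct.N.19.11", "S_2000Pct.N.19.11")

n_rows <- nrow(demo_data)                             # rows the model must process
n_land_combos <- nrow(unique(demo_data[land_cols]))   # distinct land cover pairs
```

Run against the real data frame, a large `n_rows` next to a tiny `n_land_combos` would suggest that the training cost comes from the row count, not from the number of locations.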

But when it reaches the training function, it emits the message "" and hangs:

# Train the algorithm

myControl <- trainControl(method="repeatedcv",
                          number=2,
                          repeats=2,
                          returnResamp='all',
                          allowParallel=TRUE)

mc <- makeCluster(detectCores())
registerDoParallel

set.seed(1999)

learner <- train(TempF ~ DateTime + S_10000Pct.N.19.11 + S_2000Pct.N.19.11,
                  data=training_data,
                  method="rf",
                  metric="Rsquared",
                  preProc=c("center","scale"),
                  trControl=myControl,
                  verbose = TRUE)

stopCluster(mc)
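One thing worth checking in the block above: `registerDoParallel` appears without parentheses or an argument. In R that evaluates to the function object instead of calling it, so the cluster `mc` may never be registered as the foreach backend. A minimal sketch of registering a cluster and verifying it, assuming the doParallel package is installed; the two workers and the name `demo_cluster` are arbitrary choices for the demo:

```r
library(doParallel)   # also attaches foreach and parallel

demo_cluster <- makeCluster(2)      # 2 workers: arbitrary for this demo
registerDoParallel(demo_cluster)    # an actual call, with the cluster as argument

n_workers <- getDoParWorkers()      # number of workers foreach will use
backend <- getDoParName()           # name of the registered backend

stopCluster(demo_cluster)
registerDoSEQ()                     # return foreach to sequential execution
```

Whether this explains the hang is not certain. Each worker in such a cluster receives its own copy of the training data, so with 431,663 rows and `detectCores()` workers, memory use may also be worth watching.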

A few possible root causes have occurred to me:

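  • With so little data, the algorithm searches indefinitely for an accurate prediction (poor thing). I'd be happy to be able to tell it to give up on that kind of thing.

  • Is something wrong with the training variable names? (I'm new to R syntax, and putting periods in the middle of a name does look ridiculous.)

  • Something is odd about the dataset itself, which would not surprise me. I've run into a few problems so far, and I've responded by picking off the most obvious culprits.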
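The second hypothesis can be tested directly in base R, since a period is a legal character in an R name. A minimal sketch: `make.names()` rewrites any string that is not a syntactically valid name, so a name that comes back unchanged is valid. The vector below simply repeats the column names shown in the `str()` output.

```r
col_names <- c("TempF", "DateTime", "S_10000Pct.N.19.11", "S_2000Pct.N.19.11")

# TRUE where make.names() had nothing to fix, i.e. the name is already valid.
is_syntactic <- make.names(col_names) == col_names
```

On the first hypothesis, a random forest does not keep searching until it reaches some accuracy: `randomForest` grows a fixed number of trees (`ntree`, 500 by default). A long run is therefore more likely a matter of how many rows each tree has to process and how many model fits the resampling scheme requests.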

I've compared this with similar complaints, but the most similar report seems to have been caused by missing code that I do have.

0 answers:

No answers yet