Problem with the h2o package in R when using subsetted data frames, leading to near-perfect prediction accuracy

Posted: 2018-11-23 19:07:32

Tags: r h2o

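I have been stuck on this problem for a long time and cannot figure it out. I believe it stems from subsets of data.frame objects retaining information about their parent, and I suspect this causes trouble when I train an h2o.deeplearning model on what I think is only my training set (though that may not be the case). See the sample code below. I added comments to clarify what I am doing, but the code is fairly short: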

The problem is that when I evaluate this on the test subset, I get almost 0% error:

dataset = read.csv("dataset.csv")[,-1] # Read dataset in but omit the first column (it's just an index from the original data)
y = dataset[,1] # Create response
X = dataset[,-1] # Create regressors

X = model.matrix(y~.,data=dataset) # Automatically create dummy variables
y=as.factor(y) # Ensure y has factor data type
dataset = data.frame(y,X) # Create final data.frame dataset

train = sample(length(y),length(y)/1.66) # Create training indices (integer row positions, not booleans)
test = (-train) # Create testing indices (negative indices select the remaining rows)

h2o.init(nthreads=2) # Initiate h2o

# BELOW: Create h2o.deeplearning model with subset of dataset.
mlModel = h2o.deeplearning(y='y',training_frame=as.h2o(dataset[train,,drop=TRUE]),activation="Rectifier",
                           hidden=c(6,6),epochs=10,train_samples_per_iteration = -2)


predictions = h2o.predict(mlModel,newdata=as.h2o(dataset[test,-1])) # Predict using mlModel
predictions = as.data.frame(predictions) # Convert predictions to dataframe object. as.vector() caused issues for me
predictions = predictions[,1] # Extract predictions

mean(predictions!=y[test]) 

Has anyone run into this problem? Any ideas on how to mitigate it?

1 Answer:

Answer 0: (score: 1)

It is more efficient to load and split the data with H2O's own functions.

library(h2o)
h2o.init(nthreads=2) # Start (or connect to) a local H2O cluster

data = h2o.importFile("dataset.csv")
y = 2 #Response is 2nd column, first is an index
x = 3:(ncol(data))  #Learn from all the other columns
data[,y] = as.factor(data[,y]) #Treat the response as categorical

parts = h2o.splitFrame(data, 0.8)  #Split 80/20
train = parts[[1]]
test = parts[[2]]

# BELOW: Create h2o.deeplearning model using the training frame only.
mlModel = h2o.deeplearning(x=x, y=y, training_frame=train,activation="Rectifier",
                           hidden=c(6,6),epochs=10,train_samples_per_iteration = -2)

h2o.performance(mlModel, test) #Evaluate on the held-out 20%

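It is hard to say what went wrong in the original code without seeing the contents of dataset.csv and being able to try it out. My guess is that the training and test data were not actually separated, so the model was effectively trained on the test data.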
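
Two checks may help narrow this down without H2O. The sketch below uses plain R and a toy data frame standing in for dataset.csv; the column names `label`, `a` and `b` are made up, since the real file is not shown. It first checks that the index-based split from the question is disjoint, then checks whether the response column ends up among the predictors. The second check is only a hypothesis: it assumes the response column in the CSV is not literally named `y`, in which case `model.matrix(y~.,data=dataset)` takes `y` from the workspace and `.` expands to every column of `dataset`, the response included.

```r
# Toy stand-in for dataset.csv (hypothetical column names; the real file is not shown)
set.seed(42)
dataset <- data.frame(label = rep(c(0, 1), 10), a = rnorm(20), b = rnorm(20))

y <- dataset[, 1]  # Response, extracted the same way as in the question

# Check 1: sample() returns integer row positions; negating them selects the complement
train <- sample(length(y), floor(length(y) / 1.66))
test_rows <- seq_along(y)[-train]
overlap <- intersect(train, test_rows)  # Rows that appear in both train and test

# Check 2: `y` is found in the workspace, so `.` expands to ALL columns of `dataset`
X_leaky <- model.matrix(y ~ ., data = dataset)
leaked <- "label" %in% colnames(X_leaky)  # Is the response among the predictors?

# One way to avoid that: name the response column in the formula itself
X_clean <- model.matrix(label ~ ., data = dataset)
still_leaked <- "label" %in% colnames(X_clean)

print(length(overlap))
print(colnames(X_leaky))
print(colnames(X_clean))
```

If `overlap` is empty but the response column shows up in `colnames(X_leaky)`, the split itself is fine and the model is simply being handed the answer as a feature, which would also explain an error rate near 0%.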