我在这个问题上困扰了很长时间,无法解决。我认为问题源于保留父信息的data.frame对象的子集,但我也认为在根据我认为只是我的训练集训练h2o.deeplearning模型时,这会引起问题(尽管可能并非如此)。请参阅下面的示例代码。我添加了注释以澄清我在做什么,但是它的代码很短:
input
.peek((key, value) ->{...}
.map((key, value) -> {...}
.groupByKey()
.windowedBy(TimeWindows.of(5000))
.aggregate(Initializer, Aggregator, Materialized) // disable caching via Materialized
.toStream()
.foreach(...) // react to every update to the KTable
问题是,如果我根据测试子集对此进行评估,则会得到几乎0%的错误:
dataset = read.csv("dataset.csv")[,-1] # Read dataset in but omit the first column (it's just an index from the original data)
y = dataset[,1] # Create response
X = dataset[,-1] # Create regressors
X = model.matrix(y~.,data=dataset) # Automatically create dummy variables
y=as.factor(y) # Ensure y has factor data type
dataset = data.frame(y,X) # Create final data.frame dataset
train = sample(length(y),length(y)/1.66) # Create training indices -- A boolean
test = (-train) # Create testing indices
h2o.init(nthreads=2) # Initiate h2o
# BELOW: Create h2o.deeplearning model with subset of dataset.
mlModel = h2o.deeplearning(y='y',training_frame=as.h2o(dataset[train,,drop=TRUE]),activation="Rectifier",
hidden=c(6,6),epochs=10,train_samples_per_iteration = -2)
predictions = h2o.predict(mlModel,newdata=as.h2o(dataset[test,-1])) # Predict using mlModel
predictions = as.data.frame(predictions) # Convert predictions to dataframe object. as.vector() caused issues for me
predictions = predictions[,1] # Extract predictions
mean(predictions!=y[test])
有人遇到过这个问题吗?有减轻该问题的想法吗?
答案 0 :(得分:1)
使用H2O功能加载数据并拆分数据将更加有效。
data = h2o.importFile("dataset.csv")
y = 2 #Response is 2nd column, first is an index
x = 3:(ncol(data)) #Learn from all the other columns
data[,y] = as.factor(data[,y])
parts = h2o.splitFrame(data, 0.8) #Split 80/20
train = parts[[1]]
test = parts[[2]]
# BELOW: Create h2o.deeplearning model with subset of dataset.
mlModel = h2o.deeplearning(x=x, y=y, training_frame=train,activation="Rectifier",
hidden=c(6,6),epochs=10,train_samples_per_iteration = -2)
h2o.performance(mlModel, test)
很难说出原始代码的问题所在,而无需查看dataset.csv的内容并能够尝试。我的猜测是训练和测试没有分开,实际上是根据测试数据进行训练。