On my first attempt at using a random forest classifier, I am getting a series of different tracebacks.
The data I am using contains 27 parameters and one "result" column, which is used to train the model. My first attempt used 11,000 rows, which I treated as a test dataset, because in practice I expect to see datasets closer to 1,200 rows. I received the following error:

      File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\gaussian_process\gaussian_process.py", line 53, in l1_cross_distances
        D = np.zeros((n_nonzero_cross_dist, n_features))
    ValueError: array is too big.

So I reduced the data file to 5k rows and received the following error:

      File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\gaussian_process\gaussian_process.py", line 53, in l1_cross_distances
        D = np.zeros((n_nonzero_cross_dist, n_features))
    MemoryError

Finally I reduced the data file to 1k rows, and I still get an error, although a different one from before:

      File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\gaussian_process\gaussian_process.py", line 309, in fit
        raise Exception("Multiple input features cannot have the same"
    Exception: Multiple input features cannot have the same value.

I think this has something to do with the cross-validation function I am using:

    import numpy as np
    from random import shuffle
    from sklearn import metrics

    # Using a custom cross-validation function because
    # the one in sklearn 0.14.1 has a bug. Otherwise I would have used
    # sklearn.cross_validation.cross_val_score.
    def crossValidation(model, X, Y, nfolds=10):
        """
        Performs k-fold cross-validation. Takes as arguments an arbitrary
        sklearn model, a training dataset (X, Y) and the number of folds.
        Returns the mean squared error averaged over the folds.
        """
        n = X.shape[0]
        # Shuffle the row indices so the folds are drawn at random
        r = list(range(n))
        shuffle(r)
        scores = list()
        X_folds = np.array_split(X[r], nfolds)
        Y_folds = np.array_split(Y[r], nfolds)
        for k in range(nfolds):
            # We use 'list' to copy, in order to 'pop' later on
            X_train = list(X_folds)
            X_test = X_train.pop(k)
            X_train = np.concatenate(X_train)
            Y_train = list(Y_folds)
            Y_test = Y_train.pop(k)
            Y_train = np.concatenate(Y_train)
            # Fit on the remaining folds and score on the held-out fold
            model.fit(X_train, Y_train)
            y = model.predict(X_test)
            score = metrics.mean_squared_error(Y_test, y)
            scores.append(score)
        return np.mean(scores)

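Any thoughts or suggestions would be appreciated. Please note that this is my first attempt at running a random forest classifier, so I may well have made some beginner's mistakes.
Unfortunately, for confidentiality reasons I cannot provide the full code, but I hope the following snippet conveys the shapes of X_train and Y_train when they are passed to model.fit: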

    import csv
    import numpy as np

    def readCSV(path):
        """
        Read a '|'-delimited CSV file of floats, skipping the header row.
        """
        data = []
        with open(path) as f:
            mycsv = csv.reader(f, delimiter="|")
            for counter, row in enumerate(mycsv):
                # Row 0 is the header, so keep only the rows after it
                if counter != 0:
                    data.append(row)
        return np.asarray(data, dtype=np.float32)

    data = readCSV("FullUnMergedDataWSPSR14TEST4RFDO.csv")
    X = data[:, :27]  # the 27 parameter columns (indices 0 to 26)
    Y = data[:, 27]   # the "result" column

Answer 0 (score: 2)
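You should print and check the values of n_nonzero_cross_dist and n_features in the first snippet. A 2D numpy array of shape (n_nonzero_cross_dist, n_features) has an expected size of n_nonzero_cross_dist * n_features * 8 / 1e9 GB (for an 8-byte dtype such as np.float64, which is the default for np.zeros, or a 64-bit integer). You can work out the memory required for your problem's dimensions yourself.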
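
As a rough sketch of that calculation, the snippet below estimates the array size for the three row counts mentioned in the question. It assumes that n_nonzero_cross_dist is the number of unordered pairs of samples, n * (n - 1) / 2, which is what a pairwise-distance routine would need; that formula is an assumption here, so check the printed value in your own run. The function name pairwise_array_gb is hypothetical, and 27 features is taken from the question's description.

```python
def pairwise_array_gb(n_samples, n_features, bytes_per_item=8):
    """Estimate the size in GB of an (n_pairs, n_features) array.

    Assumption: the array has one row per unordered pair of samples,
    i.e. n_pairs = n_samples * (n_samples - 1) / 2.
    """
    n_pairs = n_samples * (n_samples - 1) // 2
    return n_pairs * n_features * bytes_per_item / 1e9


# Row counts tried in the question, with the 27 parameters it describes
for rows in (11000, 5000, 1000):
    print(rows, "rows ->", round(pairwise_array_gb(rows, 27), 2), "GB")
```

Under these assumptions the estimate is roughly 13 GB for 11,000 rows and 2.7 GB for 5,000 rows, which would be consistent with the two allocation errors on a 32-bit (win32) Python build, while 1,000 rows needs only about 0.1 GB.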
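Also, the title and description of your question are misleading or incorrect: the errors you posted come from the Gaussian process model, not from a random forest.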
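
One way to confirm which estimator is really being fitted is to print its class and module before cross-validating. The sketch below uses synthetic stand-in data (hypothetical, since the real file is confidential) and a recent scikit-learn; the parameter values are illustrative only and not a recommendation for your data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (hypothetical): 200 rows, 27 features, binary result
rng = np.random.RandomState(0)
X = rng.rand(200, 27).astype(np.float32)
Y = (X[:, 0] > 0.5).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0)

# Check which estimator is actually in use; a random forest lives under
# sklearn.ensemble, not sklearn.gaussian_process
model_module = type(model).__module__
print(type(model).__name__, model_module)

# Fit on the first 150 rows and predict the remaining 50
model.fit(X[:150], Y[:150])
predictions = model.predict(X[150:])
print(predictions.shape)
```

If the printed module for your own model starts with sklearn.gaussian_process, then the object passed to your cross-validation function is not a random forest, which would explain the tracebacks.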