Gaussian process algorithm errors: a memory problem with numpy?

Time: 2014-01-08 11:37:44

Tags: python python-2.7 numpy scikit-learn gaussian

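On my first attempt at using a random forest classifier, I received several different tracebacks.

The data I am using has 27 parameters and one "result" column, which is used to train the model. My first attempt used 11,000 rows, which I put together as a test dataset, because in practice I expect to see datasets closer to 1,200 rows. But I received the following error: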

File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\gaussian_process\gaussian_process.py", line 53, in l1_cross_distances
D = np.zeros((n_nonzero_cross_dist, n_features))
ValueError: array is too big.

So I reduced the data file to 5k rows and received the following error:

File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\gaussian_process\gaussian_process.py", line 53, in l1_cross_distances
D = np.zeros((n_nonzero_cross_dist, n_features))
MemoryError

Finally I reduced the data file to 1k rows and still received an error, a different one from before:

File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\gaussian_process\gaussian_process.py", line 309, in fit
raise Exception("Multiple input features cannot have the same"
Exception: Multiple input features cannot have the same value.

I think this has something to do with the cross-validation function I am using:

from random import shuffle

import numpy as np
from sklearn import metrics

# Using a custom cross-validation function because
# the one in sklearn 0.14.1 has a bug. Otherwise I would have used
# sklearn.cross_validation.cross_val_score.
def crossValidation(model, X, Y, nfolds=10):
    """
    Performs k-fold cross-validation. Takes as arguments an arbitrary
    sklearn model, a training dataset (X, Y) and the number of folds.
    """
    n = X.shape[0]  # use the X argument, not the global `data`
    r = range(n)
    shuffle(r)
    scores = list()
    X_folds = np.array_split(X[r], nfolds)
    Y_folds = np.array_split(Y[r], nfolds)
    for k in range(nfolds):
        # We use 'list' to copy, in order to 'pop' later on
        X_train = list(X_folds)
        X_test  = X_train.pop(k)
        X_train = np.concatenate(X_train)
        Y_train = list(Y_folds)
        Y_test  = Y_train.pop(k)
        Y_train = np.concatenate(Y_train)
        model.fit(X_train, Y_train)
        y = model.predict(X_test)
        score = metrics.mean_squared_error(y, Y_test)
        scores.append(score)
    return np.mean(scores)

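Any ideas or suggestions would be appreciated. Please note that this is my first attempt at running a random forest classifier, so I may have made some beginner's mistakes.

Edit in response to comments:

Unfortunately I cannot share the full code for confidentiality reasons, but I hope the following snippets are enough to show the shapes of X_train and Y_train when they are passed to model.fit.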

Snippet1

def readCSV(path):
    """
    Read a CSV file of floats, dropping the first (header) row.
    """
    data = []
    mycsv = csv.reader(open(path), delimiter="|")
    for counter, row in enumerate(mycsv):
        if counter != 0:
            data.append(row)
    return np.asarray(data, dtype=np.float32)
print np.asarray

Snippet2

data = readCSV("FullUnMergedDataWSPSR14TEST4RFDO.csv")
X = data[0:,:26]
Y = data[:, 27]
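
For reference, the shapes these slices produce can be checked on a synthetic array. This is a minimal sketch that assumes the file has 28 columns (27 parameters plus the result column, as described above); the array `demo` is made up and stands in for the confidential data.

```python
import numpy as np

# Hypothetical stand-in for the real file: 5 rows, 28 columns
# (27 parameters + 1 result column). This layout is an assumption.
n_rows, n_cols = 5, 28
demo = np.arange(n_rows * n_cols, dtype=np.float32).reshape(n_rows, n_cols)

# Same slices as in Snippet2
X_demo = demo[0:, :26]   # columns 0..25 -> 26 feature columns
Y_demo = demo[:, 27]     # column 27 only -> 1-D target vector

print(X_demo.shape)
print(Y_demo.shape)
```

Under that assumption, column index 26 ends up in neither X nor Y, which may or may not be intended.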

1 Answer:

Answer 0 (score: 2)

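You should print and inspect the values of n_nonzero_cross_dist and n_features from the first traceback. The expected size of a 2D numpy array of shape (n_nonzero_cross_dist, n_features) is n_nonzero_cross_dist * n_features * 8 / 1e9 GB for an 8-byte dtype such as np.float64 (the default for np.zeros) or np.int64. You can work out for yourself how much memory your problem dimensions require.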
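
To make the estimate concrete, here is a minimal sketch of that arithmetic. It assumes that `l1_cross_distances` allocates one row per pair of training samples, i.e. `n_samples * (n_samples - 1) / 2` rows; that is my reading of the function and should be checked against your installed scikit-learn source. The helper name `estimate_cross_distance_gb` is hypothetical, and the feature count of 26 is taken from the `data[0:,:26]` slice in Snippet2.

```python
def estimate_cross_distance_gb(n_samples, n_features, itemsize=8):
    """Estimate the size in GB of the pairwise-distance array.

    Assumption: one row per unordered pair of samples, one column per
    feature, `itemsize` bytes per element (8 for np.float64).
    """
    n_pairs = n_samples * (n_samples - 1) // 2
    return n_pairs * n_features * itemsize / 1e9


# Row counts tried in the question; 26 features as in Snippet2
for n in (11000, 5000, 1000):
    gb = estimate_cross_distance_gb(n, 26)
    print("%d rows -> %.2f GB" % (n, gb))
```

Under these assumptions the three cases come to roughly 12.6 GB, 2.6 GB and 0.1 GB. The traceback shows a win32 build, and a 32-bit process can typically address only about 2 GB, which would be consistent with the first two runs failing on allocation while the 1k-row run gets past it.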

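Also, the title and description of your question are misleading or incorrect: the errors you show come from the Gaussian process model, not from a random forest.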
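
The third traceback is a different kind of failure. As far as I can tell from the 0.14.1 source, `fit` raises that exception when two training rows have identical feature values (zero distance between them), so it is worth checking X for duplicate rows. A minimal sketch, assuming NumPy 1.13 or later for the `axis` argument of `np.unique`; the helper names `count_duplicate_rows` and `drop_duplicate_rows` are hypothetical.

```python
import numpy as np


def count_duplicate_rows(X):
    """Return how many rows of X repeat an earlier row exactly."""
    X = np.asarray(X)
    unique_rows = np.unique(X, axis=0)
    return X.shape[0] - unique_rows.shape[0]


def drop_duplicate_rows(X, Y):
    """Keep the first occurrence of each distinct row of X, with its Y."""
    X = np.asarray(X)
    Y = np.asarray(Y)
    # return_index gives the index of the first occurrence of each unique row
    _, first_idx = np.unique(X, axis=0, return_index=True)
    keep = np.sort(first_idx)  # restore the original row order
    return X[keep], Y[keep]


# Tiny made-up example: the third row repeats the first
X_demo = np.array([[1.0, 2.0], [3.0, 4.0], [1.0, 2.0]])
Y_demo = np.array([10.0, 20.0, 10.0])

print(count_duplicate_rows(X_demo))
X_clean, Y_clean = drop_duplicate_rows(X_demo, Y_demo)
print(X_clean.shape, Y_clean.shape)
```

Dropping duplicates changes the training set, so whether that is acceptable depends on your data.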