Sklearn K-Fold Cross Validation Memory Issues

Date: 2015-07-08 15:43:03

Tags: python memory scikit-learn cross-validation

I'm trying to run some supervised experiments with a simple text classifier, but I'm running into memory issues when using the K-Fold generator in scikit-learn. The error I get states: "Your system has run out of application memory", yet my dataset is only ~245K rows x ~81K columns. Large-ish, sure, but not huge. The program never terminates; it just hangs until I manually shut down the Terminal app. I've let it run like this for about 30 minutes with no progress.

I've also added print statements to see where in the cross-validation for-loop the code gets stuck. It looks like the indices for the training and test sets are generated, but the code never gets to the point of slicing out the actual training and test sets for features and labels using those indices. I'm running this on a MacBook Pro running OS X 10.9.5. I've run it after shutting down every other app except Terminal, with no success. Has anyone else had problems with this, or is this likely something specific to my machine?
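For reference, here's a simplified sketch of the kind of loop I'm describing (synthetic data and illustrative names, not my exact code):

```python
import numpy as np
from sklearn.cross_validation import KFold  # sklearn.model_selection.KFold in >= 0.18

# Small stand-in for the real ~245K x ~81K feature matrix
X = np.random.rand(1000, 500)
y = np.random.randint(2, size=1000)

kf = KFold(X.shape[0], n_folds=10)
for fold, (train_idx, test_idx) in enumerate(kf):
    print("fold %d: indices generated" % fold)   # my prints get this far
    X_train, X_test = X[train_idx], X[test_idx]  # ...but never past this slicing
    y_train, y_test = y[train_idx], y[test_idx]
```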

EDIT: I've run this with 10-fold and 5-fold cross validation and run into the same problems each time.

1 answer:

Answer 0 (score: 4):

I think the first problem comes from this part:

  

  my dataset is only ~245K rows x ~81K columns. Large-ish, sure, but not huge.

245K x 80K may not sound huge, but let's just do the math and assume each element takes 8 bytes to store. If your matrix were not sparse (and in your case it apparently is a sparse matrix), that would be 245 * 80 * 8 MB, roughly 160 GB, that needs to be stored in RAM. That is actually huge!
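To sanity-check that figure in plain Python (assuming numpy's default 8-byte float64):

```python
n_samples = 245000
n_features = 80000
bytes_per_element = 8  # numpy float64

total_bytes = n_samples * n_features * bytes_per_element
print("%.1f GB" % (total_bytes / 1e9))  # ~156.8 GB for a fully dense matrix
```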

You mention text classification, so I'm guessing your features are tf-idf values or word counts, which are very sparse. What you need to be careful about now is preserving that sparsity at every step, and using only algorithms that work on sparse data and do not allocate a dense matrix of size n_samples * n_features.
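As a quick illustration of what "staying sparse" means here (toy documents; TfidfVectorizer returns a scipy.sparse matrix by default):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog ate my homework",
        "cats and dogs and mats"]
X = TfidfVectorizer().fit_transform(docs)

print(type(X))          # scipy.sparse CSR matrix, not a dense ndarray
print(X.shape, X.nnz)   # only the non-zero entries are actually stored

# Row indexing with an index array keeps the result sparse,
# so K-Fold slicing stays cheap:
X_sub = X[[0, 2]]
print(type(X_sub))

# X.toarray() would materialize the full dense matrix -- avoid it on big data
```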

Naive Bayes classifiers (e.g. sklearn.naive_bayes.MultinomialNB) have had decent success with text classification; I would start there.

Such a classifier can easily handle a 250K x 80K matrix, as long as it is a sparse matrix (and is actually sparse enough, of course).
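A minimal sketch of that combination on toy data (names and parameters are illustrative; uses the pre-0.18 sklearn.cross_validation API):

```python
import numpy as np
from sklearn.cross_validation import KFold   # sklearn.model_selection in >= 0.18
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["good movie", "bad movie", "great film", "terrible film"] * 50
y = np.array([1, 0, 1, 0] * 50)

X = TfidfVectorizer().fit_transform(docs)    # sparse CSR matrix

kf = KFold(X.shape[0], n_folds=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf:
    clf = MultinomialNB()
    clf.fit(X[train_idx], y[train_idx])      # fits directly on the sparse matrix
    print(clf.score(X[test_idx], y[test_idx]))
```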

If you still want to reduce the number of features you get from tf-idf, you have several options:

  1. Remove stop words, either with a stop-word list or by setting the max_df parameter to 0.7 or a lower value (this discards any term that appears in more than 70% of the documents).
  2. Apply feature selection before training the classifier. This scikit-learn example shows how to use the chi-squared statistic to select features on sparse data.
  3. Apply a dimensionality reduction technique such as SVD (I would look into Latent semantic indexing, but I'm no expert on it).
  4. Options 1. and 2. combined should already let you cut the number of features significantly; a sketch of that combination follows this list.
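A minimal sketch of options 1. and 2. together (toy documents; k is an illustrative value you would tune):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["good movie", "bad movie", "great film", "terrible film"] * 50
y = [1, 0, 1, 0] * 50

# Option 1: drop stop words and overly frequent terms while vectorizing
vec = TfidfVectorizer(stop_words="english", max_df=0.7)
X = vec.fit_transform(docs)

# Option 2: keep only the k features most associated with the labels
selector = SelectKBest(chi2, k=4)
X_reduced = selector.fit_transform(X, y)     # input and output both stay sparse
print(X.shape, "->", X_reduced.shape)
```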

Let me know if this helps.