Question

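I am trying to combine all of my vectorized features. Some of them are represented as sparse matrices. When I use hstack to combine all of the features, I get an error.

Code: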

from scipy.sparse import hstack
a_train = hstack((categories_one_hot_train, sub_categories_one_hot_tr, text_bow_train, price_standardized_tr,title_bow_train))

a_test  = hstack((categories_one_hot_test, sub_categories_one_hot_test, text_bow_test, price_standardized_test,title_bow_test))

b_train = hstack((categories_one_hot_train, sub_categories_one_hot_tr,text_tfidf_train, price_standardized_tr,title_tfidf_train))

b_test  = hstack((categories_one_hot_test, sub_categories_one_hot_test,text_tfidf_test, price_standardized_test,title_tfidf_test))

c_train = hstack((categories_one_hot_train , sub_categories_one_hot_tr,avg_w2v_vectors_train, price_standardized_tr,avg_w2v_vectors_title_train))

c_test  = hstack((categories_one_hot_test , sub_categories_one_hot_test,avg_w2v_vectors_test, price_standardized_test,avg_w2v_vectors_title_test))

d_train = hstack((categories_one_hot_train, sub_categories_one_hot_tr,tfidf_w2v_vectors_train, price_standardized_tr,tfidf_w2v_vectors_title_train))

d_test  = hstack((categories_one_hot_test, sub_categories_one_hot_test,tfidf_w2v_vectors_test, price_standardized_test,tfidf_w2v_vectors_title_test))

Error message:

MemoryError                               Traceback (most recent call last)
<ipython-input-55-b8d41d748e49> in <module>()
     17 set2_test  = hstack((categories_one_hot_test, sub_categories_one_hot_test,text_tfidf_test, price_standardized_test,title_tfidf_test))
     18 #set3 avg word2vec
---> 19 set3_train = hstack((categories_one_hot_train , sub_categories_one_hot_tr,avg_w2v_vectors_train, price_standardized_tr,avg_w2v_vectors_title_train))
     20 set3_test  = hstack((categories_one_hot_test , sub_categories_one_hot_test,avg_w2v_vectors_test, price_standardized_test,avg_w2v_vectors_title_test))
     21 #set4 tfidf word2vec

~/.local/lib/python3.6/site-packages/scipy/sparse/construct.py in hstack(blocks, format, dtype)
    463 
    464     """
--> 465     return bmat([blocks], format=format, dtype=dtype)
    466 
    467 

~/.local/lib/python3.6/site-packages/scipy/sparse/construct.py in bmat(blocks, format, dtype)
    572         for j in range(N):
    573             if blocks[i,j] is not None:
--> 574                 A = coo_matrix(blocks[i,j])
    575                 blocks[i,j] = A
    576                 block_mask[i,j] = True

~/.local/lib/python3.6/site-packages/scipy/sparse/coo.py in __init__(self, arg1, shape, dtype, copy)
    190             self.data = self.data.astype(dtype, copy=False)
    191 
--> 192         self._check()
    193 
    194     def reshape(self, *args, **kwargs):

~/.local/lib/python3.6/site-packages/scipy/sparse/coo.py in _check(self)
    272         idx_dtype = get_index_dtype(maxval=max(self.shape))
    273         self.row = np.asarray(self.row, dtype=idx_dtype)
--> 274         self.col = np.asarray(self.col, dtype=idx_dtype)
    275         self.data = to_native(self.data)
    276 

~/anaconda3/lib/python3.6/site-packages/numpy/core/numeric.py in asarray(a, dtype, order)
    490 
    491     """
--> 492     return array(a, dtype, copy=False, order=order)
    493 
    494 

MemoryError:

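Where am I going wrong?

If I cannot resolve this error, is there another function I can use to combine sparse matrices with ordinary (dense) matrices after encoding the data, so that I can build a model on them?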

Answer 1

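As the error says, you do not have enough RAM to stack these matrices together. I assume these are very large datasets; by building every stacked matrix this way, you are asking for more data to be held in memory at once than your machine can handle.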
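Before changing anything, it may help to measure how much memory each block needs. The traceback fails inside coo_matrix(blocks[i,j]) while building the average word2vec set, which suggests (though the post does not confirm it) that a dense embedding block is being converted to COO format. COO keeps a row index and a column index next to every stored value, so a matrix with almost no zeros gets larger, not smaller. A minimal sketch, where the shape and the helper sparse_nbytes are made up for illustration:

```python
import numpy as np
from scipy.sparse import coo_matrix


def sparse_nbytes(m):
    """Bytes held by the arrays backing a COO or CSR matrix (illustrative helper)."""
    if m.format == "coo":
        return m.data.nbytes + m.row.nbytes + m.col.nbytes
    if m.format == "csr":
        return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes
    raise ValueError("unsupported format: " + m.format)


# Hypothetical stand-in for avg_w2v_vectors_train: 1000 rows of 300-d vectors.
rng = np.random.default_rng(0)
dense_block = rng.standard_normal((1000, 300))

dense_bytes = dense_block.nbytes
# Every entry is nonzero, so COO stores a value plus two indices per entry.
coo_bytes = sparse_nbytes(coo_matrix(dense_block))

print("dense array:", dense_bytes, "bytes")
print("same data as COO:", coo_bytes, "bytes")
```

Running the same measurement on the real blocks should show which of them dominate the total.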

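So, to answer your question: what went wrong is that you tried to load all of this data in one step, rather than loading it in batches or using some of the fancy manipulation seen in this post.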

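A main reason you are seeing this is that categories_one_hot_train and sub_categories_one_hot_tr are passed to hstack again and again. Referring to a variable does not copy it, but every hstack call builds a new matrix that holds its own copy of each block, so these shared blocks are duplicated in every stacked result. Depending on the size of the blocks, that can easily lead to a memory error.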

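A better approach (although, without knowing anything about your data or how much memory each dataset takes, there is no way to know whether you have enough memory even to read this data) is to use categories_one_hot_train and sub_categories_one_hot_tr only once and build everything from a_train to d_test as one giant matrix. You can then slice that matrix by the columns you need to get each sub-matrix.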

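With this approach, categories_one_hot_train and sub_categories_one_hot_tr are stacked once instead of once per feature set, so memory for them is allocated once rather than again for every stacked matrix.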
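The idea of stacking once and slicing by column can be sketched as follows. The block names mirror the question, but the data is random and tiny, and the helpers fake_block and select are hypothetical. Training and test matrices have different row counts, so in practice this would be done once for the training blocks and once for the test blocks. Note that each slice is itself a new matrix, so the saving comes from materializing only the feature set you are currently working with.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

rng = np.random.default_rng(0)
n_rows = 50


def fake_block(n_cols):
    """Random mostly-zero matrix standing in for a real feature block."""
    return csr_matrix((rng.random((n_rows, n_cols)) < 0.2).astype(np.float64))


# Hypothetical stand-ins for the training blocks in the question.
blocks = {
    "categories": fake_block(5),
    "sub_categories": fake_block(8),
    "text_bow": fake_block(20),
    "text_tfidf": fake_block(20),
    "price": csr_matrix(rng.standard_normal((n_rows, 1))),
}

# Stack every block exactly once and record which columns each one occupies.
names = list(blocks)
widths = [blocks[name].shape[1] for name in names]
offsets = np.concatenate(([0], np.cumsum(widths)))
columns = {
    name: np.arange(offsets[i], offsets[i + 1]) for i, name in enumerate(names)
}
wide = hstack([blocks[name] for name in names], format="csc")


def select(matrix, wanted):
    """Return the columns of `matrix` that belong to the named blocks, in order."""
    cols = np.concatenate([columns[name] for name in wanted])
    return matrix[:, cols]


# Build one feature set at a time instead of keeping all of them alive.
a_train = select(wide, ["categories", "sub_categories", "text_bow", "price"])
print(a_train.shape)
```

CSC format is chosen here because column selection is the only operation performed on the wide matrix; converting the selected slice with .tocsr() before model fitting is a common follow-up.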

If that does not work, it may be best to train and test on a portion of the data so that you do not run out of memory.
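Working on a portion of the data can be as simple as picking a random set of row indices and applying the same indices to every block and to the labels before stacking. A minimal sketch with made-up shapes; sparse_block, dense_block, labels, and sample_size are placeholders, not names from the question:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

rng = np.random.default_rng(0)
n_rows = 1000

# Hypothetical stand-ins: a one-hot/BoW style block, a dense embedding block, labels.
sparse_block = csr_matrix((rng.random((n_rows, 20)) < 0.1).astype(np.float64))
dense_block = rng.standard_normal((n_rows, 30))
labels = rng.integers(0, 2, size=n_rows)

# Draw the row sample once so that all blocks and the labels stay aligned.
sample_size = 200
rows = np.sort(rng.choice(n_rows, size=sample_size, replace=False))

# Only the sampled rows are stacked, so the result is a fraction of the full size.
x_sample = hstack(
    [sparse_block[rows], csr_matrix(dense_block[rows])], format="csr"
)
y_sample = labels[rows]

print(x_sample.shape, y_sample.shape)
```

CSR is used for the row selection because it supports indexing by rows, which the COO matrices returned by hstack by default do not.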

Memory error in Python when using hstack
