Merging numpy arrays with different column dimensions

Time: 2013-12-01 15:59:44

Tags: python numpy machine-learning

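For a machine learning task, I am looking for a way to merge two feature matrices with different dimensions so that I can feed both of them to an estimator. I cannot use the scipy merge methods, because they require compatible shapes. I can use the numpy merge methods, but then I get an error when I actually try to split the array for cross-validation. The error looks like this: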

 Traceback (most recent call last):
  File "C:\Users\Ano\workspace\final_submission\src\linearSVM.py", line 50, in <module>
    result = ridge(train_text,train_labels,test_set,train_state,test_state)
  File "C:\Users\Ano\workspace\final_submission\src\Algorithms.py", line 90, in ridge
    x_train, x_test, y_train, y_test = cross_validation.train_test_split(train, labels, test_size = 0.2, random_state = 42)
  File "C:\Python27\lib\site-packages\sklearn\cross_validation.py", line 1394, in train_test_split
    arrays = check_arrays(*arrays, **options)
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 211, in check_arrays
    % (size, n_samples))
ValueError: Found array with dim 77946. Expected 2

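I found the cause of this error in another Stack Overflow thread: Concatenate sparse matrices in Python using SciPy/Numpy. Apparently np.vstack / np.hstack creates two matrix objects instead of one merged matrix, and that leads to my error.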

The shapes I am working with:

(77946, 63677)

(77946, 55)

Basically, I am looking for a way to append the 55 extra features per sample from the second matrix to the features in the first matrix.

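I also tried to create a numpy array with the appropriate dimensions and simply fill it with the feature matrices, but even creating that array gives me a memory error. I tried converting it to a sparse matrix, but that did not work either. Maybe I am doing something wrong there?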

new_matrix = sparse.csr_matrix(np.zeros((77946,63727)))
new_matrix[:,0:63676] = big_feature_matrix
new_matrix[:,63677:63727] = small_feature_matrix
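
A rough check on why this attempt runs out of memory: `np.zeros` allocates the full dense array before `sparse.csr_matrix` ever sees it. The sketch below only does the arithmetic, assuming the default float64 dtype (8 bytes per element); the variable names are illustrative. It also shows that the combined width for the shapes above is 63677 + 55 columns, which differs from the 63727 used in the snippet.

```python
# Back-of-the-envelope size of the dense array from the attempt above.
# Assumption: np.zeros uses its default dtype, float64 (8 bytes per element).
n_samples = 77946
n_big, n_small = 63677, 55      # column counts of the two feature matrices

attempted_cols = 63727          # width used in the np.zeros call above
needed_cols = n_big + n_small   # width actually required to hold both blocks

dense_bytes = n_samples * attempted_cols * 8
dense_gib = dense_bytes / 2**30

print("columns needed:", needed_cols)
print("dense allocation: %.1f GiB" % dense_gib)
```

An empty sparse matrix can be created directly from a shape, for example `sparse.csr_matrix((n_samples, needed_cols))`, without going through a dense array first.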

Update: I tried Jaime's solution, but it gave me an error.

The code involved:

from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


def feature_extraction(train, test, train_small, test_small):
    # TF-IDF over word unigrams and bigrams for the main text
    vectorizer = TfidfVectorizer(min_df=3, strip_accents="unicode", ngram_range=(1, 2))
    # Plain token counts for the small extra feature
    cv = CountVectorizer(strip_accents="unicode", analyzer="word", token_pattern=r'\w{1,}')

    print("fitting Vectorizer")
    vectorizer.fit(train)
    train_small = cv.fit_transform(train_small)
    test_small = cv.transform(test_small)
    print("transforming text")
    train = vectorizer.transform(train)
    test = vectorizer.transform(test)

    # Append the extra feature columns to the text features (row counts must match)
    new_train = sparse.hstack((train, train_small),
                              format='csr')
    new_test = sparse.hstack((test, test_small),
                             format='csr')

    return new_train, new_test

Full traceback:

Traceback (most recent call last):
  File "C:\Users\Ano\workspace\final_submission\src\linearSVM.py", line 50, in <module>
    result = ridge(train_text,train_labels,test_set,train_small,test_small)
  File "C:\Users\Ano\workspace\final_submission\src\Algorithms.py", line 89, in ridge
    train,test = feature_extraction(train,test,train_small,test_small)
  File "C:\Users\Ano\workspace\final_submission\src\Preprocessing.py", line 109, in feature_extraction
    format='csr')
  File "C:\Python27\lib\site-packages\scipy\sparse\construct.py", line 423, in hstack
    return bmat([blocks], format=format, dtype=dtype)
  File "C:\Python27\lib\site-packages\scipy\sparse\construct.py", line 523, in bmat
    raise ValueError('blocks[%d,:] has incompatible row dimensions' % i)
ValueError: blocks[0,:] has incompatible row dimensions

The training set has the same dimensions as before. The test set has fewer samples (42157).
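
`sparse.hstack` places blocks side by side, so every block in one call needs the same number of rows. Below is a minimal sketch of a pre-check that reports the offending shapes before stacking. `hstack_checked` is a hypothetical helper, not part of SciPy, and the small matrices are made-up stand-ins for the real feature matrices.

```python
import numpy as np
from scipy import sparse


def hstack_checked(blocks):
    """Stack sparse blocks horizontally after verifying the row counts match.

    Hypothetical helper for illustration; not part of SciPy.
    """
    row_counts = [b.shape[0] for b in blocks]
    if len(set(row_counts)) != 1:
        shapes = [b.shape for b in blocks]
        raise ValueError("row counts differ, cannot hstack: %s" % shapes)
    return sparse.hstack(blocks, format="csr")


# Toy data: 4 samples in the text features, but only 3 in the extra features.
text_features = sparse.csr_matrix(np.ones((4, 3)))
extra_features = sparse.csr_matrix(np.ones((3, 2)))

try:
    hstack_checked([text_features, extra_features])
except ValueError as exc:
    print(exc)
```

Printing the shapes of each pair right before stacking makes it easy to see which input was loaded with the wrong number of samples.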

Update

Jaime's solution does work; I had simply messed up when loading my files. Thanks for the help!

1 answer:

Answer 0 (score: 4)

You can use scipy.sparse.hstack:

new_matrix = scipy.sparse.hstack((big_feature_matrix, small_feature_matrix),
                                 format='csr')
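
A small self-contained sketch of the same call, using made-up toy matrices in place of `big_feature_matrix` and `small_feature_matrix`. The result stays sparse and holds the columns of both inputs side by side, so no dense array of the full size is allocated.

```python
import numpy as np
from scipy import sparse

# Toy stand-ins for the real matrices: same number of rows (samples),
# different number of columns (features).
big_feature_matrix = sparse.csr_matrix(np.array([[1, 0, 0],
                                                 [0, 2, 0],
                                                 [0, 0, 3],
                                                 [4, 0, 0]]))
small_feature_matrix = sparse.csr_matrix(np.array([[0, 5],
                                                   [6, 0],
                                                   [0, 0],
                                                   [0, 7]]))

new_matrix = sparse.hstack((big_feature_matrix, small_feature_matrix),
                           format='csr')

print(new_matrix.shape)      # rows unchanged, column counts added together
print(new_matrix.toarray())  # dense view, only sensible for tiny examples
```

Because the result is in CSR format, it supports row slicing such as `new_matrix[:2]`, which is what splitting into training and validation subsets relies on.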