Sklearn and sparse matrix ValueError

Time: 2018-02-22 17:48:04

Tags: numpy scipy scikit-learn sklearn-pandas

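I know similar questions have been asked before, and I have tried every suggestion in them, but I am still stumped. I have a dataset with 2 columns: the first holds vector representations of words, each stored as a 1x10000 sparse CSR matrix (so every cell contains a matrix), and the second holds the integer ratings I will use for classification. When I run the following code: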

for index, row in data.iterrows():
    print(row)
    print(row[0].shape)

I get the correct output for every row:

Name: 0, dtype: object
(1, 10000)
Vector      (0, 0)\t1.0\n  (0, 1)\t1.0\n  (0, 2)\t1.0\n ...
Rating                                                    5

Now, when I try to pass the data to any sklearn classifier, like this:

from sklearn.dummy import DummyClassifier
uniform_random_classifier = DummyClassifier(strategy='uniform')
uniform_random_classifier.fit(data["Vector"], data["Rating"])

I get the following error:

    array = np.array(array, dtype=dtype, order=order, copy=copy)
     ValueError: setting an array element with a sequence.

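What am I doing wrong? I made sure all the sparse matrices are the same size, and I tried reshaping the data in various ways with no luck. Sklearn classifiers should be able to handle CSR matrices.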
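
One plausible reading of the traceback, offered as a sketch rather than a confirmed diagnosis: a pandas column whose cells each hold a separate 1xN sparse matrix is an object-dtype Series of length n_samples, not a single 2-D (n_samples, N) matrix, and the `np.array(array, dtype=dtype, ...)` call in the traceback has to turn that Series into a numeric array. The snippet below uses a tiny made-up DataFrame (three rows, width 5, hypothetical values) to show what the estimator actually receives.

```python
import pandas as pd
from scipy import sparse

# Hypothetical stand-in for the real data: three rows, each cell a 1x5 CSR matrix.
rows = [
    sparse.csr_matrix([[1.0, 0.0, 0.0, 1.0, 0.0]]),
    sparse.csr_matrix([[0.0, 1.0, 0.0, 0.0, 0.0]]),
    sparse.csr_matrix([[0.0, 0.0, 1.0, 1.0, 1.0]]),
]
data = pd.DataFrame({"Vector": rows, "Rating": [5, 3, 4]})

column = data["Vector"]
print(column.dtype)          # object: the matrices are stored as opaque Python objects
print(column.shape)          # (3,): one dimension, not (3, 5)
print(column.iloc[0].shape)  # (1, 5): each cell is its own matrix
```

Each cell having the right shape is therefore not enough; the estimator needs one matrix with one row per sample.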

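Update: converting the entire "Vector" column into one big 2-D matrix did the trick, but for completeness, here is the code I used to generate the DataFrame, in case anyone is curious and wants to try to solve the original problem. Assume data is a pandas DataFrame whose rows look like: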

"560 420 222" 5.0

"2345 2344 2344 5" 3.0

from scipy import sparse

def vectorize(feature, size):
    """Given a numeric string generated from a vocabulary table, return a binary vector
    representation of each feature."""
    vector = sparse.lil_matrix((1, size))
    for number in feature.split(' '):
        try:
            # Shift the id by one to get a 0-based column index.
            vector[0, int(number) - 1] = 1
        except ValueError:
            # Skip tokens that cannot be parsed as integers.
            pass
    return vector
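
As a quick check of what `vectorize` returns, here is a small usage sketch on one of the sample rows from the post. The function body is copied from the question so the snippet runs on its own; the width of 10000 comes from the post.

```python
from scipy import sparse

def vectorize(feature, size):
    # Same logic as the function in the post.
    vector = sparse.lil_matrix((1, size))
    for number in feature.split(' '):
        try:
            vector[0, int(number) - 1] = 1
        except ValueError:
            pass  # skip tokens that are not integers
    return vector

row = vectorize("2345 2344 2344 5", 10000).tocsr()
columns = sorted(int(c) for c in row.nonzero()[1])
print(row.shape)  # (1, 10000)
print(columns)    # [4, 2343, 2344]
print(row.nnz)    # 3: the repeated id 2344 is stored only once
```

The result is a single 1x10000 row, which is why a whole column of them still has to be combined before fitting.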

import pandas as pd

def vectorize_dataset(data, vectorize, size):
    """Given a dataset in the appropriate "num num num..." format, a specific vectorization
    function, and a vector size, return the dataset in vectorized form."""
    result_data = pd.DataFrame(index=range(data.shape[0]), columns=["Vector", "Rating"])
    for index, row in data.iterrows():
        # All the mixing up of decodings and encodings has made it so that pandas
        # incorrectly parses EOF chars, so skip rows whose first field is not a string.
        if isinstance(row.iloc[0], str):
            result_data.iat[index, 0] = vectorize(row.iloc[0], size).tocsr()
            result_data.iat[index, 1] = row.iloc[1]
    return result_data
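
The conversion mentioned in the update (turning the whole "Vector" column into one 2-D matrix) can be sketched with `scipy.sparse.vstack`. This is a minimal illustration, not necessarily the exact code the asker used: the stand-in `result_data` below is made up (four rows, width 6), and `random_state=0` is added only to make the run repeatable.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.dummy import DummyClassifier

# Hypothetical stand-in for result_data: each "Vector" cell is a 1x6 CSR matrix.
result_data = pd.DataFrame({
    "Vector": [sparse.csr_matrix(np.eye(6)[[i]]) for i in range(4)],
    "Rating": [5, 3, 5, 1],
})

# Stack the per-row matrices into one (n_samples, n_features) CSR matrix.
X = sparse.vstack(result_data["Vector"].tolist(), format="csr")
# Ratings filled in cell by cell may end up as object dtype, so cast explicitly.
y = result_data["Rating"].astype(int).to_numpy()

clf = DummyClassifier(strategy="uniform", random_state=0)
clf.fit(X, y)
print(X.shape)               # (4, 6)
print(clf.predict(X).shape)  # (4,)
```

With the real data, any rows that `vectorize_dataset` skipped would still hold NaN in both columns, so they would need to be dropped (for example with `dropna()`) before stacking.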

0 answers:

No answers yet.