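I know similar questions have been asked before, and I have tried every suggestion in them, but I am still stumped. I have a dataset with 2 columns: the first holds vectors representing words, each stored as a 1x10000 sparse CSR matrix (so every cell contains a matrix), and the second holds the integer ratings I will use for classification. When I run the following code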
for index, row in data.iterrows():
    print(row)
    print(row[0].shape)
I get the correct output for every row:
Name: 0, dtype: object
(1, 10000)
Vector (0, 0)\t1.0\n (0, 1)\t1.0\n (0, 2)\t1.0\n ...
Rating 5
Now, when I try to pass the data to any sklearn classifier, like so:
uniform_random_classifier = DummyClassifier(strategy='uniform')
uniform_random_classifier.fit(data["Vector"], data["Rating"])
I get the following error:
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence.
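What am I doing wrong? I made sure all the sparse matrices are the same size, and I tried reshaping the data in various ways, with no luck. Sklearn classifiers are supposed to be able to handle CSR matrices.
Update: Converting the entire "Vector" column into one large 2-D matrix did the trick. For completeness, here is the code I used to generate the DataFrame, in case anyone is curious and wants to try solving the original problem. Assume data is a pandas DataFrame whose rows look like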
"560 420 222" 5.0
"2345 2344 2344 5" 3.0
import pandas as pd
from scipy import sparse


def vectorize(feature, size):
    """Given a numeric string generated from a vocabulary table, return a binary vector
    representation of each feature."""
    vector = sparse.lil_matrix((1, size))
    for number in feature.split(' '):
        try:
            # number - 1 maps the vocabulary id to a 0-based column index
            vector[0, int(number) - 1] = 1
        except ValueError:
            # Skip tokens that cannot be parsed as integers
            pass
    return vector


def vectorize_dataset(data, vectorize, size):
    """Given a dataset in the appropriate "num num num..." format, a specific vectorization
    function, and a vector size, return the dataset in vectorized form."""
    result_data = pd.DataFrame(index=range(data.shape[0]), columns=["Vector", "Rating"])
    for index, row in data.iterrows():
        # All the mixing up of decodings and encodings has made it so that Pandas incorrectly parses EOF chars
        if isinstance(row[0], str):
            result_data.iat[index, 0] = vectorize(row[0], size).tocsr()
            result_data.iat[index, 1] = row[1]
    return result_data
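
For reference, the workaround from the update (stacking the per-row matrices into one 2-D matrix) might look roughly like the sketch below. The toy frame, its 1x5 vectors, and the names `cells`, `X` and `y` are made up for illustration and are not from my real data; only the column names "Vector" and "Rating" match the code above.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.dummy import DummyClassifier

# Toy stand-in for the real frame: each cell holds one 1 x 5 CSR matrix
# (the real vectors are 1 x 10000; the small size here is an assumption).
cells = np.empty(3, dtype=object)
cells[0] = sparse.csr_matrix([[1.0, 0.0, 1.0, 0.0, 0.0]])
cells[1] = sparse.csr_matrix([[0.0, 1.0, 0.0, 0.0, 1.0]])
cells[2] = sparse.csr_matrix([[1.0, 1.0, 0.0, 0.0, 0.0]])
data = pd.DataFrame({"Vector": cells, "Rating": [5, 3, 5]})

# Stack the per-row matrices into a single (n_samples, n_features) CSR matrix
# instead of passing a column of matrix objects to the classifier.
X = sparse.vstack(data["Vector"].tolist(), format="csr")
y = data["Rating"].astype(int).to_numpy()

uniform_random_classifier = DummyClassifier(strategy="uniform", random_state=0)
uniform_random_classifier.fit(X, y)
predictions = uniform_random_classifier.predict(X)
```

With this layout the classifier receives one sparse matrix with a row per sample, rather than a pandas Series whose elements are themselves matrices.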