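I have float vectors created by the doc2vec algorithm, along with their labels. When I use a simple classifier, it works fine and gives the expected accuracy. The working code is below: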
from sklearn.svm import LinearSVC
import pandas as pd
import numpy as np
train_vecs #ndarray (20418,100)
#train_vecs = [[0.3244, 0.3232, -0.5454, 1.4543, ...],...]
y_train #labels
test_vecs #ndarray (6885,100)
y_test #labels
classifier = LinearSVC()
classifier.fit(train_vecs, y_train )
print('Test Accuracy: %.2f'%classifier.score(test_vecs, y_test))
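But now I want to move it into a Pipeline, because in the future I plan to build a FeatureUnion with different features. What I did was move the vectors into a dataframe and then use 2 custom transformers to i) select the column and ii) change the array type. Strangely, the exact same data, with exactly the same shape, dtype and type, gives an accuracy of 0.0005. This makes no sense to me at all; it should give almost the same accuracy. After the ArrayCaster transformer, the shape and type of the input are exactly the same as before. The whole thing has been really frustrating.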
from sklearn.svm import LinearSVC
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
# transformer that picks a column from the dataframe
class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X):
        print('item selector type',type(X[self.column]))
        print('item selector shape',len(X[self.column]))
        print('item selector dtype',X[self.column].dtype)
        return (X[self.column])

# transformer that converts the series into an ndarray
class ArrayCaster(BaseEstimator, TransformerMixin):
    def fit(self, x, y=None):
        return self

    def transform(self, data):
        # convert once and reuse the result for the debug prints
        arr = np.array(data.tolist())
        print('array caster type',type(arr))
        print('array caster shape',arr.shape)
        print('array caster dtype',arr.dtype)
        return arr
train_vecs #ndarray (20418,100)
y_train #labels
test_vecs #ndarray (6885,100)
y_test #labels
train['vecs'] = pd.Series(train_vecs.tolist())
val['vecs'] = pd.Series(test_vecs.tolist())
classifier = Pipeline([
    ('selector', ItemSelector(column='vecs')),
    ('array', ArrayCaster()),
    ('clf', LinearSVC())])
classifier.fit(train, y_train)
print('Test Accuracy: %.2f'%classifier.score(test, y_test))
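The debug prints in the transformers compare only type, shape and dtype, which cannot show whether the rows are still in the same order as the labels. A minimal sketch of a value-level check follows; the helper name `first_mismatched_row` and the toy arrays are hypothetical and not part of the original code.

```python
import numpy as np

def first_mismatched_row(actual, expected):
    """Return the index of the first row that differs, or None if all rows match."""
    actual = np.asarray(actual, dtype=float)
    expected = np.asarray(expected, dtype=float)
    if actual.shape != expected.shape:
        raise ValueError(f"shape mismatch: {actual.shape} vs {expected.shape}")
    # A row counts as differing if any of its entries differ (NaN never equals NaN).
    differing = np.flatnonzero((actual != expected).any(axis=1))
    return int(differing[0]) if differing.size else None

# Toy data standing in for the real vectors.
expected = np.arange(6, dtype=float).reshape(3, 2)
reordered = expected[[0, 2, 1]]  # same shape and dtype, different row order

print(first_mismatched_row(expected.copy(), expected))
print(first_mismatched_row(reordered, expected))
```

In the setting of the question, one could pass the output of `ArrayCaster().transform(train['vecs'])` as `actual` and `train_vecs` as `expected`.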
Answer 0 (score: 0)
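Sorry, I figured it out. It was a really annoying mistake to notice. All I had to do was cast them to a list and put that in the dataframe, instead of converting them to a Series. Change this: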
train['vecs'] = pd.Series(train_vecs.tolist())
val['vecs'] = pd.Series(test_vecs.tolist())
to:
train['vecs'] = list(train_vecs)
val['vecs'] = list(test_vecs)
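A likely reason this change matters (an assumption, since the answer does not spell it out): when a `pd.Series` is assigned to a DataFrame column, pandas aligns it on index labels, whereas a plain list is assigned by position. If `train` and `val` do not carry a default 0..n-1 index, for example after shuffling or splitting, the vectors can end up in a different row order than the labels while shape and dtype stay the same. A minimal sketch with made-up toy data (the names `vecs` and `frame` are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three 2-dimensional "document vectors".
vecs = np.array([[0.0, 0.1], [1.0, 1.1], [2.0, 2.1]])

# A frame whose index is not 0..n-1, as can happen after shuffling or splitting.
frame = pd.DataFrame({"label": ["a", "b", "c"]}, index=[2, 0, 1])

# Series assignment aligns on index labels: the row labelled 2 receives vecs[2], etc.
frame["from_series"] = pd.Series(vecs.tolist())

# List assignment is positional: the first row receives vecs[0], whatever its label.
frame["from_list"] = list(vecs)

series_order = np.array(frame["from_series"].tolist())
list_order = np.array(frame["from_list"].tolist())

print(series_order.shape == list_order.shape)  # shapes agree either way
print(np.array_equal(series_order, vecs))      # row order changed by alignment
print(np.array_equal(list_order, vecs))        # row order preserved
```

If a Series is preferred, passing the frame's own index, as in `pd.Series(vecs.tolist(), index=frame.index)`, should give the same positional result.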