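While trying out the new ColumnTransformer feature, I am building a pipeline with scikit-learn 0.20.2. My problem is that when I fit the classifier with clf.fit(x_train, y_train), I keep getting this error: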
ValueError: all the input array dimensions except for the concatenation axis must match exactly
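I have a column of text blocks called text. All of my other columns are numeric. I am trying to use CountVectorizer in my pipeline, and I think that is where the trouble lies. Any help with this would be much appreciated.
After running the pipeline and inspecting x_train / y_train, the data looks like this, in case it helps (the row numbers that normally appear in the left column are omitted, and the text column runs taller than the rows shown in the image).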
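from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split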
# mapped to column names from dataframe
numeric_features = ['hasDate', 'iterationCount', 'hasItemNumber', 'isEpic']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median'))
])
# mapped to column names from dataframe
text_features = ['text']
text_transformer = Pipeline(steps=[
    ('vect', CountVectorizer())
])
preprocessor = ColumnTransformer(
    transformers=[('num', numeric_transformer, numeric_features),
                  ('text', text_transformer, text_features)]
)
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', MultinomialNB())
                      ])
x_train, x_test, y_train, y_test = train_test_split(features, labels, test_size=0.33)
clf.fit(x_train,y_train)
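Answer 0: (score: 0)
I think that if you need to understand or debug your code, you should not use Pipeline. The problem is in your text_transformer. The output of numeric_transformer is as expected: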
# example
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame([['(0,17569)\t1\n(0,8779)\t0\n', 1, 13, 1, 0],
                   ['(0,16118)\t1\n(0,9480)\t1\n', 1, None, 0, 1],
                   ['(0,123)\t1\n(0,456)\t1\n', 1, 15, 0, 0]],
                  columns=('text', 'hasDate', 'iterationCount', 'hasItemNumber', 'isEpic'))
numeric_features = ['hasDate', 'iterationCount', 'hasItemNumber', 'isEpic']
numeric_transformer = SimpleImputer(strategy='median')
num = numeric_transformer.fit_transform(df[numeric_features])
print(num)
#[[ 1. 13. 1. 0.]
# [ 1. 14. 0. 1.]
# [ 1. 15. 0. 0.]]
But text_transformer gives you an array of shape (1, 1), so you need to figure out how to transform the text column:
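from sklearn.feature_extraction.text import CountVectorizer

text_features = ['text']
text_transformer = CountVectorizer()
text = text_transformer.fit_transform(df[text_features])
# get_feature_names() matches scikit-learn 0.20.2; from 1.0 on, use get_feature_names_out()
print(text_transformer.get_feature_names())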
print(text.toarray())
#['text']
#[[1]]
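One way to see where the (1, 1) shape comes from: CountVectorizer iterates over its input and treats each item as one document, and iterating over a DataFrame yields its column labels rather than its rows. The sketch below uses only pandas and a made-up three-row frame (the text values are invented for illustration) to show the difference between selecting with a list and selecting with a string.

```python
import pandas as pd

# Hypothetical toy data; only the column name 'text' comes from the question.
df = pd.DataFrame({'text': ['first ticket body', 'second ticket body', 'third one']})

# Selecting with a list gives a 2-D DataFrame; iterating it yields column labels.
as_frame = list(df[['text']])

# Selecting with a string gives a 1-D Series; iterating it yields the cell values.
as_series = list(df['text'])

print(as_frame)   # ['text']
print(as_series)  # ['first ticket body', 'second ticket body', 'third one']
```

So when the vectorizer is handed df[['text']], the only "document" it sees is the string 'text', which explains both the vocabulary ['text'] and the single count [[1]].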
Answer 1: (score: 0)
Vadim is correct. Running this code reproduces the problem:
numeric_features = ['hasDate', 'iterationCount', 'hasItemNumber', 'isEpic']
numeric_transformer = SimpleImputer(strategy='median')
num = numeric_transformer.fit_transform(df[numeric_features])
# num.shape
# (3, 4)
text_features = ['text']
text_transformer = CountVectorizer()
text = text_transformer.fit_transform(df[text_features])
print(text_transformer.get_feature_names())
print(text.toarray())
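The output looks like this:
['text']
[[1]]
This comes down to a quirk of text processing that I have run into before. If you define text_features as a string instead of a one-element list, so that the code becomes this: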
text_features = 'text'
text_transformer = CountVectorizer()
text = text_transformer.fit_transform(df[text_features])
print(text_transformer.get_feature_names())
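print(text.toarray())
This gives the output you want.
Passing the column name as a list makes CountVectorizer see only one item: df[['text']] is a DataFrame, and iterating over a DataFrame yields its column labels, so the only document the vectorizer receives is the string 'text'. With a plain string, df['text'] is a Series, and each row becomes its own document.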
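Applied to the original pipeline, the same change would mean passing the text column to ColumnTransformer as a string rather than a list. The following is a minimal sketch under a few assumptions: the toy rows and labels are invented for illustration, and the transformers are passed directly instead of being wrapped in one-step Pipelines as in the question. It is not a drop-in replacement for the asker's code.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy data that reuses the question's column names.
features = pd.DataFrame({
    'text': ['fix login bug', 'add export feature', 'login page crashes', 'export to csv'],
    'hasDate': [1, 0, 1, 0],
    'iterationCount': [13, None, 15, 2],
    'hasItemNumber': [1, 0, 0, 1],
    'isEpic': [0, 1, 0, 0],
})
labels = [0, 1, 0, 1]

numeric_features = ['hasDate', 'iterationCount', 'hasItemNumber', 'isEpic']
text_feature = 'text'  # a plain string, so the vectorizer receives a 1-D column

preprocessor = ColumnTransformer(transformers=[
    ('num', SimpleImputer(strategy='median'), numeric_features),
    ('text', CountVectorizer(), text_feature),
])

clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', MultinomialNB()),
])

clf.fit(features, labels)

# The text block now has one row per sample, so it can be stacked with the numeric block.
transformed = clf.named_steps['preprocessor'].transform(features)
print(transformed.shape)
predictions = clf.predict(features)
```

The list form is still the right choice for the numeric columns, because SimpleImputer expects 2-D input; only transformers that expect a 1-D sequence of documents, such as CountVectorizer, need the string form.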