Sklearn: ValueError when combining text and numeric features with ColumnTransformer

Time: 2019-02-05 19:12:44

Tags: python machine-learning scikit-learn

While trying out the new ColumnTransformer feature, I am building a pipeline with SKLearn 0.20.2. My problem is that when I run my classifier, clf.fit(x_train, y_train), I keep getting this error:

ValueError: all the input array dimensions except for the concatenation axis must match exactly

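I have one column of text blocks named text. All my other columns are numeric. I am trying to use CountVectorizer in my pipeline, and I think that is where the trouble is. Any help would be much appreciated.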

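In case it helps, this is what x_train / y_train look like after running the pipeline (the row numbers normally shown in the left column are omitted, and the text column runs taller than the image shows).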


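from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline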

# mapped to column names from dataframe
numeric_features = ['hasDate', 'iterationCount', 'hasItemNumber', 'isEpic']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median'))
])

# mapped to column names from dataframe
text_features = ['text']
text_transformer = Pipeline(steps=[
    ('vect', CountVectorizer())
])

preprocessor = ColumnTransformer(
    transformers=[('num', numeric_transformer, numeric_features),('text', text_transformer, text_features)]
)

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', MultinomialNB())
                     ])

x_train, x_test, y_train, y_test = train_test_split(features, labels, test_size=0.33)
clf.fit(x_train,y_train)

2 answers:

Answer 0: (score: 0)

I think that if you need to understand or debug your code, you should not use a Pipeline. The problem is in your text_transformer. The output of numeric_transformer is as expected:

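import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer

# example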
df = pd.DataFrame([['(0,17569)\t1\n(0,8779)\t0\n', 1, 13, 1, 0],
                   ['(0,16118)\t1\n(0,9480)\t1\n', 1, None, 0, 1],
                   ['(0,123)\t1\n(0,456)\t1\n', 1, 15, 0, 0]],
                  columns=('text', 'hasDate', 'iterationCount', 'hasItemNumber', 'isEpic'))

numeric_features = ['hasDate', 'iterationCount', 'hasItemNumber', 'isEpic']
numeric_transformer = SimpleImputer(strategy='median')

num = numeric_transformer.fit_transform(df[numeric_features])

print(num)

#[[ 1. 13.  1.  0.]
# [ 1. 14.  0.  1.]
# [ 1. 15.  0.  0.]]

But text_transformer gives you an array of shape (1, 1), so you need to figure out how to transform the text column:

text_features = ['text']
text_transformer = CountVectorizer()

text = text_transformer.fit_transform(df[text_features])

print(text_transformer.get_feature_names())
print(text.toarray())

#['text']
#[[1]]
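The (1, 1) shape above can be reproduced in isolation. The following is a minimal sketch, assuming a small made-up DataFrame (the sentences and the 3 x 4 numeric block are hypothetical, not the asker's data). It shows that iterating over a DataFrame yields column labels rather than rows, which is what CountVectorizer ends up tokenizing, and that stacking the result next to a numeric block with a different row count raises a ValueError.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data; only the column name 'text' comes from the question.
df = pd.DataFrame({'text': ['red apple', 'green apple', 'red car']})

# Iterating over a DataFrame yields its column labels.
docs_from_frame = list(df[['text']])
# Iterating over a Series yields the cell values, one per row.
docs_from_series = list(df['text'])
print(docs_from_frame)
print(docs_from_series)

# CountVectorizer simply iterates over its input, so the DataFrame
# is treated as a single document whose content is the string 'text'.
frame_counts = CountVectorizer().fit_transform(df[['text']])
series_counts = CountVectorizer().fit_transform(df['text'])
print(frame_counts.shape)
print(series_counts.shape)

# Stacking a 3-row numeric block next to the 1-row text block fails.
stack_failed = False
try:
    np.hstack([np.zeros((3, 4)), frame_counts.toarray()])
except ValueError as exc:
    stack_failed = True
    print('ValueError:', exc)
```

With the Series input the text block has one row per sample, so it can be concatenated with the numeric features.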

Answer 1: (score: 0)

Vadim is correct. If you run this code

numeric_features = ['hasDate', 'iterationCount', 'hasItemNumber', 'isEpic']
numeric_transformer = SimpleImputer(strategy='median')

num = numeric_transformer.fit_transform(df[numeric_features])

# num.shape
# (3, 4)

text_features = ['text']
text_transformer = CountVectorizer()

text = text_transformer.fit_transform(df[text_features])

print(text_transformer.get_feature_names())
print(text.toarray())

the output will be as follows:

['text']
[[1]]

This is due to an inconvenience in the text handling that I have run into as well.

If you define text_features as a string instead of a one-element list, the code becomes this

text_features = 'text'
text_transformer = CountVectorizer()

text = text_transformer.fit_transform(df[text_features])

print(text_transformer.get_feature_names())
print(text.toarray())

which gives you what you want.

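Passing the column name as a list makes CountVectorizer see only one item. The reason is that df[['text']] is a DataFrame, and iterating over a DataFrame yields its column labels, so the vectorizer receives the single string 'text' as its only document. df['text'] is a Series, which yields one document per row.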
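Applied to the pipeline from the question, the same idea might look like the sketch below. It assumes a small made-up DataFrame and label list (the rows and the 'bug' / 'feature' labels are hypothetical); only the column names and the estimators come from the question. The change is that the text column is selected with the plain string 'text', so ColumnTransformer passes a 1-D Series to CountVectorizer instead of a one-column DataFrame.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data reusing the question's column names.
df = pd.DataFrame({
    'text': ['fix login bug', 'add export feature',
             'fix crash on login', 'add search feature'],
    'hasDate': [1, 0, 1, 0],
    'iterationCount': [13, None, 15, 2],
    'hasItemNumber': [1, 0, 0, 1],
    'isEpic': [0, 1, 0, 0],
})
labels = ['bug', 'feature', 'bug', 'feature']  # hypothetical targets

numeric_features = ['hasDate', 'iterationCount', 'hasItemNumber', 'isEpic']

preprocessor = ColumnTransformer(transformers=[
    # A list selects a 2-D block, which is what SimpleImputer expects.
    ('num', SimpleImputer(strategy='median'), numeric_features),
    # A plain string (not ['text']) selects a 1-D Series for CountVectorizer.
    ('text', CountVectorizer(), 'text'),
])

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', MultinomialNB())])
clf.fit(df, labels)

# The combined feature matrix now has one row per sample.
transformed = clf.named_steps['preprocessor'].transform(df)
print(transformed.shape)
print(clf.predict(df))
```

The numeric columns keep their list selector because SimpleImputer works on 2-D input; only the transformer that expects an iterable of documents needs the scalar column name.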