Question

I am doing some NLP classification and I want to build a stacking ensemble.

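My original data contains different levels of description for each class. For example, for one instance we may initially have a column with the name, a column with a short description, a column with a description of its subcategory, and so on.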

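X_train in my code below is where each column contains all the words at some level of granularity. For example, the first column might be the short description, the second column might be words from the subcategory description, possibly from another source, and the third column might be more words from a finer-grained classification.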

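I have included the workflow with pipe_1 and pipe_2 wrapped in a StackingClassifier, because that is what I am trying to get working, but the error also occurs if I just run pipe_1 on its own (fitting directly on pipe_1).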

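I tried changing the format of X_train and y_train (using ravel() and .tolist()), but I think the problem may arise when the pipeline uses the ColumnSelector, and I am not sure how to deal with that.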

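The types of X_train (<class 'pandas.core.frame.DataFrame'>) and y_train (<class 'pandas.core.series.Series'>) are the same as when I successfully do a non-stacked run. In the successful run, what is passed to the fit method is a <class 'scipy.sparse.csr.csr_matrix'>. I would expect the same in the stacking example, since I expect TfidfVectorizer to produce that. The main difference I see (and I think that maybe, with more than one column, each row might produce the problematic numpy.ndarray?) is that for stacking, X_train has several columns rather than one. But I would have expected the ColumnSelector in make_pipeline to "take care of this".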

import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from mlxtend.feature_selection import ColumnSelector
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from mlxtend.classifier import StackingClassifier

# creating my toy trainset and testset
start = [
    ['apple this is painful Two wrongs make a right ok',
     'just a batch of suspicious words and banana',
     'another batch of fake words and another apple'],
    ['Fortune favors the italic sunny sunshine',
     'name of a company and then its description',
     'is it all sunshine or doomed to fail to no sunshine'],
    ['this was it when in rome do as the romans and make fortune',
     'well again the same thing and those descriptions',
     'lets make that work and bring the fortune'],
    ['Ok this is the last one and then its the end',
     'is it the beggining of the end or the end of the beggining',
     'allelouia']
]

X_train = pd.DataFrame(
    start, columns=['High_level', 'Mid_level', 'Low_level'])
y_train = ['A', 'B', 'C', 'D']
X_test = pd.DataFrame([['mostly apple'], ['bunch of apple'],
                       ['lot of fortune'], ['make fortune and bring the'],
                       ['beggining of the end']])
y_true = ['A', 'A', 'C', 'C', 'D']

The error occurs in the following lines:

pipe_1 = make_pipeline(ColumnSelector(cols=(1,)), TfidfVectorizer(min_df=1),
                     LogisticRegression(multi_class='multinomial'))
pipe_2 = make_pipeline(ColumnSelector(cols=(2,)), TfidfVectorizer(min_df=1),
                     LogisticRegression(multi_class='multinomial'))
sclf = StackingClassifier(
        classifiers=[pipe_1, pipe_2],
        meta_classifier=LogisticRegression(
            solver='lbfgs', multi_class='multinomial',
            C=1.0, class_weight='balanced', tol=1e-6, max_iter=1000,
            n_jobs=-1))
predictions = sclf.fit(X_train, y_train).predict(X_test)

Here is the full error:

Traceback (most recent call last):
  File "C:/Users/inf10926/PycharmProjects/profiling/venv/lab.py", line 52, in <module>
    predictions = sclf.fit(X_train, y_train).predict(X_test)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\mlxtend\classifier\stacking_classification.py", line 161, in fit
    clf.fit(X, y)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\pipeline.py", line 352, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\pipeline.py", line 317, in _fit
    **fit_params_steps[name])
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\joblib\memory.py", line 355, in __call__
    return self.func(*args, **kwargs)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\pipeline.py", line 716, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 1652, in fit_transform
    X = super().fit_transform(raw_documents)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 1058, in fit_transform
    self.fixed_vocabulary_)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 970, in _count_vocab
    for feature in analyze(doc):
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 352, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 256, in <lambda>
    return lambda x: strip_accents(x.lower())
AttributeError: 'numpy.ndarray' object has no attribute 'lower'

Process finished with exit code 1

If I set lowercase=False in TfidfVectorizer, I get a different error:

Traceback (most recent call last):
  File "C:/Users/inf10926/PycharmProjects/profiling/venv/lab.py", line 52, in <module>
    predictions = sclf.fit(X_train, y_train).predict(X_test)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\mlxtend\classifier\stacking_classification.py", line 161, in fit
    clf.fit(X, y)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\pipeline.py", line 352, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\pipeline.py", line 317, in _fit
    **fit_params_steps[name])
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\joblib\memory.py", line 355, in __call__
    return self.func(*args, **kwargs)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\pipeline.py", line 716, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 1652, in fit_transform
    X = super().fit_transform(raw_documents)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 1058, in fit_transform
    self.fixed_vocabulary_)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 970, in _count_vocab
    for feature in analyze(doc):
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 352, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 265, in <lambda>
    return lambda doc: token_pattern.findall(doc)
TypeError: cannot use a string pattern on a bytes-like object

Answer 1

I faced the same problem. I solved it by adding drop_axis=True to ColumnSelector. This parameter is needed when selecting only one column.

See the API reference here: http://rasbt.github.io/mlxtend/user_guide/feature_selection/ColumnSelector/#api
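
For intuition about why the original call fails, the behaviour can be reproduced without mlxtend. This sketch assumes only NumPy and scikit-learn; the array `column_2d` is a hypothetical stand-in for what a single-column selection returns when the extra axis is kept. Iterating over a 2-D array yields one-element arrays rather than strings, so the vectorizer's lowercasing step has nothing to call .lower() on:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for a single selected column that kept its 2-D shape.
column_2d = np.array([["banana words"], ["company description"]], dtype=object)

# Each "document" seen by the vectorizer is a one-element ndarray, not a str.
first_row = next(iter(column_2d))
print(type(first_row).__name__, hasattr(first_row, "lower"))

try:
    TfidfVectorizer().fit_transform(column_2d)
    failed = False
except AttributeError as exc:
    failed = True
    print("AttributeError:", exc)

# Flattening to 1-D gives the vectorizer one string per document.
flat = column_2d.ravel()
matrix = TfidfVectorizer().fit_transform(flat)
print(flat.shape, matrix.shape)
```

This is only an illustration of the shape mismatch, not a replacement for the drop_axis=True fix described above.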
