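I'm working on some NLP classification and I want to build a stacking ensemble.
My raw data contains several levels of description for each class. For example, for one instance we might initially have a column with the name, a column with a short description, a column with the description of its subcategory, and so on.
X_train in my code below is where each column contains all the words for a certain granularity. For example, the first column may be the short description, the second column may hold the words of the subcategory description (possibly from another source), and the third column may hold more words from a finer-grained category.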
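I included the workflow with pipe_1 and pipe_2 wrapped in StackingClassifier because that is what I'm trying to get working, but the same error occurs if I just run pipe_1 on its own (fitting directly on pipe_1).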
I tried changing the format of X_train and y_train (using ravel() and .tolist()), but I think the problem may arise when the pipeline uses ColumnSelector, and I'm not sure how to handle it.
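The types of X_train (<class 'pandas.core.frame.DataFrame'>) and y_train (<class 'pandas.core.series.Series'>) are the same as when I do a successful non-stacked run. In the successful run, what gets passed to the classifier's fit method is a <class 'scipy.sparse.csr.csr_matrix'>. I would expect the same in the stacked example, since I expect TfidfVectorizer to produce it. The main difference I see is that for stacking, X_train has multiple columns instead of one (and I suspect that having more than one column may produce a problematic numpy.ndarray for each row). But I would have expected the ColumnSelector in make_pipeline to "take care of this".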
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from mlxtend.feature_selection import ColumnSelector
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from mlxtend.classifier import StackingClassifier
# creating my toy trainset and testset
start = [
    ['apple this is painful Two wrongs make a right ok',
     'just a batch of suspicious words and banana',
     'another batch of fake words and another apple'],
    ['Fortune favors the italic sunny sunshine',
     'name of a company and then its description',
     'is it all sunshine or doomed to fail to no sunshine'],
    ['this was it when in rome do as the romans and make fortune',
     'well again the same thing and those descriptions',
     'lets make that work and bring the fortune'],
    ['Ok this is the last one and then its the end',
     'is it the beggining of the end or the end of the beggining',
     'allelouia']
]
X_train = pd.DataFrame(
    start, columns=['High_level', 'Mid_level', 'Low_level'])
y_train = ['A', 'B', 'C', 'D']
X_test = pd.DataFrame([['mostly apple'], ['bunch of apple'],
                       ['lot of fortune'], ['make fortune and bring the'],
                       ['beggining of the end']])
y_true = ['A', 'A', 'C', 'C', 'D']
The error is raised by the last line of the following code:
pipe_1 = make_pipeline(ColumnSelector(cols=(1,)), TfidfVectorizer(min_df=1),
                       LogisticRegression(multi_class='multinomial'))
pipe_2 = make_pipeline(ColumnSelector(cols=(2,)), TfidfVectorizer(min_df=1),
                       LogisticRegression(multi_class='multinomial'))
sclf = StackingClassifier(
    classifiers=[pipe_1, pipe_2],
    meta_classifier=LogisticRegression(
        solver='lbfgs', multi_class='multinomial',
        C=1.0, class_weight='balanced', tol=1e-6, max_iter=1000,
        n_jobs=-1))
predictions = sclf.fit(X_train, y_train).predict(X_test)
Here is the full error:
Traceback (most recent call last):
File "C:/Users/inf10926/PycharmProjects/profiling/venv/lab.py", line 52, in <module>
predictions = sclf.fit(X_train, y_train).predict(X_test)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\mlxtend\classifier\stacking_classification.py", line 161, in fit
clf.fit(X, y)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\pipeline.py", line 352, in fit
Xt, fit_params = self._fit(X, y, **fit_params)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\pipeline.py", line 317, in _fit
**fit_params_steps[name])
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\joblib\memory.py", line 355, in __call__
return self.func(*args, **kwargs)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\pipeline.py", line 716, in _fit_transform_one
res = transformer.fit_transform(X, y, **fit_params)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 1652, in fit_transform
X = super().fit_transform(raw_documents)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 1058, in fit_transform
self.fixed_vocabulary_)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 970, in _count_vocab
for feature in analyze(doc):
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 352, in <lambda>
tokenize(preprocess(self.decode(doc))), stop_words)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 256, in <lambda>
return lambda x: strip_accents(x.lower())
AttributeError: 'numpy.ndarray' object has no attribute 'lower'
Process finished with exit code 1
If I set lowercase=False in TfidfVectorizer, I get a different error:
Traceback (most recent call last):
File "C:/Users/inf10926/PycharmProjects/profiling/venv/lab.py", line 52, in <module>
predictions = sclf.fit(X_train, y_train).predict(X_test)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\mlxtend\classifier\stacking_classification.py", line 161, in fit
clf.fit(X, y)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\pipeline.py", line 352, in fit
Xt, fit_params = self._fit(X, y, **fit_params)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\pipeline.py", line 317, in _fit
**fit_params_steps[name])
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\joblib\memory.py", line 355, in __call__
return self.func(*args, **kwargs)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\pipeline.py", line 716, in _fit_transform_one
res = transformer.fit_transform(X, y, **fit_params)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 1652, in fit_transform
X = super().fit_transform(raw_documents)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 1058, in fit_transform
self.fixed_vocabulary_)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 970, in _count_vocab
for feature in analyze(doc):
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 352, in <lambda>
tokenize(preprocess(self.decode(doc))), stop_words)
File "C:\Users\inf10926\PycharmProjects\profiling\venv\lib\site-packages\sklearn\feature_extraction\text.py", line 265, in <lambda>
return lambda doc: token_pattern.findall(doc)
TypeError: cannot use a string pattern on a bytes-like object
Answer 0 (score: 0)
I ran into the same problem. I solved it by adding drop_axis=True to ColumnSelector. This parameter is needed when you select only a single column.
Please refer to the API here: http://rasbt.github.io/mlxtend/user_guide/feature_selection/ColumnSelector/#api
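To illustrate why dropping the axis matters, here is a minimal NumPy-only sketch. It does not call mlxtend; the helper select_columns is a hypothetical stand-in that only mimics the selection step, and the two-row array is made-up toy data. Without dropping the axis, a single selected column is still a 2-D array of shape (n, 1), so each "document" a vectorizer iterates over is itself a one-element array with no lower() method. Flattening to shape (n,) yields plain strings.

```python
import numpy as np


def select_columns(X, cols, drop_axis=False):
    """Hypothetical stand-in for a column selector (illustration only)."""
    t = X[:, list(cols)]           # selecting with a list keeps the result 2-D
    if drop_axis and t.shape[-1] == 1:
        t = t.reshape(-1)          # flatten a single column to 1-D
    return t


# Toy data (assumed): two rows, two text columns.
X = np.array([
    ["apple this is painful", "just a batch of words"],
    ["fortune favors the sunny", "name of a company"],
], dtype=object)

two_d = select_columns(X, cols=(1,))                  # shape (2, 1)
one_d = select_columns(X, cols=(1,), drop_axis=True)  # shape (2,)

# What a text vectorizer would see as its first "document" in each case.
first_2d = next(iter(two_d))   # a 1-element ndarray, no .lower()
first_1d = next(iter(one_d))   # a plain str

print(two_d.shape, type(first_2d).__name__, hasattr(first_2d, "lower"))
print(one_d.shape, type(first_1d).__name__, hasattr(first_1d, "lower"))
```

In the pipelines from the question, the corresponding change would be ColumnSelector(cols=(1,), drop_axis=True) in pipe_1 and ColumnSelector(cols=(2,), drop_axis=True) in pipe_2.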