A single pipeline that fits both text and categorical features

Time: 2019-09-10 09:36:54

Tags: python python-3.x machine-learning scikit-learn

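I am trying to find a way to use a single pipeline that transforms both text features and categorical features and then fits a classifier on them.

The working example below (simplified for readability) is the approach I currently use.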

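I have to split the work into 3 mini pipelines or variables:

  1. The first encodes the categorical feature,
  2. The second applies TfidfVectorizer to the raw_text feature,
  3. The third fits the classifier on the merged data (after combining the two features with hstack).

import pandas as pd
import scipy.sparse
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression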

raw_text_tr = ["kjndn ndoabn mbba odb ob b dboa \n onbn abf  ppfjpfap",
            "ùodnaionf àjùfnàehna nbn obeùfoenen",
            "ùodnaionf àjùfnàehna nbn obeùfoenen dfa e g aze",
            "fjp ,fj)jea ghàhàhà àhàtgjjaz çujàh e ghghàugàh çàéhg \n\n\n\n oddn duhodd"]
categorie_tr = ["cat1","cat2","cat2","cat4"]
target_tr = ["no","no","no","yes"]

raw_text_te = ["ldkdl jaoldldj doizd test yes ok manufajddk p",
            "\n\n\n dopj pdjj pdjaj ada  ohdha hdçh dmamad ldidl h dohdodz"]
categorie_te = ["cat3","cat5"]

train_df = pd.DataFrame(data=list(zip(raw_text_tr, categorie_tr, target_tr)),columns=["raw_text_ft","categorical_ft","target"])
test_df = pd.DataFrame(data=list(zip(raw_text_te, categorie_te)),columns=["raw_text_ft","categorical_ft"])
print(train_df)
#                                          raw_text_ft categorical_ft target
# 0  kjndn ndoabn mbba odb ob b dboa \n onbn abf  p...           cat1     no
# 1                ùodnaionf àjùfnàehna nbn obeùfoenen           cat2     no
# 2    ùodnaionf àjùfnàehna nbn obeùfoenen dfa e g aze           cat2     no
# 3  fjp ,fj)jea ghàhàhà àhàtgjjaz çujàh e ghghàugà...           cat4    yes

print(test_df)
#                                          raw_text_ft categorical_ft
# 0      ldkdl jaoldldj doizd test yes ok manufajddk p           cat3
# 1  \n\n\n dopj pdjj pdjaj ada  ohdha hdçh dmamad ...           cat5

pipeline_tfidf = Pipeline([("tfidf",TfidfVectorizer())])
pipeline_enc = Pipeline([("enc",OneHotEncoder(handle_unknown="ignore"))])
pipeline_clf = Pipeline([("clf",LogisticRegression())])

A_tr = pipeline_tfidf.fit_transform(train_df["raw_text_ft"])
B_tr = pipeline_enc.fit_transform(train_df["categorical_ft"].values.reshape(-1,1))
X_train = scipy.sparse.hstack([A_tr,B_tr])

A_te = pipeline_tfidf.transform(test_df["raw_text_ft"])
B_te = pipeline_enc.transform(test_df["categorical_ft"].values.reshape(-1,1))
X_test = scipy.sparse.hstack([A_te,B_te])

pipeline_clf.fit(X_train, train_df["target"])

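Is there a cleaner way to put all of these steps into a single pipeline?

Below is the pipeline I have in mind, but it does not currently work. I am using FeatureUnion to combine the two transformed features before classification.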

pipeline_tot = Pipeline([
  ('features', FeatureUnion([
    ('tfidf', TfidfVectorizer()),
    ('enc', OneHotEncoder(handle_unknown="ignore"))
  ])),
  ('clf', LogisticRegression())
])

The hardest part is how to split the text and categorical features when fitting the pipeline (I can only pass a single element to pipeline_tot.fit()).

1 answer:

Answer 0 (score: 1)

FeatureUnion applies every transformer to the entire feature set, whereas ColumnTransformer applies each transformer only to the subset of features you specify:

>>> from sklearn.compose import ColumnTransformer
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.preprocessing import OneHotEncoder
>>> preprocessor = ColumnTransformer(
...     transformers=[
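...         ('text', TfidfVectorizer(), 'raw_text_ft'),  # pass the column name as a plain string (not a list) so TfidfVectorizer receives a 1-D column
...         ('category', OneHotEncoder(handle_unknown='ignore'), ['categorical_ft']),  # ignore categories unseen during fit, as in the question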
...     ],
... )
>>> pipe = Pipeline(
...     steps=[
...         ('preprocessor', preprocessor),
...         ('classifier', LogisticRegression()),
...     ],
... )
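
As a rough sketch of how such a pipeline might then be fitted and used for prediction: the example below rebuilds the same ColumnTransformer setup on small toy frames shaped like the ones in the question. The text values are placeholders, `feature_cols` is a name introduced here for illustration, and `handle_unknown="ignore"` is carried over from the question's encoder on the assumption that the test set can contain categories not seen during training.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data with the same column names as the question (values are placeholders).
train_df = pd.DataFrame({
    "raw_text_ft": ["red apple pie", "green apple tart",
                    "green pear tart", "blue cheese plate"],
    "categorical_ft": ["cat1", "cat2", "cat2", "cat4"],
    "target": ["no", "no", "no", "yes"],
})
test_df = pd.DataFrame({
    "raw_text_ft": ["apple cheese plate", "unseen words only"],
    "categorical_ft": ["cat3", "cat5"],  # categories never seen in training
})

preprocessor = ColumnTransformer(
    transformers=[
        # A plain string selects a 1-D column, which TfidfVectorizer expects.
        ("text", TfidfVectorizer(), "raw_text_ft"),
        # A list selects a 2-D frame, which OneHotEncoder expects.
        ("category", OneHotEncoder(handle_unknown="ignore"), ["categorical_ft"]),
    ],
)
pipe = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("classifier", LogisticRegression()),
    ],
)

# Hypothetical helper name: the feature columns handed to the pipeline.
feature_cols = ["raw_text_ft", "categorical_ft"]

# The whole DataFrame goes in as one object; ColumnTransformer does the split.
pipe.fit(train_df[feature_cols], train_df["target"])
predictions = pipe.predict(test_df[feature_cols])
print(predictions)
```

Because the ColumnTransformer picks the columns by name, `fit()` and `predict()` each receive a single DataFrame, which addresses the "only one element" concern from the question without any manual hstack.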