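I am trying to find a way to use a single pipeline to transform a text feature and a categorical feature, and then fit a classifier on the result.
The working example below (simplified for readability) is the approach I currently use.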
I had to split the work into three mini-pipelines (or variables): one for the text feature, one for the categorical feature, and one for the classifier, which is fitted after the two transformed feature matrices are merged with hstack:

from sklearn.preprocessing import FunctionTransformer, OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
import pandas as pd
import scipy.sparse

# Training data: one free-text column, one categorical column, one target
raw_text_tr = ["kjndn ndoabn mbba odb ob b dboa \n onbn abf ppfjpfap",
               "ùodnaionf àjùfnàehna nbn obeùfoenen",
               "ùodnaionf àjùfnàehna nbn obeùfoenen dfa e g aze",
               "fjp ,fj)jea ghàhàhà àhàtgjjaz çujàh e ghghàugàh çàéhg \n\n\n\n oddn duhodd"]
categorie_tr = ["cat1", "cat2", "cat2", "cat4"]
target_tr = ["no", "no", "no", "yes"]

# Test data: note that its categories never appear in the training set
raw_text_te = ["ldkdl jaoldldj doizd test yes ok manufajddk p",
               "\n\n\n dopj pdjj pdjaj ada ohdha hdçh dmamad ldidl h dohdodz"]
categorie_te = ["cat3", "cat5"]

train_df = pd.DataFrame(data=list(zip(raw_text_tr, categorie_tr, target_tr)),
                        columns=["raw_text_ft", "categorical_ft", "target"])
test_df = pd.DataFrame(data=list(zip(raw_text_te, categorie_te)),
                       columns=["raw_text_ft", "categorical_ft"])

print(train_df)
#                                          raw_text_ft categorical_ft target
# 0  kjndn ndoabn mbba odb ob b dboa \n onbn abf p...           cat1     no
# 1                ùodnaionf àjùfnàehna nbn obeùfoenen           cat2     no
# 2    ùodnaionf àjùfnàehna nbn obeùfoenen dfa e g aze           cat2     no
# 3  fjp ,fj)jea ghàhàhà àhàtgjjaz çujàh e ghghàugà...           cat4    yes

print(test_df)
#                                          raw_text_ft categorical_ft
# 0      ldkdl jaoldldj doizd test yes ok manufajddk p           cat3
# 1  \n\n\n dopj pdjj pdjaj ada ohdha hdçh dmamad ...           cat5

# Three separate mini-pipelines
pipeline_tfidf = Pipeline([("tfidf", TfidfVectorizer())])
pipeline_enc = Pipeline([("enc", OneHotEncoder(handle_unknown="ignore"))])
pipeline_clf = Pipeline([("clf", LogisticRegression())])

# Transform each training feature on its own, then stack the results column-wise
A_tr = pipeline_tfidf.fit_transform(train_df["raw_text_ft"])
B_tr = pipeline_enc.fit_transform(train_df["categorical_ft"].values.reshape(-1, 1))
X_train = scipy.sparse.hstack([A_tr, B_tr])

# Apply the already-fitted transformers to the test features
A_te = pipeline_tfidf.transform(test_df["raw_text_ft"])
B_te = pipeline_enc.transform(test_df["categorical_ft"].values.reshape(-1, 1))
X_test = scipy.sparse.hstack([A_te, B_te])

pipeline_clf.fit(X_train, train_df["target"])
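Is there a cleaner way to put all of these steps into a single pipeline?
Below is the pipeline I have in mind, but it does not work as written. I am using FeatureUnion to merge the two transformed features before classification: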
pipeline_tot = Pipeline([
    ('features', FeatureUnion([
        ('tfidf', TfidfVectorizer()),
        ('enc', OneHotEncoder(handle_unknown="ignore"))
    ])),
    ('clf', LogisticRegression())
])
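The hardest part is how to route the text feature and the categorical feature to their own transformers when fitting the pipeline, since I can only pass a single input to pipeline_tot.fit().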
Answer 0 (score: 1)
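FeatureUnion applies every transformer to the entire feature set and concatenates the results, whereas ColumnTransformer applies each transformer only to the subset of columns you specify for it: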
>>> from sklearn.compose import ColumnTransformer
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.preprocessing import OneHotEncoder
>>> preprocessor = ColumnTransformer(
...     transformers=[
...         # TfidfVectorizer expects 1-D input, so pass the column name as a plain string, not a list
...         ('text', TfidfVectorizer(), 'raw_text_ft'),
...         # OneHotEncoder expects 2-D input, so wrap the column name in a list;
...         # handle_unknown="ignore" (as in the question) tolerates categories unseen during fit
...         ('category', OneHotEncoder(handle_unknown="ignore"), ['categorical_ft']),
...     ],
... )
>>> pipe = Pipeline(
...     steps=[
...         ('preprocessor', preprocessor),
...         ('classifier', LogisticRegression()),
...     ],
... )
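
For completeness, here is a minimal end-to-end sketch of how such a pipeline might be fitted and used for prediction. The toy rows are made up purely for illustration (they are not the question's data), and `feature_cols`, `predictions` and `X_train` are hypothetical names introduced here. The column names `raw_text_ft` and `categorical_ft` are the ones used in the question. The encoder is given `handle_unknown="ignore"` on the assumption that, as in the question, the test set may contain categories that were never seen during training.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up toy data, only to show the mechanics.
train_df = pd.DataFrame({
    "raw_text_ft": ["red apple pie", "green apple tart",
                    "blue steel bolt", "grey steel nut"],
    "categorical_ft": ["food", "food", "tool", "tool"],
    "target": ["no", "no", "yes", "yes"],
})
test_df = pd.DataFrame({
    "raw_text_ft": ["apple crumble", "steel washer"],
    "categorical_ft": ["food", "unseen_cat"],  # second category is absent from training
})

preprocessor = ColumnTransformer(
    transformers=[
        # A plain string selects the column as 1-D, which TfidfVectorizer needs.
        ("text", TfidfVectorizer(), "raw_text_ft"),
        # A list selects the column as 2-D, which OneHotEncoder needs.
        ("category", OneHotEncoder(handle_unknown="ignore"), ["categorical_ft"]),
    ],
)
pipe = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("classifier", LogisticRegression()),
    ],
)

# The whole feature DataFrame is passed as the single input to fit();
# ColumnTransformer routes each column to its own transformer.
feature_cols = ["raw_text_ft", "categorical_ft"]
pipe.fit(train_df[feature_cols], train_df["target"])

# The fitted transformers are reused automatically at prediction time.
predictions = pipe.predict(test_df[feature_cols])

# Optional: inspect the combined feature matrix produced by the preprocessor.
X_train = pipe.named_steps["preprocessor"].transform(train_df[feature_cols])
```

This addresses the "only one input to fit()" problem from the question: the DataFrame is the single input, and the column selection happens inside the ColumnTransformer, so no manual hstack is needed.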