How do I add another text feature to my current bag-of-words classification in scikit-learn?

Asked: 2018-05-03 22:39:35

Tags: python machine-learning scikit-learn nlp text-classification

This is my input matrix: [image]

My sample code:

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline, FeatureUnion

# Split the text column and the target labels into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data['Extract'],
    data['Expense Account code Description'],
    random_state=0)

# Bag-of-words counts -> term-frequency normalisation -> random forest
text_clf = Pipeline([
    ('vect', CountVectorizer(ngram_range=(1, 1))),
    ('tfidf', TfidfTransformer(use_idf=False)),
    ('clf', RandomForestClassifier(n_estimators=100, max_features='log2',
                                   criterion='entropy')),
])
text_clf = text_clf.fit(X_train, y_train)

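Here I apply a bag-of-words model to the "Extract" column to classify "Expense Account code Description", and I get an accuracy of about 92%. If I want to include "Vendor name" as another input feature, how can I do that? Is there a way to combine it with the bag of words?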

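1 Answer:

Answer 0: (score: 1)

You can use FeatureUnion. You will also need to create a new Transformer class that performs the operations you need, for example selecting the vendor name and converting it to dummy variables (get dummies).

The FeatureUnion will fit into your pipeline.

For reference: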

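from sklearn.base import BaseEstimator, TransformerMixin

class get_Vendor(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X, y=None):
        # build and return the vendor-name features here (e.g. dummy variables)
        return

# tfidf: a text vectorizer assumed to be defined elsewhere
lr_tfidf = Pipeline([
    ('features', FeatureUnion([('other', get_Vendor()), ('vect', tfidf)])),
    ('clf', RandomForestClassifier())])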
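
To make the skeleton above concrete, here is a minimal end-to-end sketch of the FeatureUnion approach. Several things in it are assumptions rather than part of the original post: the toy rows, the column name "Vendor name" (the question only shows it in an image), the hypothetical helper class ColumnSelector, and the use of OneHotEncoder in place of get dummies. TfidfVectorizer(use_idf=False) stands in for the CountVectorizer + TfidfTransformer(use_idf=False) pair from the question.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Hypothetical helper: pass one or more DataFrame columns downstream."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self  # stateless, nothing to learn

    def transform(self, X):
        # A string yields a 1-D Series, a list yields a 2-D DataFrame
        return X[self.columns]


# Toy data; the real column names and values are assumptions
data = pd.DataFrame({
    "Extract": ["office paper and pens", "flight to Berlin",
                "printer toner", "hotel in Paris",
                "stapler and folders", "train ticket to Munich"],
    "Vendor name": ["OfficeCo", "AirCo", "OfficeCo",
                    "HotelCo", "OfficeCo", "RailCo"],
    "Expense Account code Description": ["Office supplies", "Travel",
                                         "Office supplies", "Travel",
                                         "Office supplies", "Travel"],
})

# Each branch receives the same DataFrame and picks its own column
features = FeatureUnion([
    ("text", Pipeline([
        ("select", ColumnSelector("Extract")),         # 1-D text input
        ("tfidf", TfidfVectorizer(use_idf=False)),
    ])),
    ("vendor", Pipeline([
        ("select", ColumnSelector(["Vendor name"])),   # 2-D categorical input
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])),
])

clf = Pipeline([
    ("features", features),
    ("clf", RandomForestClassifier(n_estimators=100, max_features="log2",
                                   criterion="entropy", random_state=0)),
])

X = data[["Extract", "Vendor name"]]
y = data["Expense Account code Description"]
clf.fit(X, y)
predictions = clf.predict(X)
```

Note that the pipeline now takes a DataFrame with both columns, so train_test_split should be given data[['Extract', 'Vendor name']] rather than the "Extract" column alone. In scikit-learn 0.20 and later, sklearn.compose.ColumnTransformer covers the same per-column use case without a custom selector class.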