Text classification with logistic regression using a pipeline

Date: 2018-11-25 13:40:47

Tags: python machine-learning sklearn-pandas

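I am trying to use LogisticRegression for text classification. I combine the features of a DataFrame with FeatureUnion and then use cross_val_score to test the accuracy of the classifier. However, I don't know how to include the free-text feature, called tweets, in the pipeline. I am using TfidfVectorizer for the bag-of-words model.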

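import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder

# DataFrameSelector is not part of scikit-learn; its definition is not shown in the question.
nominal_features = ["tweeter", "job", "country"]
numeric_features = ["age"]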

numeric_pipeline = Pipeline([
    ("selector", DataFrameSelector(numeric_features))
])

nominal_pipeline = Pipeline([
    ("selector", DataFrameSelector(nominal_features)),
    ("onehot", OneHotEncoder())])  # each step must be a (name, transformer) tuple

text_pipeline = Pipeline([
    ("selector", DataFrameSelector("tweets")),    
    ("vectorizer", TfidfVectorizer(stop_words='english'))])

pipeline = Pipeline([
    ("union", FeatureUnion([
        ("numeric_pipeline", numeric_pipeline),
        ("nominal_pipeline", nominal_pipeline),
    ])),
    ("estimator", LogisticRegression()),
])

np.mean(cross_val_score(pipeline, df, y, scoring="accuracy", cv=5))

Is this the right way to include the tweets free-text data in the pipeline?

1 answer:

Answer 0: (score: 0)

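from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Bag of words -> TF-IDF weighting -> multinomial naive Bayes
pipeline = Pipeline([
    ('vect', CountVectorizer(stop_words='english', lowercase=True)),
    ('tfidf1', TfidfTransformer(use_idf=True, smooth_idf=True)),
    ('clf', MultinomialNB(alpha=1)),  # alpha=1 is Laplace smoothing
])

# Hold out 30% of the rows for testing
train, test = train_test_split(df, test_size=.3, random_state=42, shuffle=True)
pipeline.fit(train['Text'], train['Target'])

predictions = pipeline.predict(test['Text'])
print(test['Target'], predictions)

# pos_label only applies with average='binary', so it is dropped here
score = f1_score(test['Target'], predictions, average='micro')
print("Score of Naive Bayes is :", score)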
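The pipeline above only uses a single text column. To combine the tweets text with the numeric and nominal columns from the question, one possible approach is sketched below. It swaps the question's FeatureUnion and custom DataFrameSelector for scikit-learn's ColumnTransformer (available from version 0.20), so it is an alternative, not the asker's original design. The toy DataFrame and labels are invented purely for illustration, "passthrough" stands in for the question's selector-only numeric pipeline, and max_iter=1000 is an assumption to help the unscaled age column converge. The key detail is that the text column is given as a plain string, not a list, so TfidfVectorizer receives a 1-D sequence of documents. The same rule would apply with FeatureUnion: the selector feeding the vectorizer has to return a 1-D column, not a 2-D frame.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data, invented for illustration; column names follow the question.
df = pd.DataFrame({
    "tweeter": ["ann", "bob", "cat", "ann", "bob", "cat", "ann", "bob", "cat", "ann"],
    "job": ["dev", "chef", "dev", "chef", "dev", "chef", "dev", "chef", "dev", "chef"],
    "country": ["US", "UK", "US", "UK", "DE", "DE", "US", "UK", "DE", "US"],
    "age": [25, 41, 33, 52, 29, 47, 38, 60, 22, 35],
    "tweets": [
        "python pipelines make machine learning simple",
        "cooking pasta tonight with fresh tomatoes",
        "training a classifier on text features",
        "baking bread and roasting vegetables",
        "debugging code late into the night",
        "grilled fish recipe for dinner guests",
        "deploying models with python scripts",
        "kitchen smells great after baking cake",
        "writing unit tests for data code",
        "simmering soup with garlic and herbs",
    ],
})
y = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0])

preprocess = ColumnTransformer([
    # Numeric columns are passed through unchanged, as in the question.
    ("numeric", "passthrough", ["age"]),
    # handle_unknown="ignore" avoids errors on categories unseen in a CV fold.
    ("nominal", OneHotEncoder(handle_unknown="ignore"), ["tweeter", "job", "country"]),
    # A plain string (not a list) hands the vectorizer a 1-D column of documents.
    ("text", TfidfVectorizer(stop_words="english"), "tweets"),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("estimator", LogisticRegression(max_iter=1000)),
])

# The vectorizer is refit inside every fold, so no vocabulary leaks from test rows.
scores = cross_val_score(pipeline, df, y, scoring="accuracy", cv=5)
print(np.mean(scores))
```

Because all preprocessing lives inside the pipeline, cross_val_score can be called on the raw DataFrame exactly as in the question.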