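I am trying to do text classification with LogisticRegression. I use FeatureUnion to combine the features from a DataFrame, then use cross_val_score to test the accuracy of the classifier. However, I don't know how to include the free-text feature, a column called tweets, in the pipeline. I am using TfidfVectorizer for the bag-of-words model.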
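import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder

# DataFrameSelector is not part of scikit-learn; it is assumed to be a custom
# transformer, defined elsewhere, that selects the given DataFrame columns.
nominal_features = ["tweeter", "job", "country"]
numeric_features = ["age"]
numeric_pipeline = Pipeline([
    ("selector", DataFrameSelector(numeric_features))
])
nominal_pipeline = Pipeline([
    ("selector", DataFrameSelector(nominal_features)),
    ("onehot", OneHotEncoder())])
text_pipeline = Pipeline([
    ("selector", DataFrameSelector("tweets")),
    ("vectorizer", TfidfVectorizer(stop_words='english'))])
# Note: text_pipeline is defined above but is not yet part of the union.
pipeline = Pipeline([("union", FeatureUnion([("numeric_pipeline", numeric_pipeline),
                                              ("nominal_pipeline", nominal_pipeline)])),
                     ("estimator", LogisticRegression())])
np.mean(cross_val_score(pipeline, df, y, scoring="accuracy", cv=5))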
Is this the right way to include the tweets free-text data in the pipeline?
Answer 0 (score: 0)
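from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Bag of words -> TF-IDF weighting -> Naive Bayes classifier
pipeline = Pipeline([
    ('vect', CountVectorizer(stop_words='english', lowercase=True)),
    ('tfidf1', TfidfTransformer(use_idf=True, smooth_idf=True)),
    ('clf', MultinomialNB(alpha=1))  # Laplace smoothing
])
train, test = train_test_split(df, test_size=.3, random_state=42, shuffle=True)
pipeline.fit(train['Text'], train['Target'])
predictions = pipeline.predict(test['Text'])
print(test['Target'], predictions)
# pos_label is ignored unless average='binary', so it is omitted here
score = f1_score(test['Target'], predictions, average='micro')
print("Score of Naive Bayes is :", score)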
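
The snippet above only handles a single text column. To combine the tweets text with the numeric and nominal columns from the question, one option is a sketch like the following. Note that it swaps in scikit-learn's ColumnTransformer in place of FeatureUnion plus the custom DataFrameSelector, which is a substitution and not what the question uses. The toy DataFrame and labels are made up purely for illustration; only the column names come from the question. The age column is passed through unchanged to mirror the question's numeric pipeline, and handle_unknown="ignore" and max_iter=1000 are assumptions added so the small example runs cleanly under cross-validation.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; only the column names follow the question.
df = pd.DataFrame({
    "tweeter": ["a", "b", "a", "c", "b", "c", "a", "b", "c", "a"],
    "job": ["dev", "ops", "dev", "qa", "ops", "qa", "dev", "ops", "qa", "dev"],
    "country": ["US", "UK", "US", "DE", "UK", "DE", "US", "UK", "DE", "US"],
    "age": [25, 41, 33, 52, 29, 47, 36, 58, 22, 44],
    "tweets": [
        "great product love it",
        "terrible service never again",
        "love the new update",
        "awful experience very slow",
        "fantastic support team",
        "bad quality broke quickly",
        "happy with my purchase",
        "worst app crashes constantly",
        "excellent value recommended",
        "horrible delivery arrived late",
    ],
})
y = np.array([1, 0, 1, 0, 1, 0, 1, 0, 1, 0])

preprocess = ColumnTransformer([
    # Numeric column is passed through unchanged.
    ("numeric", "passthrough", ["age"]),
    # One-hot encode nominal columns; ignore categories unseen during fit.
    ("nominal", OneHotEncoder(handle_unknown="ignore"),
     ["tweeter", "job", "country"]),
    # A plain string (not a list) hands TfidfVectorizer a 1-D column of
    # documents, which is the input shape it expects.
    ("text", TfidfVectorizer(stop_words="english"), "tweets"),
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("estimator", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipeline, df, y, scoring="accuracy", cv=5)
mean_accuracy = np.mean(scores)
print("Mean accuracy:", mean_accuracy)
```

The key detail is selecting the text column with a scalar name: passing ["tweets"] as a list would give the vectorizer a 2-D frame, which it does not handle as a collection of documents. The same idea applies to the FeatureUnion approach in the question, where text_pipeline would need to be added to the union and its selector would need to return a 1-D column.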