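I am trying to implement a text classification solution with scikit-learn.
I have been able to get results for simple text classification. Now I want to add another (non-text) feature to the prediction process to improve accuracy.
My dataset is listed at the end of this post.
Code: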
import pandas as pd
import sklearn
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfTransformer
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import LinearSVC
path = 'sunny_day.xlsx'
sms = pd.read_excel(path,header = None, names = ['label', 'message','number_feature'])
#convert labels to a numeric value using a map and give it new column 'label_num'
sms['label_num'] = sms.label.map({'greeting' : 0, 'Goodbye' : 1, 'Sandwich' : 2})
X = sms.message
y = sms.label_num
z = sms.number_feature
# train test split
X_train = np.array(X[0:9])
X_test = np.array(X[9:])
y_train = np.array(y[0:9])
y_test = np.array(y[9:])
z_train = np.array(z[0:9])
z_test = np.array(z[9:])
# Return the numeric feature column that matches whichever text split is passed in
def get_z(x):
    if np.array_equal(x, np.array(X_train)):
        return np.array(z_train).reshape(-1,1)
    else:
        return np.array(z_test).reshape(-1,1)
classifier = Pipeline([
    ('features', FeatureUnion([
        ('text', Pipeline([
            ('vectorizer', CountVectorizer()),
        ])),
        ('length', Pipeline([
            ('count', FunctionTransformer(get_z, validate = False)),
        ]))
    ])),
    ('clf', OneVsRestClassifier(LinearSVC()))])
classifier.fit(X_train, y_train)
y_pred_class = classifier.predict(X_test)
y_pred_class
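As suggested in various posts, I used FeatureUnion to achieve this. However, the accuracy I get is 66.67% both before and after adding the 'manipulated' number_feature column.
Why does the accuracy not seem to improve when the classifier is given such a biased feature?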
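For comparison, here is a minimal sketch of another way to wire the numeric column in, without a lookup function that depends on the global X_train. It is not the code from this post: it assumes scikit-learn 0.20 or later (for ColumnTransformer in sklearn.compose), builds the DataFrame in memory from the dataset rows instead of reading sunny_day.xlsx, and swaps the positional split for a stratified train_test_split so that every label appears in both sets. The step names "text" and "number" are illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Rows copied from the dataset in this post (assumption: no header row in the file)
rows = [
    ("greeting", "How are you?", 5),
    ("greeting", "How is your day?", 5),
    ("greeting", "Good day", 5),
    ("greeting", "How is it going today?", 5),
    ("Goodbye", "Have a nice day", 4),
    ("Goodbye", "See you later", 4),
    ("Goodbye", "Have a nice day", 4),
    ("Goodbye", "Talk to you soon", 4),
    ("Sandwich", "Make me a sandwich.", 2),
    ("Sandwich", "Can you make a sandwich", 2),
    ("Sandwich", "Having a sandwich today?", 2),
    ("Sandwich", "what’s for lunch", 2),
]
sms = pd.DataFrame(rows, columns=["label", "message", "number_feature"])

# A string selector hands CountVectorizer a 1-D column of text;
# a list selector passes the numeric column through as a 2-D block.
features = ColumnTransformer([
    ("text", CountVectorizer(), "message"),
    ("number", "passthrough", ["number_feature"]),
])

classifier = Pipeline([
    ("features", features),
    ("clf", LinearSVC()),
])

# Stratified split: 9 training rows and 3 test rows, one test row per label
X_train, X_test, y_train, y_test = train_test_split(
    sms[["message", "number_feature"]],
    sms["label"],
    test_size=0.25,
    stratify=sms["label"],
    random_state=0,
)

classifier.fit(X_train, y_train)
accuracy = classifier.score(X_test, y_test)
print(accuracy)
```

Because the text and the number travel together in one DataFrame, the same pipeline can be fitted and scored on any subset of rows, which the get_z approach cannot do.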
Dataset:
label    | message                  | feature_number
greeting | How are you?             | 5
greeting | How is your day?         | 5
greeting | Good day                 | 5
greeting | How is it going today?   | 5
Goodbye  | Have a nice day          | 4
Goodbye  | See you later            | 4
Goodbye  | Have a nice day          | 4
Goodbye  | Talk to you soon         | 4
Sandwich | Make me a sandwich.      | 2
Sandwich | Can you make a sandwich  | 2
Sandwich | Having a sandwich today? | 2
Sandwich | what’s for lunch         | 2
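
To see how the positional split in the code (rows 0 to 8 for training, the rest for testing) divides these rows, a short check can count the labels on each side. This is only a sketch: it assumes the spreadsheet rows are in the order shown above with no header row, and the names train_counts and test_counts are illustrative.

```python
from collections import Counter

# Labels in the row order of the dataset table
labels = ["greeting"] * 4 + ["Goodbye"] * 4 + ["Sandwich"] * 4

train_counts = Counter(labels[0:9])  # same slice as X[0:9]
test_counts = Counter(labels[9:])    # same slice as X[9:]

print(dict(train_counts))  # {'greeting': 4, 'Goodbye': 4, 'Sandwich': 1}
print(dict(test_counts))   # {'Sandwich': 3}
```

With only three test rows, accuracy can take just four values (0%, 33.33%, 66.67%, 100%), so 66.67% means two of the three test rows were classified correctly.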