Machine learning: how to combine multiple features into one predictive model when working with text data

Time: 2019-09-06 03:42:37

Tags: python-2.7 machine-learning scikit-learn text-classification

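I am trying to classify tokens as medical terms or non-medical terms. I can extract each feature set individually (for example, count vectors and tf-idf vectors at the word level, n-gram level, and character level) and produce a classification report for each one. However, I don't know how to combine these features so that a single model produces a single report. Any help would be appreciated.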

Part of my data looks like this:

Label:        Tokens:
term         subdural
term         hematoma
term         non-insulin-dependent
term         diabetes
term         mellitus
non_term     returning
non_term     grasp
non_term     farthest
non_term     boil
non_term     milk


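# Imports are assumed; the original snippet did not show them
from __future__ import print_function

import pandas as pd
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

path = 'data/data.csv'
dataset = pd.read_csv(path, header=None, names=['lable', 'tokens'])

# Encode the string labels as integers for the classifier
dataset['lable_num'] = dataset.lable.map({'term':0, 'non_term':1})

X = dataset.tokens
y = dataset.lable_num
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Word-level count features
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train.astype(str))
X_test_dtm = vect.transform(X_test.astype(str))

# Character-level tf-idf features (1- to 3-grams), currently disabled
#tfidf_vect = TfidfVectorizer(analyzer='char', token_pattern=r'\w{1,}', ngram_range=(1, 3))
#X_train_tfidf_3gram = tfidf_vect.fit_transform(X_train.astype(str))
#X_test_tfidf_3gram = tfidf_vect.transform(X_test.astype(str))

nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)

print('confusion matrix:\n', metrics.confusion_matrix(y_test, y_pred_class))
print('')
print('classification_report:\n', metrics.classification_report(y_test, y_pred_class))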
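
One possible way to combine the separate feature sets into a single model is scikit-learn's FeatureUnion, which fits several vectorizers on the same input and stacks their sparse outputs column-wise. The following is only a minimal sketch, not a tested solution for this dataset. The token list is copied from the sample rows above, the step names ('word_counts', 'char_tfidf', 'features', 'nb') are illustrative choices, and the report is computed on the training tokens purely to show the API, so its numbers say nothing about real performance.

```python
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline

# Toy data copied from the sample rows; a real run would load the full CSV.
tokens = ['subdural', 'hematoma', 'non-insulin-dependent', 'diabetes',
          'mellitus', 'returning', 'grasp', 'farthest', 'boil', 'milk']
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]  # 0 = term, 1 = non_term

# Each vectorizer builds its own sparse matrix from the same tokens;
# FeatureUnion concatenates those matrices side by side.
combined_features = FeatureUnion([
    ('word_counts', CountVectorizer()),
    ('char_tfidf', TfidfVectorizer(analyzer='char', ngram_range=(1, 3))),
])

# One pipeline, one classifier, one report for all feature sets together.
model = Pipeline([
    ('features', combined_features),
    ('nb', MultinomialNB()),
])

model.fit(tokens, labels)

# Predicting on the training tokens only demonstrates the calls;
# use a held-out split (e.g. X_test) for a meaningful evaluation.
predictions = model.predict(tokens)
report = metrics.classification_report(labels, predictions)
print(report)
```

With the variables from the question, the same pipeline would presumably be fitted with model.fit(X_train.astype(str), y_train) and evaluated with model.predict(X_test.astype(str)). If the matrices have already been computed separately, scipy.sparse.hstack can join them in the same column-wise fashion.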

0 answers:

No answers yet