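I am trying to classify medical terms versus non-medical terms. I can extract each feature type on its own (for example, word-level, n-gram-level, and character-level count vectors and TF-IDF vectors) and produce a classification report for it. However, I don't know how to combine these features to produce a single report. Any help would be greatly appreciated.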
Part of my data looks like this:
Label: Tokens:
term subdural
term hematoma
term non-insulin-dependent
term diabetes
term mellitus
non_term returning
non_term grasp
non_term farthest
non_term boil
non_term milk
import pandas as pd
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

path = 'data/data.csv'
dataset = pd.read_csv(path, header=None, names=['lable', 'tokens'])
# Encode the labels as integers: term -> 0, non_term -> 1
dataset['lable_num'] = dataset.lable.map({'term': 0, 'non_term': 1})
X = dataset.tokens
y = dataset.lable_num
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Word-level count vectors
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train.astype(str))
X_test_dtm = vect.transform(X_test.astype(str))

# Character-level TF-IDF vectors (1- to 3-grams), currently disabled
#tfidf_vect = TfidfVectorizer(analyzer='char', token_pattern=r'\w{1,}', ngram_range=(1, 3))
#X_train_tfidf_3gram = tfidf_vect.fit_transform(X_train.astype(str))
#X_test_tfidf_3gram = tfidf_vect.transform(X_test.astype(str))

nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)

print('confusion matrix:\n', metrics.confusion_matrix(y_test, y_pred_class))
print('')
print('classification_report:\n', metrics.classification_report(y_test, y_pred_class))
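
For reference, here is a minimal sketch of one way the separate feature sets could be combined into a single report, assuming scikit-learn's FeatureUnion suits this setup. The step names (word_counts, char_tfidf, features, nb) and the inline toy data are illustrative assumptions, not part of my actual script; the toy rows are copied from the sample above.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline

# Toy data copied from the sample rows; real data would come from data/data.csv.
tokens = ['subdural', 'hematoma', 'non-insulin-dependent', 'diabetes', 'mellitus',
          'returning', 'grasp', 'farthest', 'boil', 'milk']
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]  # 0 = term, 1 = non_term

# Each vectorizer is fitted on the same input; FeatureUnion stacks their
# sparse outputs column-wise into one feature matrix.
combined_features = FeatureUnion([
    ('word_counts', CountVectorizer()),
    ('char_tfidf', TfidfVectorizer(analyzer='char', ngram_range=(1, 3))),
])

# The pipeline feeds the combined matrix to a single classifier.
model = Pipeline([
    ('features', combined_features),
    ('nb', MultinomialNB()),
])

model.fit(tokens, labels)
# Predicting on the training tokens is for illustration only; a held-out
# test split should be used for a meaningful report.
predictions = model.predict(tokens)
report = classification_report(labels, predictions)
print(report)
```

With the real data, the same pipeline would presumably be fitted on X_train.astype(str) and evaluated on X_test.astype(str), so that one classification report covers all feature types at once. Stacking the matrices by hand with scipy.sparse.hstack would be an alternative to FeatureUnion.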