Python 3 text labels

Time: 2017-02-28 10:49:01

Tags: python-3.x scikit-learn classification

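I don't know where to start with this problem, because I'm only now learning about neural networks. I have a large database of sentence > label pairs. For example: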

i want take a photo < photo
i go to take a photo < photo
i go to use my camera < photo
i go to eat something < eat
i like my food < eat

If a user writes a new sentence, I want to see the score for every label:

"i go to bed after using my camera" < photo: 0.9000, eat: 0.4000, ...

So my question is: where can I start? TensorFlow and scikit-learn look good, but their document classification examples don't show a score for each label :\

1 answer:

Answer 0: (score: 1)

from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn import metrics

sentences = ["i want take a photo", "i go to take a photo", "i go to use my camera", "i go to eat something", "i like my food"]

labels = ["photo", "photo", "photo", "eat", "eat"]

tfv = TfidfVectorizer()

# Fit TF-IDF on the sentences, then turn them into a sparse feature matrix
tfv.fit(sentences)
X = tfv.transform(sentences)

# Encode the string labels as integers (eat -> 0, photo -> 1)
lbl = LabelEncoder()
y = lbl.fit_transform(labels)

# Hold out part of the data, keeping the class proportions in both parts
xtrain, xtest, ytrain, ytest = train_test_split(X, y, stratify=y, random_state=42)

clf = LogisticRegression()
clf.fit(xtrain, ytrain)
predictions = clf.predict(xtest)

print("Accuracy Score = ", metrics.accuracy_score(ytest, predictions))
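With only five sentences, the default split leaves just two sentences for testing, so the accuracy from a single split is very coarse. One possible alternative, sketched here as an assumption rather than as part of the answer above, is to average over stratified folds with cross_val_score. The pipeline and the choice of cv=2 are illustrative (the smaller class, "eat", has only two examples, so two folds is the most this toy data allows).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

sentences = ["i want take a photo", "i go to take a photo", "i go to use my camera",
             "i go to eat something", "i like my food"]
labels = ["photo", "photo", "photo", "eat", "eat"]

# Putting TF-IDF inside the pipeline means it is refit on each training fold,
# so the held-out fold never influences the vocabulary or the IDF weights.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())

# For a classifier, an integer cv uses stratified folds; cv=2 is an
# assumption forced by the tiny "eat" class in this toy data set.
scores = cross_val_score(model, sentences, labels, cv=2, scoring="accuracy")

print("Accuracy per fold:", scores)
print("Mean accuracy:", scores.mean())
```

On a real database with many sentences per label, a larger number of folds (5 or 10 are common choices) would give a steadier estimate.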

For new data:

new_sentence = ["this is a new sentence"]
X_Test = tfv.transform(new_sentence)
# One row per sentence; the columns follow lbl.classes_ (eat, photo)
print(clf.predict_proba(X_Test))
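predict_proba returns a bare array, while the question asks for output of the form "photo: ..., eat: ...". A minimal sketch of pairing each probability with its label name follows. The helper name label_scores and the example sentence are hypothetical, and the model is trained on the string labels directly so that clf.classes_ already holds the names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

sentences = ["i want take a photo", "i go to take a photo", "i go to use my camera",
             "i go to eat something", "i like my food"]
labels = ["photo", "photo", "photo", "eat", "eat"]

tfv = TfidfVectorizer()
X = tfv.fit_transform(sentences)

# LogisticRegression accepts string labels; clf.classes_ keeps them in sorted order
clf = LogisticRegression()
clf.fit(X, labels)

def label_scores(sentence):
    """Hypothetical helper: map each label name to its predicted probability."""
    probs = clf.predict_proba(tfv.transform([sentence]))[0]
    return dict(zip(clf.classes_, probs))

scores = label_scores("i go to sleep after i use my camera")

# Print the labels from most to least likely
for label, prob in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print("%s: %.4f" % (label, prob))
```

Note that these probabilities sum to 1 across labels, so they would not look like the 0.9 / 0.4 pair in the question. If independent per-label scores are wanted, that is a multi-label setup and would need a different model configuration.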