I followed this site to apply the Naive Bayes algorithm to my dataset. Here the dataset is split across two files: review.txt and label.txt. I used the `train_test_split` function to split it.
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import confusion_matrix
with open("/Users/abc/review.txt") as f:
    reviews = f.read().split("\n")
with open("/Users/abc/label.txt") as f:
    labels = f.read().split("\n")
reviews_tokens = [review.split() for review in reviews]
onehot_enc = MultiLabelBinarizer()
onehot_enc.fit(reviews_tokens)
X_train, X_test, y_train, y_test = train_test_split(reviews_tokens, labels, test_size=0.20, random_state=1)
bnbc = BernoulliNB(binarize=None)
bnbc.fit(onehot_enc.transform(X_train), y_train)
score = bnbc.score(onehot_enc.transform(X_test), y_test)
print("score of Naive Bayes algo is :" , score)
predicted_y = bnbc.predict(onehot_enc.transform(X_test))
tn, fp, fn, tp = confusion_matrix(y_test, predicted_y).ravel()
precision_score = tp / (tp + fp)
recall_score = tp / (tp + fn)
print("precision_score :" , precision_score)
print("recall_score :" , recall_score)
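Now, however, my requirement is to keep the dataset in a single file per split, with `review,label` columns, and to supply the training and test data separately by hand. I changed the code accordingly.
However, I am unable to use `onehot_enc` here. It raises an error because the reviews returned by the `load_data` function are a list of words.
Can anyone suggest how to implement the code with `onehot_enc` for my dataset?
Samples of the two CSV files and the code I used are shown here: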
review,label
Colors & clarity is superb,positive
Sadly the picture is not nearly as clear or bright as my 40 inch Samsung,negative
review,label
The picture is clear and beautiful,positive
Picture is not clear,negative
from sklearn.metrics import confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
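def load_data(filename):
    reviews = list()
    labels = list()
    with open(filename) as file:
        file.readline()  # skip the "review,label" header row
        for line in file:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            # split on the last comma only, so commas inside a review are kept
            review, label = line.rsplit(',', 1)
            labels.append(label)
            reviews.append(review)
    return reviews, labels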
X_train, y_train = load_data('/Users/abc/train_data.csv')
X_test, y_test = load_data('/Users/abc/test_data.csv')
Answer 0 (score: 0)
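If I understand correctly, what you want is to tokenize your reviews so that you can use Naive Bayes. One-hot encoding is meant for labels or categorical data.
You should use 0 and 1 for the labels (instead of positive and negative), but not for the reviews.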
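As a minimal sketch of the label encoding described above: the sample labels and the choice of 1 for "positive" and 0 for "negative" are assumptions made for illustration, not something fixed by scikit-learn.

```python
# Hypothetical sample labels, in the same form as the label column of the CSV files.
labels = ["positive", "negative", "positive"]

# Assumed mapping: "positive" -> 1, "negative" -> 0.
label_to_int = {"negative": 0, "positive": 1}
y = [label_to_int[label] for label in labels]

print(y)
```

A dictionary lookup like this raises a `KeyError` on any unexpected label, which can help catch malformed rows early.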
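For your text, sklearn has built-in functionality for tokenization; `CountVectorizer` would usually be a good fit here.
I suggest taking a look at the following link, which explains in detail how to work with text.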
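A rough sketch of how the `CountVectorizer` suggestion could look with the `MultinomialNB` import already present in the question. The rows are the sample reviews from the question; which rows belong to the training file and which to the test file, and the 0/1 label encoding, are assumptions made here for illustration. In real use the lists would come from `load_data`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import precision_score, recall_score
from sklearn.naive_bayes import MultinomialNB

# Sample rows from the question; the train/test assignment is assumed.
X_train = [
    "Colors & clarity is superb",
    "Sadly the picture is not nearly as clear or bright as my 40 inch Samsung",
]
y_train = [1, 0]  # assumed encoding: 1 = positive, 0 = negative
X_test = ["The picture is clear and beautiful", "Picture is not clear"]
y_test = [1, 0]

# Learn the vocabulary on the training reviews only,
# then reuse the same vocabulary for the test reviews.
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts = vectorizer.transform(X_test)

clf = MultinomialNB()
clf.fit(X_train_counts, y_train)
predicted_y = clf.predict(X_test_counts)

print("score of Naive Bayes algo is :", clf.score(X_test_counts, y_test))
print("precision_score :", precision_score(y_test, predicted_y, zero_division=0))
print("recall_score :", recall_score(y_test, predicted_y, zero_division=0))
```

`CountVectorizer` takes the raw review strings directly, so no manual `split()` is needed. To stay closer to the original one-hot setup, `CountVectorizer(binary=True)` combined with `BernoulliNB` should behave similarly, since it records only whether a word is present.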