I am using NLTK's Naive Bayes classifier for a classification task. I load a TSV file containing the records and their labels, but training fails with an error. Here is my Python code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('tweets.txt', delimiter ='\t', quoting = 3)
dataset.isnull().any()
dataset = dataset.fillna(method='ffill')
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
corpus = []
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))
for i in range(0, 16004):
    # Keep letters only, lowercase, drop stopwords, and stem each token
    tweet = re.sub('[^a-zA-Z]', ' ', dataset['tweet'][i])
    tweet = tweet.lower()
    tweet = tweet.split()
    tweet = [ps.stem(word) for word in tweet if word not in stop_words]
    tweet = ' '.join(tweet)
    corpus.append(tweet)
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 10000)
X = cv.fit_transform(corpus).toarray()
y = dataset.iloc[:, 1].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
train_set, test_set = X_train[500:], y_train[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)
The error is:
File "C:\Users\HSR\Anaconda2\lib\site-packages\nltk\classify\naivebayes.py", line 194, in train
for featureset, label in labeled_featuresets:
ValueError: too many values to unpack
Answer:
NLTK's classifier does not work like scikit-learn estimators. It expects the features X and the labels y packed together into a single structure that is passed to train(). In your code you only give it X_train, so train() tries to unpack a label out of each feature row, which is what raises the error.
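To see why the message is "too many values to unpack", here is a minimal toy reproduction (an assumed illustration only, not NLTK's actual source): train() loops over its argument expecting (featureset, label) pairs, and a bare row of counts cannot be unpacked into two names.

rows = [[0, 1, 0, 2], [1, 0, 0, 0]]       # bare feature rows, like the rows of X_train
try:
    for featureset, label in rows:        # the same (featureset, label) unpacking train() performs
        pass
except ValueError as e:
    print(e)                              # -> too many values to unpack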
NaiveBayesClassifier requires the input to be a list of tuples, where the list represents the training samples and each tuple contains a feature dictionary and the label. Something like this:
X = [({'feature1': 'val11', 'feature2': 'val12', ...}, 'class1'),
     ({'feature1': 'val21', 'feature2': 'val22', ...}, 'class2'),
     ...
    ]
You need to convert your input into this format:
feature_names = cv.get_feature_names()
train_set = []
for i, single_sample in enumerate(X):
    # Map each count in the row back to its vocabulary word
    single_feature_dict = {}
    for j, single_feature in enumerate(single_sample):
        single_feature_dict[feature_names[j]] = single_feature
    # Pair the feature dictionary with its label, as NLTK expects
    train_set.append((single_feature_dict, y[i]))
Note: the for loop above can be shortened with a dict comprehension, but I am not that fluent with them.
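For reference, one possible dict-comprehension version of the same conversion (a sketch, assuming the same cv, X and y as above):

feature_names = cv.get_feature_names()
train_set = [
    ({feature_names[j]: count for j, count in enumerate(single_sample)}, y[i])
    for i, single_sample in enumerate(X)
]
# dict(zip(feature_names, single_sample)) would build the same feature dictionary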
Then you can do:
classifier = nltk.NaiveBayesClassifier.train(train_set)
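If you also want a score on the held-out split, the test rows need the same dictionary conversion first. A sketch, assuming X_test and y_test from the train_test_split above (note that for a fair number, train_set should then be built from X_train/y_train rather than the full X/y):

test_set = [
    ({feature_names[j]: count for j, count in enumerate(single_sample)}, y_test[i])
    for i, single_sample in enumerate(X_test)
]
print(nltk.classify.accuracy(classifier, test_set))   # fraction of test tweets labelled correctly
classifier.show_most_informative_features(10)         # features that most strongly separate the labels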