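I have a number of statements about whether people like a certain product. I have anonymized the reviews and changed the names.
Since the classification problem is very similar to the movie review example, I followed the tutorial at http://www.nltk.org/book/ch06.html (Section 1.3) closely.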
import random

import nltk
from nltk.corpus import movie_reviews

# Each document is a (list of tokens, category) pair, one per review file.
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

# Feature vocabulary: the 2000 most frequent words in the corpus.
# most_common() is used because list(all_words)[:2000] is not
# frequency-ordered in NLTK 3.
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = [w for w, _ in all_words.most_common(2000)]

def document_features(document):
    document_words = set(document)  # set membership is faster than list membership
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
So far so good.
My concern is that my own data is organized as follows.
sex;statement;income_dollar_year
"m";"I REALLY like milk. It taste soo good and is healthy. Everyone should drink it";40000
"f";"Milk itself is tasty, but I don't like how cows are kept in huge stables";30000
"m";"I don't like milk, I have intolerance against lactose, so my stomach pains";35000
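All I want is labels saying that statements one and two are positive and the third is negative. I also know that I have to drop every word that does not occur in my dataset, otherwise I end up with zero counts, which will cause trouble.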
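One way to avoid words that never occur would be to build the feature vocabulary from my own statements instead of from movie_reviews. The following is only a minimal sketch under assumptions: the `statements` list is typed in by hand, and `tokenize` is a simple regex stand-in that I made up for illustration, not an NLTK function.

```python
import re
from collections import Counter

# Hand-typed copy of the three example statements (illustration only).
statements = [
    "I REALLY like milk. It taste soo good and is healthy. Everyone should drink it",
    "Milk itself is tasty, but I don't like how cows are kept in huge stables",
    "I don't like milk, I have intolerance against lactose, so my stomach pains",
]

def tokenize(text):
    """Lowercase the text and split it into word tokens (hypothetical helper)."""
    return re.findall(r"[a-z']+", text.lower())

# Count words over my own statements only, so every feature word occurs at least once.
word_counts = Counter(tok for s in statements for tok in tokenize(s))
word_features = [w for w, _ in word_counts.most_common(2000)]

def document_features(tokens):
    """Same scheme as the tutorial: one boolean feature per vocabulary word."""
    present = set(tokens)
    return {'contains({})'.format(w): (w in present) for w in word_features}

features = document_features(tokenize(statements[0]))
```

With the vocabulary built this way, a feature can still be False for a single statement, but no feature word is absent from the whole dataset.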
Since I do not fully understand how the data in movie_reviews is organized, I have not found a solution.
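As far as I can tell, movie_reviews is a categorized corpus with two categories, 'neg' and 'pos', and one text file per review. The list comprehension in the tutorial therefore yields a plain list of (tokens, label) pairs. A small sketch of that shape, with two entries I made up for illustration (they are not taken from the corpus):

```python
# Shape produced by the tutorial code: a list of (tokens, label) pairs.
# These two entries are invented for illustration only.
documents = [
    (['a', 'great', 'film', '.'], 'pos'),
    (['dull', 'and', 'far', 'too', 'long', '.'], 'neg'),
]

for tokens, label in documents:
    # Each entry pairs one document's token list with its category string.
    print(label, tokens[:3])

labels = sorted({label for _, label in documents})
```

If that reading is right, any data source that can be turned into such pairs should work with the rest of the tutorial code.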
My idea is to end up with a CSV like this:
sex;statement;income_dollar_year;class
"m";"I REALLY like milk. It taste soo good and is healthy. Everyone should drink it";40000;"pos"
"f";"Milk itself is tasty, but I don't like how cows are kept in huge stables";30000;"pos"
"m";"I don't like milk, I have intolerance against lactose, so my stomach pains";35000;"neg"
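A CSV like this could probably be read into the same (tokens, label) shape with the standard csv module. This is a minimal sketch, not a tested solution for the full dataset: the CSV is inlined as a string so the example runs on its own, and `tokenize` is a hypothetical regex helper standing in for a proper tokenizer.

```python
import csv
import io
import re

# The proposed CSV, inlined so the sketch is self-contained.
# For a real file, pass an open file object to DictReader instead.
csv_text = '''sex;statement;income_dollar_year;class
"m";"I REALLY like milk. It taste soo good and is healthy. Everyone should drink it";40000;"pos"
"f";"Milk itself is tasty, but I don't like how cows are kept in huge stables";30000;"pos"
"m";"I don't like milk, I have intolerance against lactose, so my stomach pains";35000;"neg"
'''

def tokenize(text):
    """Lowercase the text and split it into word tokens (hypothetical helper)."""
    return re.findall(r"[a-z']+", text.lower())

# Semicolon-delimited, double-quoted fields; the header row supplies the keys.
reader = csv.DictReader(io.StringIO(csv_text), delimiter=';', quotechar='"')

# Same shape as the tutorial's documents list: (tokens, label) pairs.
documents = [(tokenize(row['statement']), row['class']) for row in reader]
```

The resulting `documents` list can then go through the same document_features and nltk.NaiveBayesClassifier.train steps as above. The sex and income columns are ignored here; they could be added as extra entries in the feature dictionary if they turn out to matter.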