How can I use annotations in a CSV table for text classification?

Time: 2017-12-29 16:38:07

Tags: python csv nltk

I have several statements about whether people like a product. I have anonymized the comments and used different names.

Since the classification problem is very similar to movie reviews, I relied entirely on the tutorial at http://www.nltk.org/book/ch06.html, section 1.3.

import random

import nltk
from nltk.corpus import movie_reviews

# Pair each review's word list with its category.
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

# Word frequencies over the whole corpus.
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
# First 2000 distinct words, as in the tutorial.
word_features = list(all_words)[:2000]

def document_features(document):
    document_words = set(document)  # set membership is faster than list membership
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

So far so good.

My concern is that my data is organized as follows.

sex;statement;income_dollar_year
"m";"I REALLY like milk. It taste soo good and is healthy. Everyone should drink it";40000
"f";"Milk itself is tasty, but I don't like how cows are kept in huge stables";30000
"m";"I don't like milk, I have intolerance against lactose, so my stomach pains";35000

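I just want labels saying that statements one and two are positive and the third is negative. I also know that I have to remove all words that do not occur in my data set, otherwise I will end up with zeros, which will get me into trouble.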
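One way to address the unseen-word concern is to build `word_features` from the statements themselves instead of from `movie_reviews`, so every feature word occurs at least once in the data. A minimal sketch using only the standard library; the `statements` list, the regex-based `tokenize` helper, and the use of `Counter` in place of `nltk.FreqDist` are illustrative assumptions, not part of the tutorial:

```python
import re
from collections import Counter

# The three example statements from the question.
statements = [
    "I REALLY like milk. It taste soo good and is healthy. Everyone should drink it",
    "Milk itself is tasty, but I don't like how cows are kept in huge stables",
    "I don't like milk, I have intolerance against lactose, so my stomach pains",
]

def tokenize(text):
    """Hypothetical helper: lowercase the text and pull out word tokens."""
    return re.findall(r"[a-z']+", text.lower())

# Vocabulary drawn only from the data set itself.
all_words = Counter(w for s in statements for w in tokenize(s))
word_features = [w for w, _ in all_words.most_common(2000)]

def document_features(document_words):
    """Same feature shape as the tutorial: one boolean per vocabulary word."""
    words = set(document_words)
    return {'contains({})'.format(w): (w in words) for w in word_features}

features = document_features(tokenize(statements[2]))
```

Features that are `False` for a given statement are expected here; what this avoids is feature words that never appear anywhere in the data.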

Since I don't know exactly how the data in movie_reviews is organized, I haven't found a solution.
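For reference, the `documents` list in the tutorial code is simply a list of `(word_list, category)` tuples, so the corpus reader is not strictly needed. A minimal sketch that builds the same structure by hand from the three statements; the labels follow the question, while the `tokenize` helper and the `labeled_statements` name are assumptions for illustration:

```python
import re

def tokenize(text):
    """Hypothetical helper: lowercase the text and pull out word tokens."""
    return re.findall(r"[a-z']+", text.lower())

# (statement, label) pairs, labeled as described in the question.
labeled_statements = [
    ("I REALLY like milk. It taste soo good and is healthy. Everyone should drink it", "pos"),
    ("Milk itself is tasty, but I don't like how cows are kept in huge stables", "pos"),
    ("I don't like milk, I have intolerance against lactose, so my stomach pains", "neg"),
]

# Same shape as the tutorial's `documents`: a list of (list of words, category).
documents = [(tokenize(text), label) for text, label in labeled_statements]
```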

My idea is to get a CSV like this:

sex;statement;income_dollar_year;class
"m";"I REALLY like milk. It taste soo good and is healthy. Everyone should drink it";40000;"pos"
"f";"Milk itself is tasty, but I don't like how cows are kept in huge stables";30000;"pos"
"m";"I don't like milk, I have intolerance against lactose, so my stomach pains";35000;"neg"
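A CSV in this layout could be turned into the tutorial's `documents` structure with the standard `csv` module. A minimal sketch, assuming a semicolon delimiter and the column names shown above; `load_documents` and `tokenize` are hypothetical helpers, and the data is read from an in-memory string so the example is self-contained:

```python
import csv
import io
import re

csv_text = '''sex;statement;income_dollar_year;class
"m";"I REALLY like milk. It taste soo good and is healthy. Everyone should drink it";40000;"pos"
"f";"Milk itself is tasty, but I don't like how cows are kept in huge stables";30000;"pos"
"m";"I don't like milk, I have intolerance against lactose, so my stomach pains";35000;"neg"
'''

def tokenize(text):
    """Hypothetical helper: lowercase the text and pull out word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def load_documents(handle):
    """Read semicolon-separated rows and return (word_list, label) tuples."""
    reader = csv.DictReader(handle, delimiter=';', quotechar='"')
    return [(tokenize(row['statement']), row['class']) for row in reader]

documents = load_documents(io.StringIO(csv_text))
```

With a real file, the handle would come from something like `open('statements.csv', newline='', encoding='utf-8')`, where the filename is an assumption. The resulting list has the same shape as the tutorial's `documents`, so it could be passed through `document_features` and on to `nltk.NaiveBayesClassifier.train`, though three rows are far too few to train a useful classifier.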

0 answers:

No answers