Finding out whether a sentence is positive, neutral, or negative?

Time: 2019-12-29 16:21:28

Tags: python machine-learning nlp nltk sentiment-analysis

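I want to create a script that finds out whether a sentence is positive, neutral, or negative.

I searched online and found, through a Medium article, that this can be done with the NLTK library.

So I tried this code.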

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews


def extract_features(word_list):
    # Bag-of-words features: every word that occurs maps to True
    return {word: True for word in word_list}


if __name__ == '__main__':
    # Load positive and negative reviews
    positive_fileids = movie_reviews.fileids('pos')
    negative_fileids = movie_reviews.fileids('neg')

    features_positive = [(extract_features(movie_reviews.words(fileids=[f])),
                          'Positive') for f in positive_fileids]
    features_negative = [(extract_features(movie_reviews.words(fileids=[f])),
                          'Negative') for f in negative_fileids]

    # Split the data into train and test (80/20)
    threshold_factor = 0.8
    threshold_positive = int(threshold_factor * len(features_positive))
    threshold_negative = int(threshold_factor * len(features_negative))

    features_train = features_positive[:threshold_positive] + features_negative[:threshold_negative]
    features_test = features_positive[threshold_positive:] + features_negative[threshold_negative:]
    print("\nNumber of training datapoints:", len(features_train))
    print("Number of test datapoints:", len(features_test))

    # Train a Naive Bayes classifier
    classifier = NaiveBayesClassifier.train(features_train)
    print("\nAccuracy of the classifier:", nltk.classify.util.accuracy(classifier, features_test))

    print("\nTop 10 most informative words:")
    for item in classifier.most_informative_features()[:10]:
        print(item[0])

    # Sample input reviews
    input_reviews = [
        "Started off as the greatest series of all time, but had the worst ending of all time.",
        "Exquisite. 'Big Little Lies' takes us to an incredible journey with its emotional and intriguing storyline.",
        "I love Brooklyn 99 so much. It has the best crew ever!!",
        "The Big Bang Theory and to me it's one of the best written sitcoms currently on network TV.",
        "'Friends' is simply the best series ever aired. The acting is amazing.",
        "SUITS is smart, sassy, clever, sophisticated, timely and immensely entertaining!",
        "Cumberbatch is a fantastic choice for Sherlock Holmes-he is physically right (he fits the traditional reading of the character) and he is a damn good actor",
        # The trailing comma matters: adjacent string literals are otherwise joined into one review
        "What sounds like a typical agent hunting serial killer, surprises with great characters, surprising turning points and amazing cast.",
        "This is one of the most magical things I have ever had the fortune of viewing.",
        "I don't recommend watching this at all!",
    ]

    print("\nPredictions:")
    for review in input_reviews:
        print("\nReview:", review)
        probdist = classifier.prob_classify(extract_features(review.split()))
        pred_sentiment = probdist.max()
        print("Predicted sentiment:", pred_sentiment)
        print("Probability:", round(probdist.prob(pred_sentiment), 2))

This is the output I got:

Number of training datapoints: 1600
Number of test datapoints: 400

Accuracy of the classifier: 0.735

Top 10 most informative words:
outstanding
insulting
vulnerable
ludicrous
uninvolving
avoids
astounding
fascination
affecting
seagal

Predictions:

Review: Started off as the greatest series of all time, but had the worst ending of all time.
Predicted sentiment: Negative
Probability: 0.64

Review: Exquisite. 'Big Little Lies' takes us to an incredible journey with its emotional and intriguing storyline.
Predicted sentiment: Positive
Probability: 0.89

Review: I love Brooklyn 99 so much. It has the best crew ever!!
Predicted sentiment: Negative
Probability: 0.51

Review: The Big Bang Theory and to me it's one of the best written sitcoms currently on network TV.
Predicted sentiment: Positive
Probability: 0.62

Review: 'Friends' is simply the best series ever aired. The acting is amazing.
Predicted sentiment: Positive
Probability: 0.55

Review: SUITS is smart, sassy, clever, sophisticated, timely and immensely entertaining!
Predicted sentiment: Positive
Probability: 0.82

Review: Cumberbatch is a fantastic choice for Sherlock Holmes-he is physically right (he fits the traditional reading of the character) and he is a damn good actor
Predicted sentiment: Positive
Probability: 1.0

Review: What sounds like a typical agent hunting serial killer, surprises with great characters, surprising turning points and amazing cast.This is one of the most magical things I have ever had the fortune of viewing.
Predicted sentiment: Positive
Probability: 0.95

Review: I don't recommend watching this at all!
Predicted sentiment: Negative
Probability: 0.53

Process finished with exit code 0

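The problem I am facing is that the dataset is very limited, so the accuracy of the output is very low. Is there a better library, resource, or any other way to check whether a statement is positive, neutral, or negative?

More specifically, I want to apply this to general everyday conversation.

2 Answers:

Answer 0: (score: 2)

The Amazon Customer Reviews dataset is a huge dataset containing 130+ million customer reviews. You can use it for sentiment analysis by matching each review with its rating. That much data is also well suited to data-hungry deep learning methods.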

https://s3.amazonaws.com/amazon-reviews-pds/readme.html
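
As a rough sketch of what matching reviews with ratings could look like: the code below assumes the files are tab-separated with `star_rating` and `review_body` columns, and the cut-offs (1-2 stars negative, 3 neutral, 4-5 positive) are an arbitrary choice for illustration. The sample rows are made up, not taken from the dataset.

```python
import csv
import io

# Made-up sample in the assumed tab-separated layout; not real dataset rows.
SAMPLE_TSV = (
    "star_rating\treview_body\n"
    "5\tWorks perfectly, very happy with it.\n"
    "3\tIt is okay, nothing special.\n"
    "1\tBroke after two days.\n"
)


def rating_to_sentiment(star_rating):
    """Map a 1-5 star rating to a label (the thresholds are an assumption)."""
    if star_rating <= 2:
        return "Negative"
    if star_rating == 3:
        return "Neutral"
    return "Positive"


def label_reviews(tsv_file):
    """Return (review_text, sentiment) pairs read from a tab-separated file object."""
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        (row["review_body"], rating_to_sentiment(int(row["star_rating"])))
        for row in reader
    ]


labelled = label_reviews(io.StringIO(SAMPLE_TSV))
for text, sentiment in labelled:
    print(sentiment, "-", text)
```

Labelling this way yields three classes, including a neutral one, whereas the `movie_reviews` corpus in the question only has `pos` and `neg` categories.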

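If you are looking specifically for movie reviews, the Large Movie Review Dataset is also an option; it contains more than 50,000 IMDB reviews (http://ai.stanford.edu/~amaas/data/sentiment/).

I recommend using word embeddings instead of a one-hot encoded bag of words to strengthen your model.

Answer 1: (score: 0)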
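
To illustrate the difference, here is a small sketch. The 3-dimensional vectors are invented for this example (real embeddings such as word2vec or GloVe are learned from large corpora and are far larger), and averaging the word vectors is just one simple way to get a sentence representation.

```python
import math

# Hypothetical toy vectors, invented for illustration only.
TOY_EMBEDDINGS = {
    "great": [0.9, 0.1, 0.0],
    "amazing": [0.8, 0.2, 0.1],
    "awful": [-0.9, 0.1, 0.0],
    "terrible": [-0.8, 0.2, 0.1],
    "movie": [0.0, 0.9, 0.3],
}


def one_hot_features(tokens):
    """Bag-of-words features in the style of the question's extract_features."""
    return {token: True for token in tokens}


def average_embedding(tokens, embeddings=TOY_EMBEDDINGS):
    """Represent a sentence as the mean of its known word vectors."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("no token has an embedding")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


great = "great movie".split()
amazing = "amazing movie".split()
awful = "awful movie".split()

# Bag of words: "great" and "amazing" are unrelated keys, only "movie" overlaps.
shared = set(one_hot_features(great)) & set(one_hot_features(amazing))
print("Shared bag-of-words features:", shared)

# Embeddings: similar words have similar vectors, so the sentences end up close.
sim_close = cosine_similarity(average_embedding(great), average_embedding(amazing))
sim_far = cosine_similarity(average_embedding(great), average_embedding(awful))
print("great vs amazing:", round(sim_close, 2))
print("great vs awful:", round(sim_far, 2))
```

With one-hot features the model has to see "amazing" in training to learn anything about it; with embeddings, what it learns about "great" can carry over to nearby words.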

There are already some corpora available:

English:

1) Multi-Domain Sentiment Dataset: http://www.cs.jhu.edu/~mdredze/datasets/sentiment/

2) IMDB reviews: http://ai.stanford.edu/~amaas/data/sentiment/

3) Stanford Sentiment Treebank: http://nlp.stanford.edu/sentiment/code.html

4) Sentiment140: http://help.sentiment140.com/for-students/

5) Twitter US Airline Sentiment: https://www.kaggle.com/crowdflower/twitter-airline-sentiment

and more; see here: 50 free Machine Learning datasets: Sentiment Analysis, and here: nlpprogress

Chinese:

7) THUCNews: http://thuctc.thunlp.org/

8) Toutiao: https://github.com/fate233/toutiao-text-classfication-dataset

9) SogouCA: https://www.sogou.com/labs/resource/ca.php

10) SogouCS: https://www.sogou.com/labs/resource/cs.php

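Once the dataset is large enough, you can switch to a discriminative model: on smaller datasets generative models guard against overfitting, whereas on larger datasets discriminative models capture dependencies that generative models cannot (details here).

It's said that tree structures model sentiment better when there is less data, so I think the tree-structured LSTM is worth considering here.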
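
To make the two model families concrete: Naive Bayes, which the question uses, is a generative model, and logistic regression is a common discriminative counterpart. Below is a toy pure-Python sketch of both on a handful of invented sentences. The data, learning rate, and epoch count are arbitrary assumptions, and for real work you would use a library implementation rather than this.

```python
import math
from collections import Counter, defaultdict

# Toy labelled data, invented purely for illustration.
TRAIN = [
    ("i love this it is great", "pos"),
    ("what a wonderful happy day", "pos"),
    ("great fun and wonderful people", "pos"),
    ("i hate this it is awful", "neg"),
    ("what a terrible sad day", "neg"),
    ("awful service and terrible food", "neg"),
]


def train_naive_bayes(data):
    """Generative: count how often each label and each word per label occurs."""
    label_counts = Counter(label for _, label in data)
    word_counts = defaultdict(Counter)
    for text, label in data:
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return label_counts, word_counts, vocab


def predict_naive_bayes(model, text):
    """Pick the label maximising log P(label) + sum of log P(word | label)."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_docs)
        for word in text.split():
            if word in vocab:  # ignore words never seen in training
                # Add-one smoothing avoids log(0) for unseen word/label pairs.
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        scores[label] = score
    return max(scores, key=scores.get)


def train_logistic_regression(data, epochs=200, learning_rate=0.5):
    """Discriminative: fit weights for P(pos | words) by stochastic gradient ascent."""
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        for text, label in data:
            words = set(text.split())
            z = bias + sum(weights[w] for w in words)
            p = 1.0 / (1.0 + math.exp(-z))
            error = (1.0 if label == "pos" else 0.0) - p
            bias += learning_rate * error
            for w in words:
                weights[w] += learning_rate * error
    return weights, bias


def predict_logistic_regression(model, text):
    """Label a sentence by the sign of the learned linear score."""
    weights, bias = model
    z = bias + sum(weights.get(w, 0.0) for w in set(text.split()))
    return "pos" if z > 0 else "neg"


nb_model = train_naive_bayes(TRAIN)
lr_model = train_logistic_regression(TRAIN)
for sentence in ["a wonderful great day", "terrible awful food"]:
    print(sentence, "->",
          predict_naive_bayes(nb_model, sentence),
          predict_logistic_regression(lr_model, sentence))
```

Naive Bayes only needs counts, which is part of why it behaves well on small datasets; logistic regression fits the decision boundary directly, which tends to pay off once there is enough data to estimate the weights reliably.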