Question

Preface: this code comes from a kind soul on GitHub / YouTube: https://github.com/the-javapocalypse/

I made a few minor tweaks for personal use.

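A problem I keep running into with sentiment analysis on Twitter is that there are so many bot posts. I figure that if I can't avoid the bots entirely, maybe I can at least remove duplicates to hedge their impact.

For example, with “#bitcoin” or “#btc”, bot accounts exist under many different handles, all posting the exact same tweet. It might say: “It's going to the moon! Buy #btc now or regret it forever! Buy, buy, buy! Here's a link to my personal website [insert personal website URL here]”

This would register as a positive-sentiment post. If 25 accounts each post it twice, and I am only analyzing the 500 most recent tweets containing “#btc”, the results get inflated.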

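So my questions:

1. What is an efficient way to remove duplicates before writing to the csv file? I was thinking of adding a simple if statement that checks against an array to see whether the tweet already exists. There is a problem with this. Say I pull in 1000 tweets for analysis. If 500 of them are bot duplicates, my 1000-tweet analysis has just become a 501-tweet analysis. This leads to my next question.
2. What is a way to include a duplicate check that, each time a duplicate is found, adds 1 to the total number of tweets I requested? Example: I want to analyze 1000 tweets. One duplicate is found, so the analysis contains 999 unique tweets. I want the script to analyze one more so that it reaches 1000 unique tweets (1001 tweets fetched, including the 1 duplicate).
3. A minor addition, but I think it would also be useful to know how to drop every tweet that has an embedded hyperlink. This would follow the goal of question 2 by compensating for the dropped hyperlink tweets. Example: I want to analyze 1000 tweets. 500 of the 1000 have embedded URLs. Those 500 are removed from the analysis. I am now down to 500 tweets. I still want 1000. The script needs to keep fetching non-URL, non-duplicate tweets until 1000 unique non-URL tweets have been accounted for.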

See the entire script below:

import tweepy
import csv
import re
from textblob import TextBlob
import matplotlib.pyplot as plt


class SentimentAnalysis:

    def __init__(self):
        self.tweets = []
        self.tweetText = []

    def DownloadData(self):
        # authenticating
        consumerKey = ''
        consumerSecret = ''
        accessToken = ''
        accessTokenSecret = ''
        auth = tweepy.OAuthHandler(consumerKey, consumerSecret)
        auth.set_access_token(accessToken, accessTokenSecret)
        api = tweepy.API(auth)

        # input for term to be searched and how many tweets to search
        searchTerm = input("Enter Keyword/Tag to search about: ")
        NoOfTerms = int(input("Enter how many tweets to search: "))

        # searching for tweets
        self.tweets = tweepy.Cursor(api.search, q=searchTerm, lang="en").items(NoOfTerms)

        csvFile = open('result.csv', 'a')

        csvWriter = csv.writer(csvFile)

        # creating some variables to store info
        polarity = 0
        positive = 0
        negative = 0
        neutral = 0

        # iterating through tweets fetched
        for tweet in self.tweets:
            # Append to temp so that we can store in csv later. I use encode UTF-8
            self.tweetText.append(self.cleanTweet(tweet.text).encode('utf-8'))
            analysis = TextBlob(tweet.text)
            # print(analysis.sentiment)  # print tweet's polarity
            polarity += analysis.sentiment.polarity  # adding up polarities

            if (analysis.sentiment.polarity == 0):  # adding reaction
                neutral += 1
            elif (analysis.sentiment.polarity > 0.0):
                positive += 1
            else:
                negative += 1

        csvWriter.writerow(self.tweetText)
        csvFile.close()

        # finding average of how people are reacting
        positive = self.percentage(positive, NoOfTerms)
        negative = self.percentage(negative, NoOfTerms)
        neutral = self.percentage(neutral, NoOfTerms)

        # finding average reaction
        polarity = polarity / NoOfTerms

        # printing out data
        print("How people are reacting on " + searchTerm +
              " by analyzing " + str(NoOfTerms) + " tweets.")
        print()
        print("General Report: ")

        if (polarity == 0):
            print("Neutral")
        elif (polarity > 0.0):
            print("Positive")
        else:
            print("Negative")

        print()
        print("Detailed Report: ")
        print(str(positive) + "% positive")
        print(str(negative) + "% negative")
        print(str(neutral) + "% neutral")

        self.plotPieChart(positive, negative, neutral, searchTerm, NoOfTerms)

    def cleanTweet(self, tweet):
        # Remove Links, Special Characters etc from tweet
        return ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())

    # function to calculate percentage
    def percentage(self, part, whole):
        temp = 100 * float(part) / float(whole)
        return format(temp, '.2f')

    def plotPieChart(self, positive, negative, neutral, searchTerm, noOfSearchTerms):
        labels = ['Positive [' + str(positive) + '%]', 'Neutral [' + str(neutral) + '%]',
                  'Negative [' + str(negative) + '%]']
        sizes = [positive, neutral, negative]
        colors = ['yellowgreen', 'gold', 'red']
        patches, texts = plt.pie(sizes, colors=colors, startangle=90)
        plt.legend(patches, labels, loc="best")
        plt.title('How people are reacting on ' + searchTerm +
                  ' by analyzing ' + str(noOfSearchTerms) + ' Tweets.')
        plt.axis('equal')
        plt.tight_layout()
        plt.show()


if __name__ == "__main__":
    sa = SentimentAnalysis()
    sa.DownloadData()

Answer 1

Answer to the first question

You can remove duplicates with this one-liner.

self.tweets = list(set(self.tweets))

This removes all duplicate tweets. In case you want to see it in action, here is a simple example:

>>> tweets = ['this is a tweet', 'this is a tweet', 'Yet another Tweet', 'this is a tweet']
>>> print(tweets)
['this is a tweet', 'this is a tweet', 'Yet another Tweet', 'this is a tweet']
>>> tweets = list(set(tweets))
>>> print(tweets)
['this is a tweet', 'Yet another Tweet']
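
One caveat with the set approach is that a set does not keep the original order, so the printed order can differ from the one shown above. If order matters, a minimal sketch using dict.fromkeys (dict keys stay in insertion order on Python 3.7+) could look like this. The sample list is made up for illustration and the texts are plain strings, not tweepy objects.

```python
# Hypothetical sample data, not real tweets.
tweets = ['this is a tweet', 'this is a tweet', 'Yet another Tweet', 'this is a tweet']

# dict keys are unique and keep insertion order (Python 3.7+),
# so this drops repeats while keeping the first occurrence of each text.
unique_tweets = list(dict.fromkeys(tweets))
print(unique_tweets)  # ['this is a tweet', 'Yet another Tweet']
```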

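Answer to the second question

Since you have already removed the duplicates, you can get the number of removed tweets by taking the difference between NoOfTerms and the number of tweets left in self.tweets.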

tweets_to_further_scrape = NoOfTerms - len(self.tweets)

Now you can scrape tweets_to_further_scrape more tweets, and repeat this process of removing duplicates and scraping until you have the desired number of unique tweets.
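
An alternative to repeated scrape-and-dedupe rounds is to pull from a lazy source one tweet at a time and stop once enough unique texts have been seen. The sketch below assumes the tweets arrive as an iterable of plain strings. The helper name collect_unique and its parameters are hypothetical, and treating texts that differ only in letter case as duplicates is an assumption you may want to change.

```python
def collect_unique(tweet_texts, wanted, normalize=str.lower):
    """Pull texts from an iterable until `wanted` unique ones are collected.

    Returns (unique_texts, number_examined). If the source runs out first,
    fewer than `wanted` texts are returned.
    """
    seen = set()      # normalized texts already accepted
    unique = []       # accepted texts, in arrival order
    examined = 0      # total texts pulled, duplicates included
    for text in tweet_texts:
        examined += 1
        key = normalize(text)
        if key in seen:
            continue  # duplicate: skip it and keep pulling
        seen.add(key)
        unique.append(text)
        if len(unique) == wanted:
            break
    return unique, examined


# Hypothetical stand-in for a stream of tweet texts.
stream = iter([
    "Buy #btc now!",
    "buy #btc now!",
    "BTC looks flat today",
    "Buy #btc now!",
    "Selling my btc",
    "never reached",
])
unique, examined = collect_unique(stream, wanted=3)
print(unique)    # ['Buy #btc now!', 'BTC looks flat today', 'Selling my btc']
print(examined)  # 5
```

With this shape, the percentages and the average polarity would be divided by len(unique) rather than by the number of tweets originally requested.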

Answer to the third question

While iterating over the list of tweets, add this line to remove external links.

tweet.text = ' '.join([i for i in tweet.text.split() if 'http' not in i])
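
The line above strips link tokens from the text but keeps the tweet itself. The question asks to drop tweets with embedded URLs entirely, so a filter may be closer to what is wanted. Here is a minimal sketch of both options on plain strings. The names URL_PATTERN, has_url and strip_urls are hypothetical, and matching only http:// and https:// links is an assumption.

```python
import re

# Assumption: any "http://" or "https://" token counts as an embedded link.
URL_PATTERN = re.compile(r"https?://\S+")


def has_url(text):
    """Return True if the text contains at least one link."""
    return URL_PATTERN.search(text) is not None


def strip_urls(text):
    """Remove links from the text and collapse the leftover whitespace."""
    return " ".join(URL_PATTERN.sub(" ", text).split())


# Hypothetical sample texts.
samples = [
    "To the moon! https://example.com/buy",
    "btc is flat today",
    "read this http://example.org now",
]

kept = [t for t in samples if not has_url(t)]
print(kept)                    # ['btc is flat today']
print(strip_urls(samples[2]))  # read this now
```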

Hope this helps. Happy coding!

Answer 2

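You can simply keep a running count of tweet instances with a defaultdict. You probably also want to strip URLs, in case the bots are blasting out fresh shortened URLs each time.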

from collections import defaultdict

def __init__(self):
    ...
    self.tweet_count = defaultdict(int)

def track_tweet(self, tweet):
    t = self.clean_tweet(tweet)
    self.tweet_count[t] += 1

def clean_tweet(self, tweet):
    t = tweet.lower()
    # any other tweet normalization happens here, such as dropping URLs
    return t

def DownloadData(self):
    ...
    for tweet in self.tweets:
        ...
        # add logic to check for number of repeats in the dictionary.
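
To make the outline above concrete, here is one possible self-contained sketch of the counting idea. The TweetCounter class, its max_repeats parameter, and the rule that a text is analysed only the first time it appears are all assumptions for illustration. It works on plain strings rather than tweepy objects.

```python
import re
from collections import defaultdict


class TweetCounter:
    """Hypothetical helper, not part of the original script."""

    def __init__(self, max_repeats=1):
        self.tweet_count = defaultdict(int)  # normalized text -> times seen
        self.max_repeats = max_repeats       # how many copies to let through

    def clean_tweet(self, text):
        # Normalize: lowercase, drop URLs, collapse whitespace.
        text = re.sub(r"https?://\S+", " ", text.lower())
        return " ".join(text.split())

    def track_tweet(self, text):
        """Count the tweet and return True if it should still be analysed."""
        key = self.clean_tweet(text)
        self.tweet_count[key] += 1
        return self.tweet_count[key] <= self.max_repeats


counter = TweetCounter()
# Hypothetical sample texts: the first two differ only in case and short link.
texts = [
    "Buy #btc now! https://t.co/abc",
    "buy #BTC now! https://t.co/xyz",
    "btc is flat today",
]
accepted = [t for t in texts if counter.track_tweet(t)]
print(accepted)  # ['Buy #btc now! https://t.co/abc', 'btc is flat today']
print(dict(counter.tweet_count))  # {'buy #btc now!': 2, 'btc is flat today': 1}
```

Keeping the counts, rather than only a set of seen texts, also lets you inspect afterwards which messages were repeated most often.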

Twitter sentiment analysis - removing bot duplicates to get more accurate results
