Question

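I have a project where I download all the tweets sent to celebrities over the past year, run sentiment analysis on them, and evaluate who has the most positive fans.

Then I discovered that tweepy / the Twitter API only lets you retrieve Twitter mentions from the last 7 days. I scoured the net but couldn't find any way to download tweets from the past year.

Anyway, I decided to complete the project with just the last 7 days of data and wrote the following code: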

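try:
    while True:  # start the search over whenever the cursor is exhausted
        for results in tweepy.Cursor(twitter_api.search, q="@celebrity_handle").items(9999999):
            item = results.text.encode('utf-8').strip()
            wr.writerow([item, results.created_at])  # write to a csv (tweet, date)
except tweepy.TweepError as e:
    print("search failed: " + str(e))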

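I'm using the Cursor with the search API because the other way of getting mentions (the more accurate one) is limited to retrieving only the last 800 tweets.

Anyway, after running the code overnight, I was only able to download 32K tweets. About 90% of them were retweets.

Is there a better way to extract the data?

Keep in mind:

I want to do this for multiple celebrities (famous ones with millions of followers).
I don't care about retweets.
They have thousands of tweets sent to them each day.

Any suggestions are welcome, but at the moment I'm out of ideas.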

Answer 1

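I would use the search API. I did something similar with the following code, and it appears to have worked exactly as expected. I used it on a specific movie star and pulled 15568 tweets in a quick scan, all of which were @mentions. (I pulled from their entire timeline.)

In your case, for a search you'd want to run, say, daily, I would store the ID of the last mention you pulled for each user and set that value as "sinceId" each time you rerun the search.

As an aside, AppAuthHandler is much faster than OAuthHandler, and you don't need user authentication for these kinds of data pulls.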
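A minimal sketch of that bookkeeping, assuming the last-seen IDs are kept in a small JSON file between runs. The file name `since_ids.json` and the helpers `load_since_ids`, `save_since_ids`, and `update_since_id` are hypothetical names for illustration, not part of tweepy; the value stored for a user is what you would pass as `since_id` on the next run.

```python
import json
import os

STATE_FILE = "since_ids.json"  # hypothetical file name


def load_since_ids(path=STATE_FILE):
    """Return {username: last_seen_tweet_id}, or {} on the first run."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)


def save_since_ids(since_ids, path=STATE_FILE):
    """Persist the mapping so the next run can resume from it."""
    with open(path, "w") as f:
        json.dump(since_ids, f)


def update_since_id(since_ids, username, tweet_ids):
    """Record the highest tweet ID seen for username (newer tweets have larger IDs)."""
    if tweet_ids:
        previous = since_ids.get(username, 0)
        since_ids[username] = max([previous] + list(tweet_ids))
    return since_ids
```

After each pull you would call `update_since_id` with the IDs of the tweets just downloaded and then `save_since_ids`, so a rerun only asks for mentions newer than the stored value.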

import jsonpickle
import tweepy

auth = tweepy.AppAuthHandler(consumer_token, consumer_secret)
auth.secure = True
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

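# This is what we're searching for. In your case, I would create a list and
# iterate through all the usernames on each pass of the search query run.
searchQuery = '@username'

# This filters out retweets. The leading space keeps it a separate search term.
retweet_filter = ' -filter:retweets'

In each api.search call below, I would pass the following as the query parameter:

q=searchQuery+retweet_filter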
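As a rough sketch of the "list of usernames" idea, the query strings could be built up front, one per celebrity. The helper name `build_queries` is hypothetical; the `-filter:retweets` operator is the one used in this answer, and it is joined with a space because search terms are whitespace-separated.

```python
RETWEET_FILTER = "-filter:retweets"


def build_queries(usernames, extra_filter=RETWEET_FILTER):
    """Return one search query string per username, excluding retweets."""
    queries = []
    for name in usernames:
        # Accept handles with or without the leading "@".
        handle = name if name.startswith("@") else "@" + name
        queries.append("{0} {1}".format(handle, extra_filter))
    return queries
```

Each resulting string would then be passed as `q` in place of `searchQuery` for one pass of the download loop.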

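The following code (and the api setup above) is from this link:

tweetsPerQry = 100  # this is the max the API permits

fName = 'tweets.txt'  # we'll store the tweets in a text file

# If results from a specific ID onwards are required, set sinceId to that ID.
# Otherwise default to no lower limit and go as far back as the API allows.
sinceId = None

# If only results below a specific ID are wanted, set max_id to that ID.
# Otherwise default to no upper limit and start from the most recent tweet
# matching the search query.
max_id = -1

# However many you want to limit your collection to. How much storage space do you have?
maxTweets = 10000000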

tweetCount = 0
print("Downloading max {0} tweets".format(maxTweets))
with open(fName, 'w') as f:
    while tweetCount < maxTweets:
        try:
            if (max_id <= 0):
                if (not sinceId):
                    new_tweets = api.search(q=searchQuery, count=tweetsPerQry)
                else:
                    new_tweets = api.search(q=searchQuery, count=tweetsPerQry,
                                            since_id=sinceId)
            else:
                if (not sinceId):
                    new_tweets = api.search(q=searchQuery, count=tweetsPerQry,
                                            max_id=str(max_id - 1))
                else:
                    new_tweets = api.search(q=searchQuery, count=tweetsPerQry,
                                            max_id=str(max_id - 1),
                                            since_id=sinceId)
            if not new_tweets:
                print("No more tweets found")
                break
            for tweet in new_tweets:
                f.write(jsonpickle.encode(tweet._json, unpicklable=False) +
                        '\n')
            tweetCount += len(new_tweets)
            print("Downloaded {0} tweets".format(tweetCount))
            max_id = new_tweets[-1].id
        except tweepy.TweepError as e:
            # Just exit if any error
            print("some error : " + str(e))
            break

print ("Downloaded {0} tweets, Saved to {1}".format(tweetCount, fName))
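Since the loop above writes one JSON object per line, the saved file can be read back to count original (non-retweet) mentions per author. This is only a sketch: the function name `count_mentions_by_author` is hypothetical, and it assumes each line carries the standard v1.1 tweet fields `user.screen_name` and, on retweets only, `retweeted_status`.

```python
import json
from collections import Counter


def count_mentions_by_author(lines):
    """Count non-retweet tweets per author from JSON-lines input."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        if "retweeted_status" in tweet:  # assumed to be present only on retweets
            continue
        counts[tweet["user"]["screen_name"]] += 1
    return counts
```

Passing an open handle on `tweets.txt` to this function gives a `Counter`, and `most_common()` on the result lists the most active fans first, which is one possible input to the sentiment step.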

Get all Twitter mentions using tweepy for users with millions of followers
