How do I collect more than 200 tweets for multiple Twitter handles/users?

Time: 2020-09-04 22:44:30

Tags: pandas numpy dataframe twitter tweepy

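For a given list of users, I am trying to collect more than 200 tweets each, which is the most that a single user_timeline request returns.

However, my code only populates the DataFrame when a user has fewer than 200 tweets; for users with more than 200 tweets, the values are not added to the DataFrame.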

Full code (IN):

import tweepy
import pandas as pd
import numpy as np
from datetime import timedelta

handles = ['@MrML16419203', '@d00tn00t']

consumerKey, consumerSecret, accessToken, accessTokenSecret = 'x', 'x', 'x', 'x'
authenticate = tweepy.OAuthHandler(consumerKey, consumerSecret)
authenticate.set_access_token(accessToken, accessTokenSecret)
api_twitter = tweepy.API(authenticate, wait_on_rate_limit=True)

total_tweets = []
def get_tweets(handle):
    batch_count_for_tweet_downloads = 200
    try:
        alltweets = []
        tweets = api_twitter.user_timeline(screen_name=handle,
                                           count=batch_count_for_tweet_downloads,
                                           exclude_replies=True,
                                           include_rts=False,
                                           lang="en",
                                           tweet_mode="extended")
        alltweets.extend(tweets)
        oldest = alltweets[-1].id - 1
        oldest_datetime = pd.to_datetime(str(pd.to_datetime(oldest))[:-10]).strftime("%Y-%m-%d %H:%M:%S")
        print(f"Getting Tweets For " + handle + ", After: " + oldest_datetime)
        
        while len(tweets) > 0:
            tweets = api_twitter.user_timeline(screen_name=handle, count=batch_count_for_tweet_downloads, max_id=oldest)
            alltweets.extend(tweets)
            if len(alltweets) > 0:
                oldest = alltweets[-1].id - 1
            else:
                pass
            print("Count: " + f"...{len(alltweets)} " + handle + " Tweets Downloaded")

        print('---Total Downloaded: ' + str(len(alltweets)) + ' for ' + handle + '---')

        df = pd.DataFrame(data=[tweets.user.screen_name for tweets in alltweets], columns=['Handle'])
        df['Tweets'] = np.array([tweets.full_text for tweets in alltweets])
        df['Date'] = np.array([tweets.created_at - timedelta(hours=4) for tweets in alltweets])
        df['Len'] = np.array([len(tweets.full_text) for tweets in alltweets])
        df['Like_count'] = np.array([tweets.favorite_count for tweets in alltweets])
        df['RT_count'] = np.array([tweets.retweet_count for tweets in alltweets])

        total_tweets.extend(alltweets)

        print("----------Total Tweets Extracted: {}".format(df.shape[0]) + "----------")

    except:
        pass
    return df

df = pd.DataFrame()

for handle in handles:
    df_new = get_tweets(handle)
    df = pd.concat((df, df_new))

print(df)

OUT:

           Handle   Tweets                Date  Len  Like_count  RT_count
0    MrML16419203   132716 2020-09-02 02:18:28  6.0         0.0       0.0
1    MrML16419203   432881 2020-09-02 02:04:23  6.0         0.0       0.0
2    MrML16419203   973625 2020-09-02 02:04:09  6.0         0.0       0.0
3    MrML16419203  1234567 2020-09-02 01:55:10  7.0         0.0       0.0
4    MrML16419203   225865 2020-09-02 01:27:11  6.0         0.0       0.0
..            ...      ...                 ...  ...         ...       ...
536      d00tn00t      NaN                 NaT  NaN         NaN       NaN
537      d00tn00t      NaN                 NaT  NaN         NaN       NaN
538      d00tn00t      NaN                 NaT  NaN         NaN       NaN
539      d00tn00t      NaN                 NaT  NaN         NaN       NaN
540      d00tn00t      NaN                 NaT  NaN         NaN       NaN

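As you can see, any user with more than 200 tweets still comes back with NaN and NaT values, even though my console shows the while loop downloading those data points.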
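
A likely explanation, offered as an assumption rather than a confirmed diagnosis: the first user_timeline call passes tweet_mode="extended", but the call inside the while loop does not, so statuses fetched by the loop carry text instead of full_text. Reading full_text then raises AttributeError after the Handle column has already been created, and the bare except hides the error. The sketch below reproduces that pattern offline with stand-in objects (SimpleNamespace, not real Tweepy statuses); tweet_text and build_columns_unsafely are hypothetical names used only for illustration.

```python
from types import SimpleNamespace

# Stand-ins for Tweepy Status objects. Assumption: extended-mode statuses
# expose `full_text`, while default-mode statuses expose only `text`.
extended = SimpleNamespace(id=2, full_text="tweet from the first request")
compat = SimpleNamespace(id=1, text="tweet from a later request")
statuses = [extended, compat]


def build_columns_unsafely(statuses):
    """Mirror the question's pattern: build columns one by one inside try/except."""
    columns = {"Handle": ["someone" for _ in statuses]}
    try:
        # Fails on the default-mode status, which has no `full_text`.
        columns["Tweets"] = [s.full_text for s in statuses]
    except:  # deliberately bare, as in the question; the error is swallowed
        pass
    return columns


def tweet_text(status):
    """Hypothetical helper: prefer `full_text`, fall back to `text`."""
    return getattr(status, "full_text", None) or getattr(status, "text", "")


partial = build_columns_unsafely(statuses)
texts = [tweet_text(s) for s in statuses]

print(list(partial))  # only the column that was built before the failure
print(texts)
```

If this is what happens, replacing the bare except with one that prints or re-raises the exception should make the AttributeError visible in the console.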

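I have tried several solutions (such as Cursor), but none of them worked, and when I try to extract tweets only from users with more than 200 tweets I get a length mismatch error. That is because the returned DataFrame is empty apart from the Handle column, which can also be seen when exporting to CSV.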
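
A minimal sketch of max_id paging that might avoid the length mismatch, under two assumptions: every request passes the same keyword arguments (including tweet_mode="extended", so each status has full_text), and rows are built as one dict per tweet instead of column by column. The names fetch_all_statuses, to_rows, and FakeAPI are hypothetical; FakeAPI is only a stand-in for tweepy.API so that the example runs without credentials.

```python
from datetime import datetime, timedelta
from types import SimpleNamespace


def fetch_all_statuses(api, handle, batch_size=200):
    """Page backwards through a timeline with max_id until a batch comes back empty."""
    # Identical keyword arguments for every request, so all batches have the same shape.
    params = dict(screen_name=handle, count=batch_size,
                  exclude_replies=True, include_rts=False,
                  tweet_mode="extended")
    statuses = []
    batch = api.user_timeline(**params)
    while batch:
        statuses.extend(batch)
        oldest = batch[-1].id - 1  # ask only for tweets older than the last one seen
        batch = api.user_timeline(max_id=oldest, **params)
    return statuses


def to_rows(statuses):
    """One dict per tweet, so every column has the same length by construction."""
    return [{"Handle": s.user.screen_name,
             "Tweets": s.full_text,
             "Date": s.created_at,
             "Len": len(s.full_text),
             "Like_count": s.favorite_count,
             "RT_count": s.retweet_count} for s in statuses]


class FakeAPI:
    """Offline stand-in for tweepy.API (not part of Tweepy)."""

    def __init__(self, total):
        start = datetime(2020, 9, 2)
        user = SimpleNamespace(screen_name="someone")
        # Newest first, ids descending, like a real timeline.
        self._timeline = [
            SimpleNamespace(id=i, user=user, full_text=f"tweet {i}",
                            created_at=start + timedelta(minutes=i),
                            favorite_count=0, retweet_count=0)
            for i in range(total, 0, -1)
        ]

    def user_timeline(self, screen_name, count, max_id=None, **_ignored):
        older = [s for s in self._timeline if max_id is None or s.id <= max_id]
        return older[:count]


statuses = fetch_all_statuses(FakeAPI(total=450), "someone")
rows = to_rows(statuses)
print(len(rows), rows[0]["Tweets"], rows[-1]["Tweets"])
```

With real credentials, pd.DataFrame(to_rows(fetch_all_statuses(api_twitter, handle))) would give one frame per handle, and the frames could then be combined with pd.concat. Two caveats: Twitter applies exclude_replies and include_rts after count, so a batch can be shorter than 200 (and the loop may stop early if a whole batch is filtered out), and the standard timeline endpoint only reaches back about 3,200 tweets per user.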

Any help would be greatly appreciated. Thank you.

0 answers:

No answers