Question about extracting tweet information based on tweet ID

Time: 2019-06-02 06:01:55

Tags: python pandas tweepy social-media tweetstream

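I am currently working on a project that looks for traffic-related information in real-time tweets.

The dataset is available here: Tweets with traffic-related labels. However, the original dataset only contains the class label, the tweet ID, and the raw text. I would like to get more information about each tweet, such as its creation time and the user ID, so I decided to use Tweepy to fetch it. Below is the code I wrote to retrieve this information: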

# This module helps us build the twitter dataset for the following analysis
# Current training dataset only contains tweetID, class label and text
# To derive more interesting research output, more information should be considered
import pandas as pd
import tweepy
import os
import time
import numpy as np

# twitter_credentials save the keys and access tokens
import twitter_credentials
# data_paths saves the local paths needed in this module
import data_paths


# # # # TWITTER AUTHENTICATER # # # #
class TwitterAuthenticator():

    def authenticate_twitter_app(self):
        auth = tweepy.OAuthHandler(twitter_credentials.CONSUMER_KEY, twitter_credentials.CONSUMER_SECRET)
        auth.set_access_token(twitter_credentials.ACCESS_TOKEN, twitter_credentials.ACCESS_TOKEN_SECRET)
        return auth

# Get relevant tweet information based on tweet ID
class TweetAnalyzer():
    """
    Functionality to Build the Twitter Analysis Dataframe
    """
    def __init__(self, tweet_id):
        self.tweet_id = tweet_id
        self.auth = TwitterAuthenticator().authenticate_twitter_app()

    def tweets_to_data_frame(self):
        api = tweepy.API(self.auth)
        columns_order_list = ['user_id', 'tweet_id', 'date', 'Tweets', 'len', 'likes', 'retweets', 'lang',
                              'location', 'verified']
        tweet = api.get_status(self.tweet_id)
        df = pd.DataFrame(data=[tweet.text], columns=['Tweets'])
        df['tweet_id'] = np.array(tweet.id)
        # Count the number of characters in a tweet
        df['len'] = np.array(len(tweet.text))
        df['date'] = np.array(tweet.created_at)
        df['likes'] = np.array(tweet.favorite_count)
        df['retweets'] = np.array(tweet.retweet_count)
        df['lang'] = np.array(tweet.lang)
        df['location'] = np.array(tweet.place)
        # get the id account of user
        df['user_id'] = np.array(tweet._json['user']['id'])
        # check whether an account is verified or not
        df['verified'] = np.array(tweet._json['user']['verified'])
        df = df[columns_order_list]
        return df

    def on_error(self, status):
        if status == 420:
            # Return False when the rate limit is hit (HTTP 420)
            return False
        print(status)


if __name__ == "__main__":

    training_data = pd.read_csv(os.path.join(data_paths.data_path, '1_TrainingSet_3Class.csv'),
                                   names=['label', 'ID', 'text'])
    test_data = pd.read_csv(os.path.join(data_paths.data_path, '1_TestSet_3Class.csv'),
                            names=['label', 'ID', 'text'])
    ids_train = list(training_data['ID'])
    ids_test = list(test_data['ID'])
    train_dataframes_list = []
    test_dataframes_list = []
    print('Dealing with training data....')
    for index, id in enumerate(ids_train):
        true_id = int(id[1:])
        print('Training Data: Coping with the {} tweetID: {}'.format(index+1, true_id))
        tweet_object = TweetAnalyzer(tweet_id=true_id)
        try:
            train_dataframes_list.append(tweet_object.tweets_to_data_frame())
            continue
        except tweepy.error.TweepError:
            print('No status found with the ID {}'.format(true_id))
            continue
        except tweepy.error.RateLimitError:
            print('When Index equals {}, we need to take a break. The ID is {}'.format(index+1, true_id))
            time.sleep(60 * 15)
            print('Restart coping with tweetID {}'.format(true_id))
            train_dataframes_list.append(tweet_object.tweets_to_data_frame())
            print('TweetID {} is done. Continue fetching information'.format(true_id))
            continue
        except StopIteration:
            print('The program stops when Index = {} and ID is {}'.format(index+1, true_id))
            break

    print('Done!')
    print('------------------------------------')
    print('Dealing with test data....')
    for index, id in enumerate(ids_test):
        true_id = int(id[1:])
        print('Test Data: Coping with the {} tweetID: {}'.format(index + 1, true_id))
        tweet_object = TweetAnalyzer(tweet_id=true_id)
        try:
            test_dataframes_list.append(tweet_object.tweets_to_data_frame())
            continue
        except tweepy.error.TweepError:
            print('No status found with the ID {}'.format(true_id))
            continue
        except tweepy.error.RateLimitError:
            print('When Index equals {}, we need to take a break. The ID is {}'.format(index + 1, true_id))
            time.sleep(60 * 15)
            print('Restart coping with tweetID {}'.format(true_id))
            test_dataframes_list.append(tweet_object.tweets_to_data_frame())
            print('TweetID {} is done. Continue fetching information'.format(true_id))
            continue
        except StopIteration:
            print('The program stops when Index = {} and ID is {}'.format(index + 1, true_id))
            break
    print('Done!')
    print('------------------------------------')
    check_dataframe_train = pd.concat(train_dataframes_list, axis=0)
    check_dataframe_test = pd.concat(test_dataframes_list, axis=0)
    check_dataframe_train.to_pickle(os.path.join(data_paths.desktop, 'train.pkl'))
    check_dataframe_test.to_pickle(os.path.join(data_paths.desktop, 'test.pkl'))

However, when I run this code, for certain tweet IDs (for example 872960543077928962) I always get the following message:

No status found with the ID 872960543077928962

because tweepy.error.TweepError is raised. However, according to this stackoverflow question, I can still open the tweet through the following link: https://twitter.com/statuses/872960543077928962

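So my question is: why is this tweepy.error.TweepError raised? I have added a handler for the rate limit exception, so I do not think a rate limit error is the cause in this case.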
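One possible cause, assuming tweepy 3.x, where `RateLimitError` is defined as a subclass of `TweepError`: Python tries `except` clauses in order, so the `except tweepy.error.TweepError` branch in the loops above would also catch rate limit errors and print the "No status found" message for them, while the `RateLimitError` branch would never run. The sketch below reproduces this behavior with stand-in exception classes (hypothetical, defined here only so the snippet runs without tweepy or credentials).

```python
# Stand-in classes that mirror the tweepy 3.x hierarchy, where
# RateLimitError inherits from TweepError (assumption for illustration).
class TweepError(Exception):
    pass


class RateLimitError(TweepError):
    pass


def classify_original_order(exc):
    """Handlers in the same order as the question's loops."""
    try:
        raise exc
    except TweepError:
        # Also matches RateLimitError, because it is a subclass.
        return 'generic'
    except RateLimitError:
        # Unreachable: the clause above always wins.
        return 'rate limit'


def classify_fixed_order(exc):
    """Most specific handler first."""
    try:
        raise exc
    except RateLimitError:
        return 'rate limit'
    except TweepError:
        return 'generic'


print(classify_original_order(RateLimitError()))
print(classify_fixed_order(RateLimitError()))
```

If this is what is happening, listing `except tweepy.error.RateLimitError` before `except tweepy.error.TweepError` in both loops should let the 15-minute sleep take effect. Printing the caught exception instead of a fixed message would also show the actual reason for each failure.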

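I also have a second question about tweepy.error.TweepError. I have seen this page: tweepy error response codes. But how can I use these codes to single out a specific type of error? It seems that code like the following: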

except tweepy.error.TweepError as e:
    if e.reason[0]['code']

does not work and may itself raise an error. So how can I use tweepy.error.TweepError to identify a specific type of error?
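As a hedged sketch for tweepy 3.x: in the releases I am aware of, `TweepError` stores `reason` as a string, which would explain why `e.reason[0]['code']` fails, and it exposes the numeric code as `e.api_code` when the API returned one. The helper below reads `api_code` first and falls back to the raw payload in `e.args[0]`. The names `extract_api_code` and `FakeTweepError` are hypothetical; the latter is a stand-in so the snippet runs without tweepy. Among Twitter's documented codes, 144 is "No status found with that ID" and 88 is "Rate limit exceeded".

```python
class FakeTweepError(Exception):
    """Stand-in mimicking the attributes of tweepy 3.x's TweepError."""

    def __init__(self, reason, api_code=None):
        super().__init__(reason)
        self.reason = str(reason)   # tweepy 3.x keeps reason as text
        self.api_code = api_code    # numeric Twitter error code, if any


def extract_api_code(error):
    """Return the Twitter API error code carried by `error`, or None."""
    code = getattr(error, 'api_code', None)
    if code is not None:
        return code
    # Fall back to the raw payload, e.g. [{'code': 144, 'message': '...'}]
    payload = error.args[0] if error.args else None
    if isinstance(payload, list) and payload and isinstance(payload[0], dict):
        return payload[0].get('code')
    return None


NO_STATUS_FOUND = 144  # Twitter: "No status found with that ID."

try:
    raise FakeTweepError(
        [{'code': 144, 'message': 'No status found with that ID.'}],
        api_code=144,
    )
except FakeTweepError as e:
    if extract_api_code(e) == NO_STATUS_FOUND:
        print('Tweet is missing or was deleted')
    else:
        print('Other error:', e.reason)
```

With real tweepy 3.x the same pattern would be `except tweepy.error.TweepError as e:` followed by a comparison on `e.api_code`; this is an assumption about that version and should be checked against the installed release.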

Any suggestions and insights are welcome! Thanks!

0 answers:

No answers yet