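I am currently working on a project about finding traffic-related information in real-time tweets.
The dataset is available here: Tweets with traffic-related labels. However, the original dataset only contains the class label, the tweet ID, and the raw text. I would like more information about each tweet, such as its creation time and user ID, so I decided to use Tweepy to fetch it. Below is the code I wrote to get this information: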
# This module helps us build the twitter dataset for the following analysis
# Current training dataset only contains tweetID, class label and text
# To derive more interesting research output, more information should be considered
import pandas as pd
import tweepy
import os
import time
import numpy as np
# twitter_credentials stores the keys and access tokens
import twitter_credentials
# data_paths saves the local paths needed in this module
import data_paths
# # # # TWITTER AUTHENTICATOR # # # #
class TwitterAuthenticator():
    def authenticate_twitter_app(self):
        auth = tweepy.OAuthHandler(twitter_credentials.CONSUMER_KEY, twitter_credentials.CONSUMER_SECRET)
        auth.set_access_token(twitter_credentials.ACCESS_TOKEN, twitter_credentials.ACCESS_TOKEN_SECRET)
        return auth
# Get relevant tweet information based on tweet ID
class TweetAnalyzer():
    """
    Functionality to Build the Twitter Analysis Dataframe
    """
    def __init__(self, tweet_id):
        self.tweet_id = tweet_id
        self.auth = TwitterAuthenticator().authenticate_twitter_app()

    def tweets_to_data_frame(self):
        api = tweepy.API(self.auth)
        columns_order_list = ['user_id', 'tweet_id', 'date', 'Tweets', 'len', 'likes', 'retweets', 'lang',
                              'location', 'verified']
        tweet = api.get_status(self.tweet_id)
        df = pd.DataFrame(data=[tweet.text], columns=['Tweets'])
        df['tweet_id'] = np.array(tweet.id)
        # Count the number of characters in a tweet
        df['len'] = np.array(len(tweet.text))
        df['date'] = np.array(tweet.created_at)
        df['likes'] = np.array(tweet.favorite_count)
        df['retweets'] = np.array(tweet.retweet_count)
        df['lang'] = np.array(tweet.lang)
        df['location'] = np.array(tweet.place)
        # Get the account ID of the user
        df['user_id'] = np.array(tweet._json['user']['id'])
        # Check whether the account is verified or not
        df['verified'] = np.array(tweet._json['user']['verified'])
        df = df[columns_order_list]
        return df

    def on_error(self, status):
        if status == 420:
            # Return False in case the rate limit is hit
            return False
        print(status)
if __name__ == "__main__":
    training_data = pd.read_csv(os.path.join(data_paths.data_path, '1_TrainingSet_3Class.csv'),
                                names=['label', 'ID', 'text'])
    test_data = pd.read_csv(os.path.join(data_paths.data_path, '1_TestSet_3Class.csv'),
                            names=['label', 'ID', 'text'])
    ids_train = list(training_data['ID'])
    ids_test = list(test_data['ID'])
    train_dataframes_list = []
    test_dataframes_list = []
    print('Dealing with training data....')
    for index, raw_id in enumerate(ids_train):
        # Drop the first character of the stored ID before converting it
        true_id = int(raw_id[1:])
        print('Training Data: Coping with the {} tweetID: {}'.format(index+1, true_id))
        tweet_object = TweetAnalyzer(tweet_id=true_id)
        try:
            train_dataframes_list.append(tweet_object.tweets_to_data_frame())
            continue
        except tweepy.error.TweepError:
            print('No status found with the ID {}'.format(true_id))
            continue
        except tweepy.error.RateLimitError:
            print('When Index equals {}, we need to take a break. The ID is {}'.format(index+1, true_id))
            time.sleep(60 * 15)
            print('Restart coping with tweetID {}'.format(true_id))
            train_dataframes_list.append(tweet_object.tweets_to_data_frame())
            print('TweetID {} is done. Continue fetching information'.format(true_id))
            continue
        except StopIteration:
            print('The program stops when Index = {} and ID is {}'.format(index+1, true_id))
            break
    print('Done!')
    print('------------------------------------')
    print('Dealing with test data....')
    for index, raw_id in enumerate(ids_test):
        true_id = int(raw_id[1:])
        print('Test Data: Coping with the {} tweetID: {}'.format(index + 1, true_id))
        tweet_object = TweetAnalyzer(tweet_id=true_id)
        try:
            # Test rows go into the test list (the original appended to the training list)
            test_dataframes_list.append(tweet_object.tweets_to_data_frame())
            continue
        except tweepy.error.TweepError:
            print('No status found with the ID {}'.format(true_id))
            continue
        except tweepy.error.RateLimitError:
            print('When Index equals {}, we need to take a break. The ID is {}'.format(index + 1, true_id))
            time.sleep(60 * 15)
            print('Restart coping with tweetID {}'.format(true_id))
            test_dataframes_list.append(tweet_object.tweets_to_data_frame())
            print('TweetID {} is done. Continue fetching information'.format(true_id))
            continue
        except StopIteration:
            print('The program stops when Index = {} and ID is {}'.format(index + 1, true_id))
            break
    print('Done!')
    print('------------------------------------')
    check_dataframe_train = pd.concat(train_dataframes_list, axis=0)
    check_dataframe_test = pd.concat(test_dataframes_list, axis=0)
    check_dataframe_train.to_pickle(os.path.join(data_paths.desktop, 'train.pkl'))
    check_dataframe_test.to_pickle(os.path.join(data_paths.desktop, 'test.pkl'))
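However, when I run this code, for certain tweet IDs (for example, 872960543077928962) I always get the following message:
No status found with the ID 872960543077928962
because tweepy.error.TweepError is raised. But according to this stackoverflow question, I can open the tweet through the following link: https://twitter.com/statuses/872960543077928962
So my question is: why is this tweepy.error.TweepError raised? I have a separate handler for the rate-limit exception, so I do not think a rate-limit error is the cause here.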
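One thing that may be worth checking, offered as a sketch and not a confirmed diagnosis: in Tweepy 3.x, RateLimitError is a subclass of TweepError. If that holds for the installed version, an `except tweepy.error.TweepError` clause placed first also catches rate-limit errors, and the later `except tweepy.error.RateLimitError` clause never runs, so a rate-limit failure would be reported as "No status found". The classes and helper functions below (`FakeTweepError`, `FakeRateLimitError`, `classify_base_first`, `classify_subclass_first`) are hypothetical stand-ins that only mirror that hierarchy so the example runs without Twitter credentials.

```python
# Stand-in exception classes (hypothetical names) that mirror the
# Tweepy 3.x hierarchy, where RateLimitError subclasses TweepError.
class FakeTweepError(Exception):
    pass


class FakeRateLimitError(FakeTweepError):
    pass


def classify_base_first(exc):
    """Same clause order as in the question: base class first."""
    try:
        raise exc
    except FakeTweepError:
        return "no status found"
    except FakeRateLimitError:
        # Never reached: the base-class clause above already matched.
        return "rate limited"


def classify_subclass_first(exc):
    """Reordered clauses: the more specific class is tested first."""
    try:
        raise exc
    except FakeRateLimitError:
        return "rate limited"
    except FakeTweepError:
        return "no status found"
```

With these stand-ins, `classify_base_first(FakeRateLimitError("limit"))` returns "no status found", while `classify_subclass_first(FakeRateLimitError("limit"))` returns "rate limited". Python picks the first matching `except` clause, so the subclass has to be listed before its base class.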
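In addition, I have a question about tweepy.error.TweepError itself. I have seen this page: tweepy error response codes. But how can I use these codes to single out a specific type of error? Code such as the following:
except tweepy.error.TweepError as e:
    if e.reason[0]['code']
does not work and may raise an error of its own. So how can I use tweepy.error.TweepError to handle a specific type of error?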
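A minimal sketch of one way to branch on the error code, assuming Tweepy 3.x: there, TweepError stores `reason` as text, so `e.reason[0]` is a single character and indexing it with `'code'` raises a TypeError, while the numeric Twitter code is exposed separately as `e.api_code`. `FakeTweepError`, `KNOWN_CODES`, `describe_error`, and `handle` are hypothetical names introduced for illustration; the stand-in class only imitates those attributes so the example runs without credentials.

```python
# Hypothetical stand-in that imitates the attributes Tweepy 3.x sets on
# TweepError: `reason` is stored as text and `api_code` holds the
# numeric Twitter error code (None when no code could be parsed).
class FakeTweepError(Exception):
    def __init__(self, reason, response=None, api_code=None):
        super().__init__(reason)
        self.reason = str(reason)
        self.response = response
        self.api_code = api_code


# A few Twitter API v1.1 error codes; extend as needed.
KNOWN_CODES = {
    88: "rate limit exceeded",
    144: "no status found with that ID",
    179: "not authorized to see this status",
}


def describe_error(exc):
    """Map an exception's api_code to a short label."""
    code = getattr(exc, "api_code", None)
    return KNOWN_CODES.get(code, "unhandled error: {}".format(exc))


def handle(exc):
    """Catch the error, then branch on its api_code."""
    try:
        raise exc
    except FakeTweepError as e:  # with Tweepy 3.x: tweepy.error.TweepError
        return describe_error(e)
```

For example, `handle(FakeTweepError("[{'code': 179}]", api_code=179))` returns "not authorized to see this status". Printing the caught exception in the real loops, instead of a fixed "No status found" message, would also show which of these cases actually occurred.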
Any suggestions and insights are welcome. Thanks!