I have the following code, which extracts follower ids for a list of usernames and appends them to a pandas DataFrame named `new_followers_df`:

    twitter_handles = ["x", "y"]

    ## Import New Twitter Followers
    new_follower_ids = []
    handles = []
    for user in twitter_handles:
        while True:
            try:
                for page in tweepy.Cursor(api.followers_ids, screen_name=user).pages():
                    new_follower_ids.extend(page)
                    for ids in page:
                        handles.append(user)
            except tweepy.TweepError:
                time.sleep(60 * 15)
                continue
            except StopIteration:
                pass
            break

    new_followers_df = pd.DataFrame({
        "Handles": handles,
        "Follower_ID": new_follower_ids})

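If user x has 75,000 followers and user y has another 75,000, I calculated that it should take me about 30 minutes to fetch all of user x's and user y's followers. This is because the API is limited to 5000 ids per Cursor page, 15 calls per session, and a 15 minute wait in between.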
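
The estimate above can be checked with a few lines of arithmetic. This is only a rough sketch: the three constants are the limits quoted in the question, and it assumes every request returns a full page and that request latency is negligible. The function name `estimate` is made up for illustration.

```python
import math

IDS_PER_REQUEST = 5000      # ids returned per request (figure from the question)
REQUESTS_PER_WINDOW = 15    # requests allowed per rate-limit window (from the question)
WINDOW_MINUTES = 15         # length of one window in minutes (from the question)


def estimate(follower_counts):
    """Return (total_requests, windows_needed, minutes_spent_waiting)."""
    total_requests = sum(math.ceil(n / IDS_PER_REQUEST) for n in follower_counts)
    windows = math.ceil(total_requests / REQUESTS_PER_WINDOW)
    # The first window is assumed to be fresh; every further window costs one wait.
    minutes_waiting = max(windows - 1, 0) * WINDOW_MINUTES
    return total_requests, windows, minutes_waiting


print(estimate([75_000, 75_000]))  # (30, 2, 15)
```

Under these assumptions the job needs 30 requests spread over two windows, so roughly 15 minutes of waiting if the first window is fresh and closer to 30 if it is not. A run that takes much longer than that suggests some requests are being made more than once.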
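However, for some reason the script takes much longer than that to finish. Any idea what is wrong with my for loop? Could it have something to do with the `StopIteration`?

Thanks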
Answer 0 (score: 1)
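A few things could be happening. `pandas` may be taking some time to append 150,000 values to a DataFrame. If `page` is a generator, then using `page` twice (`extend(page)` and then `for ids in page`) may be costing you two calls. This is a bit of guesswork and I could be totally wrong. Either way, you can recode this so that it works a bit more elegantly and will hopefully cut down the slowness you're seeing.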
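First, you don't have to handle the rate limit yourself. `tweepy` can do this for you when you initialize the API. Presumably at some point in your code you have this line:

    api = tweepy.API(auth)

If we change that to:

    api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

then `tweepy` will wait whenever you hit the rate limit, and will print a message telling you that it is waiting.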
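
To see why it matters where the waiting happens, here is a small simulation. It does not reproduce tweepy's internals: `FakeApi`, `RateLimitError`, `fetch_resuming` and `fetch_restarting` are all hypothetical names, and the "wait" is just a callback that resets the fake quota. It compares retrying the same page after a wait with starting the paging again from the first page, which is what a `continue` around a freshly created Cursor amounts to.

```python
class RateLimitError(Exception):
    """Hypothetical stand-in for the error raised when the limit is hit."""


class FakeApi:
    """Hypothetical paged API that allows `limit` calls per window."""

    def __init__(self, pages, limit):
        self.pages = pages
        self.limit = limit
        self.used = 0    # calls spent in the current window
        self.calls = 0   # every attempted call, including rejected ones

    def reset(self):
        # Stands in for sleeping until the window resets.
        self.used = 0

    def get_page(self, index):
        self.calls += 1
        if self.used >= self.limit:
            raise RateLimitError
        self.used += 1
        return self.pages[index]


def fetch_resuming(api, wait):
    """After a wait, retry the page that failed and carry on from there."""
    ids, index = [], 0
    while index < len(api.pages):
        try:
            ids.extend(api.get_page(index))
            index += 1
        except RateLimitError:
            wait()
    return ids


def fetch_restarting(api, wait):
    """After a wait, start again from page 0 without clearing earlier results."""
    ids = []
    while True:
        try:
            for index in range(len(api.pages)):
                ids.extend(api.get_page(index))
            return ids
        except RateLimitError:
            wait()


pages = [[1, 2], [3, 4], [5, 6]]

api = FakeApi(pages, limit=3)
api.used = 2  # pretend an earlier user already spent most of the window
resumed = fetch_resuming(api, wait=api.reset)
resumed_calls = api.calls

api = FakeApi(pages, limit=3)
api.used = 2
restarted = fetch_restarting(api, wait=api.reset)
restarted_calls = api.calls

print(resumed, resumed_calls)      # [1, 2, 3, 4, 5, 6] 4
print(restarted, restarted_calls)  # [1, 2, 1, 2, 3, 4, 5, 6] 5
```

In this toy setup the restarting version spends an extra call and ends up with duplicate ids. It would also never finish if one user needed more pages than a single window allows, which is a good reason to let the waiting happen around each individual request.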
Once you have that in place, let's rewrite your code a little:

    twitter_handles = ["x", "y"]
    new_follower_ids = []
    handles = []

    for user in twitter_handles:
        current_user_followers = []
        for page in tweepy.Cursor(api.followers_ids, screen_name=user).pages():
            current_user_followers.extend(page)
        new_follower_ids.extend(current_user_followers)
        handles.extend([user for _ in current_user_followers])

    new_followers_df = pd.DataFrame({
        "Handles": handles,
        "Follower_ID": new_follower_ids})

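By keeping track of the current user's followers inside the for loop, we only need to extend the `handles` list once, at the end, after we have collected all of that user's followers. Since we then know how many followers the user has, we can add `user` to `handles` once per follower.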
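
The bookkeeping described above can be tried without touching the API. In this sketch `fake_pages` is made-up data standing in for the pages a Cursor would yield for each user, and `[user] * len(...)` is used as an equivalent of the list comprehension in the answer.

```python
# Hypothetical data standing in for the pages a Cursor would yield per user.
fake_pages = {
    "x": [[1, 2, 3], [4, 5]],
    "y": [[6], [7, 8]],
}

new_follower_ids = []
handles = []

for user, pages in fake_pages.items():
    current_user_followers = []
    for page in pages:
        current_user_followers.extend(page)
    new_follower_ids.extend(current_user_followers)
    # One handle entry per follower keeps the two lists the same length.
    handles.extend([user] * len(current_user_followers))

rows = list(zip(handles, new_follower_ids))
print(rows)
```

Because both lists grow by the same amount for each user, they stay aligned row by row, which is what the `pd.DataFrame` constructor needs when it is given two columns.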
Answer 1 (score: 0)

    import tweepy
    from datetime import datetime
    import pandas as pd

    new_followers_df = pd.DataFrame()

    def download_followers(user, api):
        """Return all follower ids for `user` as strings, or None on an API error."""
        all_followers = []
        try:
            for page in tweepy.Cursor(api.followers_ids, screen_name=user).pages():
                all_followers.extend(map(str, page))
            return all_followers
        except tweepy.TweepError:
            print('Could not access user {}. Skipping...'.format(user))

    # Include your keys below:
    consumer_key = ''
    consumer_secret = ''
    access_token = ''
    access_token_secret = ''

    # Set up tweepy API, with handling of rate limits
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    main_api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

    # List of usernames to get followers for
    lookup_users = ['x', 'y', 'z', 'a', 'b']

    for username in lookup_users:
        user_followers = download_followers(username, main_api)
        if user_followers:
            new_followers = pd.DataFrame({
                "Handles": username,
                "Follower_ID": user_followers,
                "Start_Date": datetime.now().strftime('%Y/%m/%d')})
            # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
            new_followers_df = pd.concat([new_followers_df, new_followers])
            print('Finished outputting: {} at {}'.format(username, datetime.now().strftime('%Y/%m/%d %H:%M:%S')))