Saving Tweepy results as a DataFrame in a loop: works without a loop, but saves as a list once the loop is added

Time: 2019-01-13 17:05:43

Tags: python pandas loops numpy tweepy

Problem: pull the timelines of multiple Twitter users and save them as a DataFrame.

The following solution works perfectly, but only for one user at a time:

import tweepy
import pandas as pd
import numpy as np

ACCESS_TOKEN = ""
ACCESS_TOKEN_SECRET = ""
CONSUMER_KEY = ""
CONSUMER_SECRET = ""

# OAuth process, using the keys and tokens
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

# Creation of the actual interface, using authentication
api = tweepy.API(auth, wait_on_rate_limit=True)


# Running on only one handle returns a DataFrame
tweets = api.user_timeline(screen_name='pycon', count=10)
print("Number of tweets extracted: {}.\n".format(len(tweets)))
data = pd.DataFrame(data=[tweet.text for tweet in tweets], columns= ['Tweets'])
data['len']  = np.array([len(tweet.text) for tweet in tweets])
data['ID']   = np.array([tweet.id for tweet in tweets])
data['Date'] = np.array([tweet.created_at for tweet in tweets])
data['Source'] = np.array([tweet.source for tweet in tweets])
data['Likes']  = np.array([tweet.favorite_count for tweet in tweets])
data['RTs']    = np.array([tweet.retweet_count for tweet in tweets])

print(data)

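The code above works well and returns the 10 most recent tweets from the user pycon in a DataFrame. The next step is to query multiple handles. Here is the code that does the same thing for several handles: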

#Added list of handles
handles = ['pycon', 'gvanrossum']
#Added Empty DF to fill
test = []
#Added loop
for handle in handles:
    tweets = api.user_timeline(screen_name=handle, count=10)
    print("Number of tweets extracted: {}.\n".format(len(tweets)))
    data = pd.DataFrame(data=[tweet.text for tweet in tweets], columns=['Tweets'])
    data['len']  = np.array([len(tweet.text) for tweet in tweets])
    data['ID']   = np.array([tweet.id for tweet in tweets])
    data['Date'] = np.array([tweet.created_at for tweet in tweets])
    data['Source'] = np.array([tweet.source for tweet in tweets])
    data['Likes']  = np.array([tweet.favorite_count for tweet in tweets])
    data['RTs']    = np.array([tweet.retweet_count for tweet in tweets])
    test.append(data)

print(test)

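Running this produces two objects. data is a DataFrame containing the 10 most recent tweets from gvanrossum (the second handle in the list, which makes sense, because data is overwritten on every iteration). The second object is test, which is a list. Interestingly, test holds all 20 tweets from pycon and gvanrossum, but as a list of DataFrames rather than a single DataFrame. The loop is working, but the result is not being saved as a DataFrame.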
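
The behavior described above can be reproduced without calling the Twitter API. The following is a minimal sketch in which two small hand-made frames stand in for the per-handle results; the name frames_by_handle and all values in it are hypothetical placeholders, not real tweets.

```python
import pandas as pd

# Hypothetical stand-in data: two tiny frames in place of the per-handle results.
frames_by_handle = {
    "pycon": pd.DataFrame({"Tweets": ["a", "b"], "Likes": [1, 2]}),
    "gvanrossum": pd.DataFrame({"Tweets": ["c", "d"], "Likes": [3, 4]}),
}

test = []
for handle, frame in frames_by_handle.items():
    data = frame       # rebound on every iteration, so only the last frame remains in data
    test.append(data)  # list.append stores each DataFrame as a separate list element

print(type(test))     # the container is a plain Python list
print(type(test[0]))  # each element of that list is a DataFrame
print(len(test))      # one element per handle
```

Because list.append never combines its arguments, test ends up as a list of DataFrames, while data only ever refers to the frame built in the last iteration.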

Question: how do I save the results of looping over multiple handles as a single DataFrame?

1 answer:

Answer 0: (score: 1)

If you want to store the data in a single DataFrame:

merged=pd.DataFrame()
#Added loop
for handle in handles:
    tweets = api.user_timeline(screen_name=handle, count=10)
    print("Number of tweets extracted: {}.\n".format(len(tweets)))
    data = pd.DataFrame(data=[tweet.text for tweet in tweets], columns=['Tweets'])
    data['len']  = np.array([len(tweet.text) for tweet in tweets])
    data['ID']   = np.array([tweet.id for tweet in tweets])
    data['Date'] = np.array([tweet.created_at for tweet in tweets])
    data['Source'] = np.array([tweet.source for tweet in tweets])
    data['Likes']  = np.array([tweet.favorite_count for tweet in tweets])
    data['RTs']    = np.array([tweet.retweet_count for tweet in tweets])
    # Add a 'Handle' column to identify the source of each tweet. Comment this out if you do not need it.
    data['Handle'] = handle
    #merging the data frames
    merged=pd.concat([merged,data])
print(merged)
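
As a possible variation, the per-handle frames can be collected in a list and concatenated once after the loop, which avoids copying the accumulated rows on every iteration. This is only a sketch under stated assumptions: fetch_rows is a hypothetical stand-in for api.user_timeline so that the example runs offline, and the rows it returns are made up.

```python
import pandas as pd

def fetch_rows(handle):
    """Hypothetical stand-in for api.user_timeline(); returns made-up rows."""
    return [
        {"Tweets": f"first post from {handle}", "Likes": 1, "RTs": 0},
        {"Tweets": f"second post from {handle}", "Likes": 2, "RTs": 1},
    ]

handles = ["pycon", "gvanrossum"]
frames = []
for handle in handles:
    data = pd.DataFrame(fetch_rows(handle))
    data["len"] = data["Tweets"].str.len()  # character count of each tweet
    data["Handle"] = handle                 # scalar is broadcast to every row
    frames.append(data)

# Concatenate once; ignore_index=True renumbers the rows 0..n-1
# instead of repeating each frame's own 0..k-1 index.
merged = pd.concat(frames, ignore_index=True)
print(merged)
```

With real data, the body of fetch_rows would be replaced by a call to api.user_timeline and a list comprehension over the returned tweets, as in the code above.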