Tracking keywords in a live stream of tweets

Time: 2012-08-08 02:22:20

Tags: python twitter tweepy

I installed tweepy and tried it out. Right now I am using the following function:

From the API Reference:

  

API.public_timeline()

     

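Returns the 20 most recent statuses from non-protected users who have set a custom user icon. The public timeline is cached for 60 seconds, so requesting it more often than that is a waste of resources.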

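However, I want to extract all tweets that match a certain regular expression from the full live stream. I could put public_timeline() inside a while True loop, but that would probably run into rate-limit problems. Either way, I don't think it could cover all current tweets.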

How can this be done? If not all tweets, then I would like to extract as many tweets as possible that match a specific keyword.

2 Answers:

Answer 0: (score: 2)

The streaming API is what you want. I use a library called tweetstream. Here is my basic listening function:

import time

import tweetstream

# username, password and db_initpop() are assumed to be defined elsewhere.

def retrieve_tweets(numtweets=10, *args):
    """
    This function optionally takes one or more arguments as keywords to filter tweets.
    It iterates through tweets from the stream that meet the given criteria and sends them
    to the database population function on a per-instance basis, so as to avoid disaster
    if the stream is disconnected.

    Both SampleStream and FilterStream methods access Twitter's stream of status elements.
    For status element documentation, (including proper arguments for tweet['arg'] as seen
    below) see https://dev.twitter.com/docs/api/1/get/statuses/show/%3Aid.
    """
    filters = [str(key) for key in args]
    if not filters:
        stream = tweetstream.SampleStream(username, password)
    else:
        stream = tweetstream.FilterStream(username, password, track=filters)
    try:
        count = 0
        while count < numtweets:
            for tweet in stream:
                # a check is needed on text as some "tweets" are actually just API operations
                # the language selection doesn't really work but it's better than nothing(?)
                if tweet.get('text') and tweet['user']['lang'] == 'en':
                    if tweet['retweet_count'] == 0:
                        # bundle up the features I want and send them to the db population function
                        bundle = (tweet['id'], tweet['user']['screen_name'],
                                  tweet['retweet_count'], tweet['text'])
                        db_initpop(bundle)
                        break
                    else:
                        # a RT has a different structure.  This bundles the original tweet.  Getting the
                        # retweets comes later, after the stream is de-accessed.
                        bundle = (tweet['retweeted_status']['id'],
                                  tweet['retweeted_status']['user']['screen_name'],
                                  tweet['retweet_count'], tweet['retweeted_status']['text'])
                        db_initpop(bundle)
                        break
            count += 1
    except tweetstream.ConnectionError as e:
        # report when and why the stream dropped
        print('Disconnected from Twitter at %s.  Reason: %s'
              % (time.strftime("%d %b %Y %H:%M:%S", time.localtime()), e.reason))

I haven't looked at this in a while, but I'm fairly sure this library only accesses the sample stream (as opposed to the firehose). HTH.

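Edited to add: you say you want the "full live stream", a.k.a. the firehose. That is expensive both financially and technically, and only very large companies are allowed to have it. Look at the documentation and you will see that the sample is basically representative.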

Answer 1: (score: 1)

Have a look at the streaming API. You can even subscribe to a list of words that you define, and only tweets matching those words are returned.
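The keyword subscription matches words, not regular expressions, so the regex from the question would still have to be applied on the client side to whatever the stream delivers. A minimal sketch of that step, assuming each message arrives as a dict with an optional 'text' key; filter_by_regex and sample_stream are hypothetical names used only for illustration, and the sample data is made up:

```python
import re

def filter_by_regex(tweets, pattern):
    """Yield only the tweets whose text matches the given regular expression."""
    regex = re.compile(pattern, re.IGNORECASE)
    for tweet in tweets:
        text = tweet.get('text')  # some stream messages carry no text at all
        if text and regex.search(text):
            yield tweet

# Stand-in for a keyword-filtered stream; real tweets would come from the library.
sample_stream = [
    {'id': 1, 'text': 'Learning Python 3 today'},
    {'id': 2, 'text': 'python2 is still around'},
    {'id': 3, 'text': 'I like snakes, especially the python'},
    {'limit': {'track': 5}},
]

# Keep tweets that mention "python" followed by a version digit.
matches = list(filter_by_regex(sample_stream, r'\bpython\s?\d\b'))
```

Subscribing to the broad keyword first and narrowing with the regex afterwards keeps the volume that has to be inspected locally as small as the API allows.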

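Rate limiting works differently for the streaming API: you get 1 connection per IP and a maximum number of events per second. If more events occur than that, you only get the maximum, along with a notification of how many events you missed because of rate limiting.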
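The missed-event notifications can be separated from ordinary tweets while reading the stream. The sketch below assumes the stream delivers one JSON object per line and that a notice has the form {"limit": {"track": N}}, with N being a running total of undelivered tweets since the connection was opened; that format is an assumption here and should be checked against the current documentation. split_stream_messages is a hypothetical helper, not part of any library:

```python
import json

def split_stream_messages(lines):
    """Separate tweets from rate-limit notices in raw stream lines."""
    tweets = []
    missed = 0
    for line in lines:
        line = line.strip()
        if not line:
            # blank lines are treated as keep-alives and skipped
            continue
        message = json.loads(line)
        if 'limit' in message:
            # assumed to be a running total, so keep the largest value seen
            missed = max(missed, message['limit'].get('track', 0))
        elif 'text' in message:
            tweets.append(message)
    return tweets, missed

# Made-up sample input standing in for raw lines read from the connection.
raw_lines = [
    '{"id": 1, "text": "first"}',
    '',
    '{"limit": {"track": 3}}',
    '{"id": 2, "text": "second"}',
    '{"limit": {"track": 7}}',
]
tweets, missed = split_stream_messages(raw_lines)
```

Tracking the missed count makes it possible to tell how complete the collected sample is for a given keyword.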

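My understanding is that the streaming API is best suited to servers that redistribute content to users as needed, rather than being accessed directly by users: standing connections are expensive, and after too many failed connections and reconnections Twitter starts blacklisting IPs, and later API keys.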
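Because repeated failed reconnections can get an IP blacklisted, a client would normally wait longer after each failure instead of retrying immediately. A minimal sketch of exponential backoff, assuming the connection attempt is wrapped in a callable that raises IOError on failure; connect_with_backoff, its parameters, and the delay values are illustrative choices, not Twitter's official numbers:

```python
import time

def connect_with_backoff(connect, max_attempts=5, base_delay=1.0,
                         max_delay=60.0, sleep=time.sleep):
    """Call connect() until it succeeds, doubling the wait after each failure."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except IOError:
            if attempt == max_attempts:
                raise  # give up and let the caller see the last error
            sleep(delay)
            delay = min(delay * 2, max_delay)  # cap the wait at max_delay
```

The sleep parameter is injectable so the function can be exercised without real delays; in production the default time.sleep would be used.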