Twitter feed: continuously collecting data with intermittent updates

Time: 2016-01-21 18:24:55

Tags: python twitter nltk tweepy

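I am collecting Twitter data. I filter the stream by the specified term "birthday". I would like to be able to compute word frequencies at the same time, without interrupting the data collection. Right now it just prints everything.

If I were to create another function, def data_processing(), that counts word frequencies, how would I access z = nltk.word_tokenize(extracted) from def on_data(self, data)?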

    import tweepy
    import json
    import nltk
    import time

    # counting
    import numpy
    from collections import Counter

    # Authentication details. To obtain these, visit dev.twitter.com
    consumer_key = 'XXX'
    consumer_secret = 'XXX'
    access_token = 'XXX'
    access_token_secret = 'XXX'
    sequence = []


    # This is the listener, responsible for receiving data
    class StdOutListener(tweepy.StreamListener):

        def on_data(self, data):
            # Twitter returns data in JSON format - we need to decode it first
            decoded = json.loads(data)

            # Skip messages that carry no tweet text (e.g. delete notices)
            if 'text' not in decoded:
                return True

            # Convert the text to ASCII, ignoring all characters that cannot be encoded
            extracted = decoded['text'].encode('ascii', 'ignore').decode('ascii')
            z = nltk.word_tokenize(extracted)

            print(z)
            return True


    if __name__ == '__main__':

        l = StdOutListener()
        auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
        auth.set_access_token(access_token, access_token_secret)

        print("Showing all new tweets for 'birthday':")

        # There are different kinds of streams: public stream, user stream, multi-user streams
        # In this example, track the term 'birthday'
        # For more details refer to https://dev.twitter.com/docs/streaming-apis
        stream = tweepy.Stream(auth, l, timeout=60)
        stream.filter(track=['birthday'])

2 Answers:

Answer 0: (score: 0)

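The print call is inside on_data.

So of course it prints every time data is received.

Instead, remove the print, wait until the 5 minutes are up, stop the stream, and then print once the stream has stopped.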
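A minimal sketch of that idea, under a few assumptions: it relies on the tweepy 3.x behaviour that returning False from on_data stops the stream, it does not subclass tweepy.StreamListener so that it runs without credentials, and a plain regular expression stands in for nltk.word_tokenize. The names FrequencyCollector, time_limit and report are hypothetical. Keeping the counts on the instance is also one way to reach the tokens from outside on_data.

```python
import json
import re
import time
from collections import Counter


class FrequencyCollector:
    """Accumulates word counts instead of printing every tweet.

    Hypothetical helper: in a real script this would subclass
    tweepy.StreamListener so that the stream calls on_data itself.
    """

    def __init__(self, time_limit=5 * 60):
        self.start = time.time()
        self.time_limit = time_limit
        self.counts = Counter()

    def on_data(self, data):
        # Assumption: returning False tells the stream to stop (tweepy 3.x).
        if time.time() - self.start > self.time_limit:
            return False
        decoded = json.loads(data)
        text = decoded.get('text')
        if text is None:
            return True  # not a tweet, e.g. a delete notice
        # Simple regex tokenizer standing in for nltk.word_tokenize.
        self.counts.update(re.findall(r"[a-z']+", text.lower()))
        return True

    def report(self, n=10):
        """Return the n most frequent words seen so far."""
        return self.counts.most_common(n)


# Feed a few hand-written messages in place of the live stream.
collector = FrequencyCollector(time_limit=5 * 60)
for raw in ('{"text": "Happy birthday to you"}',
            '{"text": "happy BIRTHDAY!"}',
            '{"delete": {}}'):
    collector.on_data(raw)

print(collector.report(2))
```

In the real script, report() would be called after stream.filter(...) returns, that is, after on_data has returned False.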

Answer 1: (score: 0)

Use

    import time
    time.sleep(5*60)

to pause execution for 5 minutes.
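Note that time.sleep blocks the thread that calls it, so it only gives intermittent updates if the data is being collected on a different thread. A rough sketch of that pattern follows, using a plain threading.Thread and a made-up tweet source in place of the Twitter stream; collect, snapshot and fake_tweets are hypothetical names, and the short sleeps stand in for the 5-minute pause. tweepy 3.x had an asynchronous flag on filter for running the stream in the background (named async, later is_async), but check the documentation for the version you use.

```python
import itertools
import threading
import time
from collections import Counter

counts = Counter()
lock = threading.Lock()
stop = threading.Event()


def collect(source):
    """Stand-in for the stream: count words until asked to stop."""
    for text in source:
        if stop.is_set():
            break
        with lock:
            counts.update(text.lower().split())
        time.sleep(0.01)  # simulate tweets arriving over time


def snapshot(n=3):
    """Read the most common words without stopping the collector."""
    with lock:
        return counts.most_common(n)


# Endless fake source so that collection runs until stop is set.
fake_tweets = itertools.cycle(["happy birthday", "birthday cake"])
worker = threading.Thread(target=collect, args=(fake_tweets,))
worker.start()

time.sleep(0.05)   # would be time.sleep(5 * 60) in the real script
print(snapshot())  # intermittent update; collection keeps running

time.sleep(0.05)
stop.set()         # ask the collector to finish
worker.join()
print(snapshot())
```

The lock keeps a snapshot from being taken halfway through an update; the sleeping thread never touches the collector except through snapshot and stop.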