Question

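I am collecting Twitter data. I filter the stream on the term "birthday". I would like to compute word frequencies at the same time, without interrupting the incoming data. Right now it just prints everything.

If I write another function, def data_processing(), that computes word frequencies, how can it access z = nltk.word_tokenize(extracted) from def on_data(self, data)?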

    import tweepy
    import json
    import nltk
    import time

    # counting
    import numpy
    from collections import Counter



    # Authentication details. To obtain these, visit dev.twitter.com
    # ('XXX' is a placeholder for each real credential)
    consumer_key = 'XXX'
    consumer_secret = 'XXX'
    access_token = 'XXX'
    access_token_secret = 'XXX'
    sequence = []

    # This is the listener, responsible for receiving data
    class StdOutListener(tweepy.StreamListener):

        def on_data(self, data):
            # Twitter returns data in JSON format - we need to decode it first
            decoded = json.loads(data)

            # Also, we convert UTF-8 to ASCII, ignoring all bad characters sent by users
            extracted = decoded['text'].encode('ascii', 'ignore').decode('ascii')
            z = nltk.word_tokenize(extracted)

            print(z)
            return True


    if __name__ == '__main__':

        l = StdOutListener()
        auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
        auth.set_access_token(access_token, access_token_secret)

        print("Showing all new tweets for 'birthday':")

        # There are different kinds of streams: public stream, user stream, multi-user streams
        # In this example, track the term 'birthday'
        # For more details refer to https://dev.twitter.com/docs/streaming-apis
        stream = tweepy.Stream(auth, l, timeout=60)
        stream.filter(track=['birthday'])

Answer 1

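The print call is inside on_data.

So of course it prints every time data is received.

Instead, remove the print, wait until the 5 minutes are up, stop the stream, and print once the stream has stopped.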
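
A minimal sketch of that idea, under some assumptions: FrequencyCollector is a hypothetical stand-in for the question's StdOutListener (in the real program it would subclass tweepy.StreamListener), a simple regex replaces nltk.word_tokenize so the snippet runs without extra downloads, and the clock parameter exists only to make the time limit easy to test. The counts live on the instance, so any other function can read them through the listener object. As far as I know, returning False from on_data tells a tweepy 3.x stream to stop, but check that against the version you have installed.

```python
import json
import re
import time
from collections import Counter


class FrequencyCollector:
    """Hypothetical stand-in for the question's StdOutListener.

    Kept free of tweepy so the sketch runs on its own; in the real
    program this would subclass tweepy.StreamListener.
    """

    def __init__(self, duration_seconds=5 * 60, clock=time.monotonic):
        self.counts = Counter()  # word -> number of occurrences so far
        self._clock = clock
        self._deadline = clock() + duration_seconds

    def on_data(self, data):
        decoded = json.loads(data)
        # Some stream messages carry no 'text' field, so fall back to ''.
        text = decoded.get('text', '')
        # Same ASCII clean-up as in the question.
        extracted = text.encode('ascii', 'ignore').decode('ascii')
        # Assumption: a lowercase regex split instead of nltk.word_tokenize.
        words = re.findall(r"[a-z']+", extracted.lower())
        self.counts.update(words)
        # True keeps the stream going; False once the time limit has passed.
        return self._clock() < self._deadline


if __name__ == '__main__':
    collector = FrequencyCollector(duration_seconds=5 * 60)
    # Made-up sample messages standing in for the live stream.
    samples = [
        json.dumps({'text': 'Happy birthday to you'}),
        json.dumps({'text': 'Birthday cake for my birthday'}),
    ]
    for raw in samples:
        if not collector.on_data(raw):
            break
    # Print once, after collection has finished, not on every tweet.
    print(collector.counts.most_common(3))
```

With the real stream, you would pass the listener to tweepy.Stream as in the question and print collector.counts only after stream.filter(...) returns.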

Answer 2

Use

    import time
    time.sleep(5*60)

to pause execution for 5 minutes.

The Twitter feed keeps collecting data continuously, through intermittent updates.
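
One caveat: sleeping only helps if the stream is collecting in the background while the main thread sleeps. In tweepy 3.x, stream.filter blocks by default, so the stream would have to be started asynchronously (I believe recent 3.x releases accept is_async=True for this, but treat that as an assumption about your version). Here is a rough, tweepy-free sketch of the pattern. collect_for and its source argument are made-up names, and the background thread just simulates tweets arriving:

```python
import itertools
import threading
import time
from collections import Counter


def collect_for(seconds, source, interval=0.001):
    """Count items from `source` in a background thread for `seconds` seconds."""
    counts = Counter()
    stop = threading.Event()

    def worker():
        for item in source:
            if stop.is_set():
                break
            counts[item] += 1
            time.sleep(interval)  # simulate tweets arriving intermittently

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    time.sleep(seconds)  # main thread pauses; the worker keeps collecting
    stop.set()           # with tweepy this would be stream.disconnect()
    thread.join()
    return counts


if __name__ == '__main__':
    # Hypothetical endless source standing in for the live stream.
    words = itertools.cycle(['birthday', 'cake', 'birthday'])
    # A short window for the demo; the answer's 5 * 60 would be used for real.
    result = collect_for(0.1, words)
    print(result.most_common(2))
```

The counts are printed once, after the sleep ends and the collector has stopped, so nothing interrupts the stream while it is running.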
