I am collecting Twitter data, filtering the stream on the term "birthday". I would like to count word frequencies at the same time, without interrupting the stream; right now it just prints everything.
If I create another function, def data_processing(), to count word frequencies, how can I access z = nltk.word_tokenize(extracted) from inside def on_data(self, data)?
import tweepy
import json
import nltk
# counting
from collections import Counter

# Authentication details. To obtain these visit dev.twitter.com
consumer_key = 'XXX'
consumer_secret = 'XXX'
access_token = 'XXX'
access_token_secret = 'XXX'

sequence = []

# This is the listener, responsible for receiving data
class StdOutListener(tweepy.StreamListener):
    def on_data(self, data):
        # Twitter returns data in JSON format - we need to decode it first
        decoded = json.loads(data)
        # Also, convert to ASCII, ignoring all bad characters sent by users
        extracted = decoded['text'].encode('ascii', 'ignore').decode('ascii')
        z = nltk.word_tokenize(extracted)
        print(z)
        return True

if __name__ == '__main__':
    l = StdOutListener()
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    print("Showing all new tweets for 'birthday':")
    # There are different kinds of streams: public stream, user stream, multi-user streams
    # In this example, track the term 'birthday'
    # For more details refer to https://dev.twitter.com/docs/streaming-apis
    stream = tweepy.Stream(auth, l, timeout=60)
    stream.filter(track=['birthday'])
Answer 0 (score: 0)
You call print inside the on_data function, so of course it prints every time data is received. Instead, remove the print, wait until the 5 minutes are up, stop the stream, and then print after the stream has stopped.
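A minimal sketch of that idea (hypothetical names; a plain class stands in for tweepy.StreamListener, and split() stands in for nltk.word_tokenize so the snippet runs stand-alone): accumulate counts in a Counter inside on_data, and return False once a deadline has passed, which is how a tweepy listener tells the stream to disconnect; the counts are printed only then.

```python
import json
import time
from collections import Counter

class CountingListener:
    # Sketch of the on_data logic; in the real script this would
    # subclass tweepy.StreamListener instead of being a plain class.
    def __init__(self, duration=5 * 60):
        self.counts = Counter()
        self.deadline = time.time() + duration

    def on_data(self, data):
        decoded = json.loads(data)
        text = decoded['text'].encode('ascii', 'ignore').decode('ascii')
        self.counts.update(text.split())  # stand-in for nltk.word_tokenize
        if time.time() >= self.deadline:
            print(self.counts.most_common(10))
            return False  # tweepy disconnects the stream when on_data returns False
        return True

# Simulated use with fake tweet payloads and a very short deadline:
listener = CountingListener(duration=0.1)
listener.on_data(json.dumps({'text': 'happy birthday'}))
time.sleep(0.2)
listener.on_data(json.dumps({'text': 'birthday again'}))
```

The same pattern works unchanged under tweepy: only the final print runs after the deadline, so the stream is never interrupted mid-run by output.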
Answer 1 (score: 0)
Use

import time
time.sleep(5 * 60)

to pause execution for 5 minutes.
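Neither answer shows how to hand z off to a separate data_processing() function, which was the original question. One minimal approach is a module-level Counter that both functions can see (a sketch: data_processing and word_counts are hypothetical names, and split() stands in for nltk.word_tokenize):

```python
from collections import Counter

word_counts = Counter()  # module-level accumulator, visible to both functions

def data_processing(tokens):
    # Update the running word-frequency count with one tweet's tokens.
    word_counts.update(tokens)

# Inside on_data you would simply call data_processing(z).
# Simulated here with two fake tweet texts:
for text in ["happy birthday to you", "birthday wishes to my friend"]:
    data_processing(text.split())  # stand-in for nltk.word_tokenize(extracted)

print(word_counts.most_common(3))
```

Because on_data only calls data_processing and returns, the stream keeps flowing; the accumulated frequencies can be printed whenever the stream is stopped.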