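I am doing content analysis on tweets. I use tweepy to return tweets that match certain terms and then write N tweets to a CSV file for analysis. Creating the files and fetching the data is not a problem, but I would like to reduce the data collection time. Currently I iterate through a list of terms read from a file. Once N is reached (e.g. 500 tweets), the script moves on to the next filter term.
I would like to put all the terms (fewer than 400) into a single variable and match on all of them at once. This works too. What I cannot get is a return value from Twitter indicating which term matched in a given status.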
import datetime
import sys

import tweepy

# NOTE: `auth` must be a tweepy auth handler created elsewhere (not shown in the question).
# This is Python 2 code written against the old tweepy StreamListener API.

class CustomStreamListener(tweepy.StreamListener):
    def __init__(self, output_file, api=None):
        super(CustomStreamListener, self).__init__(api)
        self.num_tweets = 0
        self.output_file = output_file

    def on_status(self, status):
        # Strip characters that would break the hand-built CSV row
        cleaned = status.text.replace('\'','').replace('&','').replace('>','').replace(',','').replace("\n",'')
        self.num_tweets = self.num_tweets + 1
        if self.num_tweets <= 500:
            # user.location can be None, so fall back to an empty string
            location = status.user.location or u''
            self.output_file.write(topicName + ',' + location.encode("UTF-8") + ',' + cleaned.encode("UTF-8") + "\n")
            print ("capturing tweet number " + str(self.num_tweets) + " for search term: " + topicName)
            return True
        else:
            # Returning False disconnects the stream so the loop moves on to the next term
            return False

    def on_error(self, status_code):
        print >> sys.stderr, 'Encountered error with status code:', status_code
        return True  # Don't kill the stream

    def on_timeout(self):
        print >> sys.stderr, 'Timeout...'
        return True  # Don't kill the stream

with open('termList.txt', 'r') as f:
    topics = [line.strip() for line in f]

for topicName in topics:
    stamp = datetime.datetime.now().strftime(topicName + '-%Y-%m-%d-%H%M%S')
    with open(stamp + '.csv', 'w+') as topicFile:
        sapi = tweepy.streaming.Stream(auth, CustomStreamListener(topicFile))
        sapi.filter(track=[topicName])
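Specifically, my question is this: if the track variable has multiple entries, how do I find out which one matched? I should add that I am relatively new to Python and tweepy.
Thanks in advance for any advice and help!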
Answer 0 (score: 0)
You can check the tweet text against your terms yourself. Something like:
>>> a = "hello this is a tweet"
>>> terms = ["this "]
>>> matches = []
>>> for i, term in enumerate(terms):
...     if term in a:
...         matches.append(i)
...
>>> matches
[0]
This gives you the indices of all the terms that match the particular tweet a. In this case that is only index 0, the term "this ".
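
A plain substring test can produce false hits (for example, "this" is a substring of "thistle") and it is case-sensitive. The sketch below is one possible refinement, not something the streaming API provides: it assumes Twitter's documented track behaviour, where a phrase matches when all of its space-separated words appear in the tweet, ignoring case and order. The function name `matched_terms` and the sample strings are made up for illustration. It only looks at the tweet text, so tweets that matched through a URL or screen name may come back with an empty list.

```python
import re

def matched_terms(text, terms):
    """Return the tracked phrases whose words all occur in `text` (hypothetical helper)."""
    # Lower-case the text and split it into word tokens, keeping a leading # or @
    tokens = set(re.findall(r"[#@]?\w+", text.lower()))
    # Also allow "#python" or "@python" to match the bare term "python"
    tokens |= {t.lstrip("#@") for t in tokens}
    hits = []
    for phrase in terms:
        words = phrase.lower().split()
        # A phrase counts as matched only if every one of its words is present
        if words and all(w in tokens for w in words):
            hits.append(phrase)
    return hits

tweet = "Hello, this is a tweet about #Python"
print(matched_terms(tweet, ["this", "python tweet", "java"]))
# prints ['this', 'python tweet']
```

Inside `on_status` you could call `matched_terms(status.text, topics)` and write the result into the CSV row in place of `topicName`.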