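I am trying to get the following code to exclude any tweet that contains a word from a list of restricted words. What is the best way to do this?
The code also returns only the last tweet once I exit the stream. Is there a way to write all of the applicable tweets to the CSV?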
import sys
import tweepy
import csv
#pass security information to variables
consumer_key = ''
consumer_secret = ''
access_key = ''
access_secret = ''
#use variables to access twitter
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)
#create an object called 'customStreamListener'
class CustomStreamListener(tweepy.StreamListener):
    def on_status(self, status):
        print (status.author.screen_name, status.created_at, status.text)
        # Writing status data
        with open('OutputStreaming.csv', 'w') as f:
            writer = csv.writer(f)
            writer.writerow([status.author.screen_name, status.created_at, status.text])
    def on_error(self, status_code):
        print >> sys.stderr, 'Encountered error with status code:', status_code
        return True # Don't kill the stream
    def on_timeout(self):
        print >> sys.stderr, 'Timeout...'
        return True # Don't kill the stream
# Writing csv titles
with open('OutputStreaming.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(['Author', 'Date', 'Text'])
streamingAPI = tweepy.streaming.Stream(auth, CustomStreamListener())
streamingAPI.filter(track=['Hasbro', 'Mattel', 'Lego'])
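Answer 0 (score: 0)
The documentation for the track parameter in the Twitter API states that terms cannot be excluded from the filter; it only accepts words and phrases to include. You have to implement an additional filter in your own code to discard tweets containing words you do not want in your result set.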
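As a rough illustration of such a client-side filter, here is a minimal sketch in Python 3. The function name `contains_banned_word`, the banned list, and the sample tweets are all hypothetical, and the regex-based word split is only an assumption about what counts as a "word"; it is not part of tweepy or the Twitter API.

```python
import re

def contains_banned_word(text, banned_words):
    """Return True if any banned word occurs as a whole word in text.

    Hypothetical helper: matching is case-insensitive and works on
    single words only, not multi-word phrases.
    """
    # Lowercase, then split on non-word characters so "GIVEAWAY!" and
    # "#giveaway" both yield the token "giveaway".
    tokens = set(re.findall(r"\w+", text.lower()))
    return any(word.lower() in tokens for word in banned_words)

# Made-up data, for illustration only.
banned = ['giveaway', 'spam']
tweets = [
    'New Lego set announced today',
    'Win a Lego GIVEAWAY now!',
    'Mattel earnings call #spam',
]

# Keep only the tweets that contain none of the banned words.
kept = [t for t in tweets if not contains_banned_word(t, banned)]
print(kept)
```

Inside a stream listener, the same check could be applied to `status.text` before anything is printed or written.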
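Answer 1 (score: 0)
It is not possible to exclude terms through the filter function, but you can implement your own selection. The basic idea is to check whether the words of the tweet include any disallowed word. You can tokenize the tweet's text simply with the nltk module.
A simple example from the nltk home page: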
>>> import nltk
>>> sentence = """At eight o'clock on Thursday morning
... Arthur didn't feel very good."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
In your case, sentence is the tweet's text (status.text).
So change your code to something like this:
def on_status(self, status):
    print (status.author.screen_name, status.created_at, status.text)
    is_allowed = True
    banned_words = ['word_1', 'word2', 'another_bad_word']
    words_text = nltk.word_tokenize(status.text)
    # loop banned_words and search if item is in words_text
    for word in banned_words:
        if word in words_text:
            # discard this tweet
            is_allowed = False
            break
    if is_allowed is True:
        # stuff for writing status data
        # ...
This code has not been tested, but it gives you a way to achieve your goal.
Let me know.
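On the second part of the question: the listener in the question opens OutputStreaming.csv in mode 'w' on every status, and 'w' truncates the file each time, which would explain why only the last tweet survives. A minimal sketch of one way around that, assuming Python 3: write the header once, then append each row. The helper names `write_header` and `append_row` are hypothetical.

```python
import csv

OUTPUT_PATH = 'OutputStreaming.csv'  # same file name as in the question

def write_header(path=OUTPUT_PATH):
    """Create (or overwrite) the CSV and write the title row once."""
    # 'w' truncates the file, so call this a single time before streaming.
    with open(path, 'w', newline='', encoding='utf-8') as f:
        csv.writer(f).writerow(['Author', 'Date', 'Text'])

def append_row(row, path=OUTPUT_PATH):
    """Append one row, keeping everything already in the file."""
    # 'a' opens for appending, so earlier rows are left in place.
    with open(path, 'a', newline='', encoding='utf-8') as f:
        csv.writer(f).writerow(row)
```

With this, `write_header()` would be called once before starting the stream, and `on_status` would call `append_row([status.author.screen_name, status.created_at, status.text])` for each tweet that passes the banned-word check.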