Question

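I'm using tweepy to process a public stream of tweets matching some keywords. This is pretty straightforward and has been described in multiple places: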

http://runnable.com/Us9rrMiTWf9bAAW3/how-to-stream-data-from-twitter-with-tweepy-for-python

http://adilmoujahid.com/posts/2014/07/twitter-analytics/

Copying the code directly from the second link:

#Import the necessary methods from tweepy library
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream

#Variables that contain the user credentials to access Twitter API
access_token = "ENTER YOUR ACCESS TOKEN"
access_token_secret = "ENTER YOUR ACCESS TOKEN SECRET"
consumer_key = "ENTER YOUR API KEY"
consumer_secret = "ENTER YOUR API SECRET"


#This is a basic listener that just prints received tweets to stdout.
class StdOutListener(StreamListener):

    def on_data(self, data):
        print(data)
        return True

    def on_error(self, status):
        print(status)


if __name__ == '__main__':

    #This handles Twitter authentication and the connection to Twitter Streaming API
    l = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, l)

    #This line filters Twitter Streams to capture data by the keywords: 'python', 'javascript', 'ruby'
    stream.filter(track=['python', 'javascript', 'ruby'])

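What I can't figure out is how to stream this data into a Python variable instead of printing it to the screen. I'm working in an IPython notebook and want to capture the stream in some variable, foo, after streaming for a minute or so. Furthermore, how do I get the stream to time out? As written, it runs indefinitely.

Related: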

Using tweepy to access Twitter's Streaming API

Answer 1

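Yes, in the post, @Adil Moujahid mentions that his code ran for 3 days. I adapted the same code and, for initial testing, made the following tweaks:

(a) Added a location filter to get a limited set of tweets instead of all tweets containing the keyword. See "How to add a location filter to tweepy module". From there, you can create an intermediate variable in the code above as follows: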

stream_all = Stream(auth, l)

Say we select the San Francisco area; we can then add:

stream_SFO = stream_all.filter(locations=[-122.75,36.8,-121.75,37.8])

This assumes that filtering by location takes less time than filtering by keyword.

(b) Then you can filter by keyword:

tweet_iter = stream_SFO.filter(track=['python', 'javascript', 'ruby'])

(c) Then you can write the result to a file as follows:

import json

with open('file_name.json', 'w') as f:
    json.dump(tweet_iter, f, indent=1)
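
One caveat with steps (a) to (c): in tweepy 3.x, `Stream.filter` blocks while the stream runs and does not return the tweets, and Twitter's streaming API combines `track` and `locations` as OR conditions rather than as successive filters. Below is a minimal sketch of another way to get the tweets into a variable and stop after a while, assuming the tweepy 3.x behaviour that the stream stops when the listener's `on_data` returns `False`. `TimedCollector`, `save_tweets`, `time_limit`, and `clock` are hypothetical names. The class is written without a tweepy import so that it runs on its own; in real use it would also inherit from `StreamListener`.

```python
import json
import time


class TimedCollector:
    """Keep incoming tweets in a list and ask the stream to stop after a deadline.

    Assumption: in real use this would subclass tweepy's StreamListener
    (tweepy 3.x) and call super().__init__(); the stream then stops as
    soon as on_data returns False.
    """

    def __init__(self, time_limit=60, clock=time.monotonic):
        self.clock = clock
        self.deadline = clock() + time_limit
        self.tweets = []  # plays the role of the variable foo

    def on_data(self, data):
        if self.clock() >= self.deadline:
            return False  # time is up: signal the stream to disconnect
        self.tweets.append(json.loads(data))
        return True

    def on_error(self, status):
        print(status)
        return False  # stop on errors instead of reconnecting


def save_tweets(tweets, path):
    """Write the collected tweets to a JSON file."""
    with open(path, "w") as f:
        json.dump(tweets, f, indent=1)


# Simulated payloads stand in for what the stream would deliver.
collector = TimedCollector(time_limit=60)
for raw in ['{"text": "python rocks"}', '{"text": "ruby too"}']:
    collector.on_data(raw)
foo = collector.tweets
print(len(foo))
```

With tweepy 3.x the collector would be passed to `Stream(auth, collector)`; once `filter` returns, `collector.tweets` holds whatever was captured. The deadline is only checked when a tweet arrives, so on a quiet keyword the stream can run somewhat past `time_limit`.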

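This should take much less time. Coincidentally, I was looking into the same question you posted today, so I have not measured the execution time.

Hope this helps.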

Answer 2

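I notice that you are looking to stream data into a variable for later use. The way I have done this is to create a method that streams the data into a database using sqlite3 and sqlalchemy.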

For example, here is the regular code. As you can see in it, we authenticate, create a listener, and then activate the stream:

import tweepy
import json
import time
import db_commands
import credentials

API_KEY = credentials.ApiKey 
API_KEY_SECRET = credentials.ApiKeySecret
ACCESS_TOKEN = credentials.AccessToken
ACCESS_TOKEN_SECRET = credentials.AccessTokenSecret

def create_auth_instance():
    """Set up Authentication Instance"""
    auth = tweepy.OAuthHandler(API_KEY, API_KEY_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
    api = tweepy.API(auth, wait_on_rate_limit = True)

    return api

class MyStreamListener(tweepy.StreamListener):
    """ Listen for tweets """
    def __init__(self, api=None):
        self.counter = 0
        # References the auth instance for the listener
        self.api = create_auth_instance()
        # Creates a database command instance
        self.dbms = db_commands.MyDatabase(db_commands.SQLITE, dbname='mydb.sqlite')
        # Creates a database table
        self.dbms.create_db_tables()


    def on_connect(self):
        """Notify when user connected to twitter"""
        print("Connected to Twitter API!")


    def on_status(self, tweet):
        """
        Every time a tweet is tracked, add the contents of the tweet,
        its username, text, and date created, into a sqlite3 database
        """         
        user = tweet.user.screen_name
        text = tweet.text
        date_created = tweet.created_at

        self.dbms.insert(user, text, date_created)


    def on_error(self, status_code):
        """Handle error codes"""
        if status_code == 420:
            # Return False if stream disconnects
            return False  

def main():
    """Create twitter listener (Stream)"""
    tracker_subject = input("Type subject to track: ")
    twitter_listener = MyStreamListener()
    myStream = tweepy.Stream(auth=twitter_listener.api.auth, listener=twitter_listener)
    myStream.filter(track=[tracker_subject], is_async=True)


main()
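
Because `filter` is called with `is_async=True`, the call returns immediately and the stream keeps running in a background thread. One way to run it for only a fixed time is to sleep and then disconnect, sketched below under the assumption of tweepy 3.x, where `Stream` has a `disconnect()` method. `run_for` and `FakeStream` are hypothetical names; `FakeStream` only stands in for `tweepy.Stream` so the sketch runs without credentials.

```python
import threading
import time


def run_for(stream, seconds, **filter_kwargs):
    """Start an async stream, wait `seconds`, then disconnect it."""
    stream.filter(is_async=True, **filter_kwargs)
    try:
        time.sleep(seconds)
    finally:
        stream.disconnect()


class FakeStream:
    """Hypothetical stand-in mimicking the two tweepy.Stream calls used above."""

    def __init__(self):
        self.running = False
        self.received = []
        self._thread = None

    def filter(self, track=None, is_async=False):
        self.running = True
        self._thread = threading.Thread(target=self._loop, args=(track,))
        self._thread.start()
        if not is_async:
            self._thread.join()  # blocking mode, as in a synchronous stream

    def _loop(self, track):
        # Pretend a matching tweet arrives every 10 ms.
        while self.running:
            self.received.append({"text": "about %s" % track[0]})
            time.sleep(0.01)

    def disconnect(self):
        self.running = False
        if self._thread is not None:
            self._thread.join()


stream = FakeStream()
run_for(stream, 0.1, track=["python"])
print(stream.running, len(stream.received) > 0)
```

With the code above, `run_for(myStream, 60, track=[tracker_subject])` would replace the direct `myStream.filter(...)` call to stream for about a minute.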

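The stream is created and activated by these lines in main():

twitter_listener = MyStreamListener()
myStream = tweepy.Stream(auth=twitter_listener.api.auth, listener=twitter_listener)
myStream.filter(track=[tracker_subject], is_async=True)

Every time a tweet is received, the on_status function is executed; it can be used to perform a set of actions on the tweet data being streamed:

def on_status(self, tweet):
    """
    Every time a tweet is tracked, add the contents of the tweet,
    its username, text, and date created, into a sqlite3 database
    """
    user = tweet.user.screen_name
    text = tweet.text
    date_created = tweet.created_at

    self.dbms.insert(user, text, date_created)

The tweet data, tweet, is captured in three variables, user, text, and date_created, which are then passed to the database object (self.dbms) that was initialized in the __init__ function of the MyStreamListener class. The insert function is called from the imported db_commands file.

The database code lives in the db_commands.py file, which is imported into the code with import db_commands.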
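
As a rough illustration of the interface the listener relies on (`SQLITE`, `MyDatabase`, `create_db_tables`, `insert`), here is a minimal sketch of what such a module might look like. It is an assumption, not the answer's actual file: it uses the standard-library `sqlite3` module directly instead of sqlalchemy, and the column names and the `fetch_all` helper are hypothetical.

```python
import sqlite3

SQLITE = "sqlite"  # constant the listener passes as the database type


class MyDatabase:
    """Hypothetical stand-in for db_commands.MyDatabase, backed by sqlite3."""

    def __init__(self, dbtype, dbname=":memory:"):
        if dbtype != SQLITE:
            raise ValueError("only the sqlite backend is sketched here")
        # The listener is created in the main thread but on_status runs in
        # the stream's thread when is_async=True, so allow cross-thread use.
        self.conn = sqlite3.connect(dbname, check_same_thread=False)

    def create_db_tables(self):
        """Create the tweets table if it does not exist yet."""
        with self.conn:
            self.conn.execute(
                "CREATE TABLE IF NOT EXISTS tweets ("
                "id INTEGER PRIMARY KEY AUTOINCREMENT, "
                "username TEXT, tweet_text TEXT, date_created TEXT)"
            )

    def insert(self, user, text, date_created):
        """Store one tweet; values are bound as parameters, not interpolated."""
        with self.conn:
            self.conn.execute(
                "INSERT INTO tweets (username, tweet_text, date_created) "
                "VALUES (?, ?, ?)",
                (user, text, str(date_created)),
            )

    def fetch_all(self):
        """Return every stored tweet as (username, tweet_text, date_created)."""
        cur = self.conn.execute(
            "SELECT username, tweet_text, date_created FROM tweets ORDER BY id"
        )
        return cur.fetchall()


db = MyDatabase(SQLITE, dbname=":memory:")
db.create_db_tables()
db.insert("alice", "hello python", "2020-01-01 00:00:00")
print(db.fetch_all())
```

Reading the rows back with `fetch_all` (or any SQLite client) is what makes the streamed tweets available for later use.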

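The db_commands.py code uses the sqlalchemy package to create a sqlite3 database and insert tweets into a tweets table. Sqlalchemy can be installed easily with pip install sqlalchemy. If you use the two pieces of code together, you should be able to scrape the tweets matching your filter into a database. Let me know if this helps or if you have further questions.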

Tweepy: Streaming data for X minutes?
