Question

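Background: I have a Python module set up to fetch JSON objects from a streaming API and store them in MongoDB using pymongo (bulk inserts of 25 at a time). For comparison, I also have a bash command that curls the same streaming API and pipes the output to mongoimport. The two approaches store their data in separate collections.

Periodically, I monitor the count() of the collections to check how they fare.

So far, I am seeing the Python module lag about 1000 JSON objects behind the curl | mongoimport approach.

Problem: How can I optimize my Python module so that it stays roughly in sync with curl | mongoimport?

I cannot use tweetstream, since I am not using the Twitter API but a third-party streaming service.

Could someone please help me out here?

Python module: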


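import cStringIO
import json

import pycurl
import pymongo

# DB_HOST, DB_NAME, STREAM_URL and AUTH are configuration constants defined elsewhere.

class StreamReader: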
    def __init__(self):
        try:
            self.buff = ""
            self.tweet = ""
            self.chunk_count = 0
            self.tweet_list = []
            self.string_buffer = cStringIO.StringIO()
            self.mongo = pymongo.Connection(DB_HOST)
            self.db = self.mongo[DB_NAME]
            self.raw_tweets = self.db["raw_tweets_gnip"]
            self.conn = pycurl.Curl()
            self.conn.setopt(pycurl.ENCODING, 'gzip')
            self.conn.setopt(pycurl.URL, STREAM_URL)
            self.conn.setopt(pycurl.USERPWD, AUTH)
            self.conn.setopt(pycurl.WRITEFUNCTION, self.handle_data)
            self.conn.perform()
        except Exception as ex:
            print "error occurred : %s" % str(ex)

    def handle_data(self, data):
        try:
            self.string_buffer = cStringIO.StringIO(data)
            for line in self.string_buffer:
                try:
                    self.tweet = json.loads(line)
                except Exception as json_ex:
                    print "JSON Exception occurred: %s" % str(json_ex)
                    continue

                if self.tweet:
                    try:
                        self.tweet_list.append(self.tweet)
                        self.chunk_count += 1
                        if self.chunk_count % 1000 == 0:
                            self.raw_tweets.insert(self.tweet_list)
                            self.chunk_count = 0
                            self.tweet_list = []

                    except Exception as insert_ex:
                        print "Error inserting tweet: %s" % str(insert_ex)
                        continue
        except Exception as ex:
            print "Exception occurred: %s" % str(ex)
            print repr(self.buff)

    def __del__(self):
        self.string_buffer.close()

Thanks for reading.

Answer 1

Originally there was a bug in your code.

                if self.chunk_count % 50 == 0:
                    self.raw_tweets.insert(self.tweet_list)
                    self.chunk_count = 0

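You reset chunk_count but you don't reset tweet_list. So the second time through you try to insert 100 items (50 new ones plus the 50 that were already sent to the DB the time before). You've since fixed this, but you still see a difference in performance.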
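The fix can be sketched in isolation. This is a minimal illustration rather than the asker's actual code: `Batcher`, `flush` and `sink` are hypothetical names, and `sink` stands in for a bulk insert call such as `self.raw_tweets.insert`. It is written so that it also runs on Python 3. Using the length of the list as the counter means there is only one piece of state to reset.

```python
class Batcher(object):
    """Collects documents and hands them to `sink` in fixed-size batches.

    `sink` is a hypothetical stand-in for a bulk insert call.
    """

    def __init__(self, sink, batch_size=50):
        self.sink = sink
        self.batch_size = batch_size
        self.pending = []

    def add(self, doc):
        self.pending.append(doc)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.pending:
            # Start a fresh list before sending, so a batch is never re-sent.
            batch = self.pending
            self.pending = []
            self.sink(batch)


# Usage: record each batch in a plain list instead of a database.
sent = []
batcher = Batcher(sent.append, batch_size=50)
for n in range(120):
    batcher.add({"n": n})
batcher.flush()  # send whatever is left over
print([len(batch) for batch in sent])  # [50, 50, 20]
```

Note that in this sketch a batch is dropped if `sink` raises; whether to retry it is a separate design decision.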

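The whole batch size thing turns out to be a red herring. I tried taking a large file of JSON and loading it via Python versus loading it via mongoimport, and Python was always faster (even in safe mode - see below).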

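Taking a closer look at your code, I realized the problem is that the streaming API is actually handing you data in chunks. You are expected to take those chunks and put them into the database as they are (that's what mongoimport is doing). The extra work your Python does to split up the stream, add each item to a list and then periodically send batches to Mongo is probably the difference between what I see and what you see.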

Try this snippet for your handle_data():

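# Assumes: from StringIO import StringIO
def handle_data(self, data):
    try:
        # Parse the whole chunk in one go instead of splitting it line by line.
        string_buffer = StringIO(data)
        tweets = json.load(string_buffer)
    except Exception as ex:
        print "Exception occurred: %s" % str(ex)
        return  # nothing was parsed, so there is nothing to insert
    try:
        self.raw_tweets.insert(tweets)
    except Exception as ex:
        print "Exception occurred: %s" % str(ex)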

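One thing to note is that your Python inserts are not running in "safe mode" - you should change that by adding the argument safe=True to the insert statement. You will then get an exception on any insert that fails, and your try/catch will print the error, exposing the problem.

It doesn't cost much in performance either - I'm currently running a test, and after about five minutes the sizes of the two collections are 14120 and 14113.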
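For readers on a newer driver: `safe=True` and `Collection.insert` belong to the older PyMongo releases used in this thread. As far as I know, PyMongo 3 and later acknowledge writes by default and `insert_many` raises when a write fails, so the same idea can be sketched as below. `insert_batch` is a hypothetical helper, `collection` is assumed to be any object exposing `insert_many(documents, ordered=...)`, and the stub classes exist only so the sketch runs without a database.

```python
def insert_batch(collection, docs):
    """Insert docs and report a failure instead of dropping it silently.

    Returns the number of documents reported as inserted, or 0 if the
    call raised (some documents may still have been written in that case).
    """
    if not docs:
        return 0
    try:
        result = collection.insert_many(docs, ordered=False)
    except Exception as ex:  # with PyMongo this would typically be a PyMongoError
        print("Error inserting batch: %s" % ex)
        return 0
    return len(result.inserted_ids)


class StubResult(object):
    """Hypothetical stand-in for the result object of insert_many."""

    def __init__(self, inserted_ids):
        self.inserted_ids = inserted_ids


class StubCollection(object):
    """Hypothetical stand-in for a collection, used only for illustration."""

    def __init__(self, fail=False):
        self.fail = fail

    def insert_many(self, documents, ordered=True):
        if self.fail:
            raise RuntimeError("simulated write failure")
        return StubResult(list(range(len(documents))))


ok_count = insert_batch(StubCollection(), [{"a": 1}, {"a": 2}])
failed_count = insert_batch(StubCollection(fail=True), [{"a": 1}])
```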

Answer 2

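I got rid of the StringIO library. Since the WRITEFUNCTION callback handle_data is, in this case, invoked for every line, I just load the JSON directly. Sometimes, however, data can contain two JSON objects. I am sorry, I can't post the curl command that I use, as it contains our credentials. But, as I said, this is a general issue applicable to any streaming API.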


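def handle_data(self, buf):
    try:
        # Common case: the callback received exactly one JSON object.
        self.tweet = json.loads(buf)
    except Exception as json_ex:
        # Otherwise assume several objects separated by \r\n.
        self.data_list = buf.split('\r\n')
        for data in self.data_list:
            if data.strip():  # skip empty pieces left over by the split
                self.tweet_list.append(json.loads(data))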
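A callback invocation may carry one object or several, and possibly (an assumption that goes beyond what is reported here) only part of an object. One way to cover all three cases is to buffer the incoming text and parse only complete lines. This is a minimal sketch, assuming records are separated by `\r\n` as in the snippet above and that each chunk is already a text string; `LineAssembler` and `feed` are hypothetical names, not part of pycurl or of any streaming API.

```python
import json


class LineAssembler(object):
    """Reassembles delimiter-separated JSON records from arbitrary chunks."""

    def __init__(self, delimiter="\r\n"):
        self.delimiter = delimiter
        self.partial = ""

    def feed(self, chunk):
        """Return the list of JSON objects completed by this chunk."""
        self.partial += chunk
        pieces = self.partial.split(self.delimiter)
        # The last piece has no trailing delimiter yet, so keep it for later.
        self.partial = pieces.pop()
        docs = []
        for piece in pieces:
            if not piece.strip():
                continue  # blank line, nothing to parse
            try:
                docs.append(json.loads(piece))
            except ValueError as ex:
                print("JSON Exception occurred: %s" % ex)
        return docs


# Usage: two complete objects plus the start of a third, then the remainder.
assembler = LineAssembler()
first = assembler.feed('{"id": 1}\r\n{"id": 2}\r\n{"id"')
second = assembler.feed(': 3}\r\n')
```

Each list returned by `feed` could then be appended to `tweet_list` or handed straight to a bulk insert.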

Optimization: Dumping JSON from a Streaming API to Mongo
