Question

I am using tweepy to process tweets:

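import json
from tweepy import Stream
from tweepy.streaming import StreamListener

class StdOutListener(StreamListener):
    def on_data(self, data):
        # Parse the raw JSON payload and hand it to the processing step
        process(json.loads(data))
        return True

l = StdOutListener()
stream = Stream(auth, l)
stream.filter(track=utf_words)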

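The process function fetches the content of the URLs contained in the tweet (with requests), processes the data with nltk (which I guess uses some CPU), and saves the results to Mongo.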

The problem is that fetching the content of the included URLs takes a long time, which limits my processing rate. How can I speed this up?

Answer 1

You can use Python's 'threading' module:

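import threading
from functools import reduce

class YourThreadSubclass(threading.Thread):
    def __init__(self, your_args):
        threading.Thread.__init__(self)
        # do whatever setup you want
        self.some_property = your_args
        self.result_field = None

    def run(self):
        # store the result so it can be collected after join()
        self.result_field = process_data(self.some_property)

# Iterable, process_data and combiner are placeholders for your own code
threads = [YourThreadSubclass(args) for args in Iterable]
for t in threads:
    t.start()
for t in threads:
    t.join()
result = reduce(combiner, (t.result_field for t in threads))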

More information here: http://docs.python.org/2/library/threading.html

Edit: More directly, you can spawn a thread each time on_data is called.

def on_data(self, data):
    YourThreadSubclass(data).start()

The spawned thread then stores its results asynchronously, without blocking the stream.
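
To make the fire-and-forget pattern concrete, here is a minimal self-contained sketch. It is only an illustration: slow_process is a hypothetical stand-in for process, the queue.Queue is a stand-in for the Mongo store, and on_data is written as a plain function rather than a tweepy listener method.

```python
import queue
import threading
import time

# Thread-safe stand-in for the real data store (assumption, not Mongo)
results = queue.Queue()

def slow_process(item):
    # Hypothetical stand-in for process(): simulate a slow URL fetch
    time.sleep(0.1)
    return item.upper()

class Worker(threading.Thread):
    def __init__(self, item):
        threading.Thread.__init__(self)
        self.item = item

    def run(self):
        # Runs in its own thread; the caller does not wait for it
        results.put(slow_process(self.item))

def on_data(data):
    # Returns immediately; the work continues in the background
    Worker(data).start()
    return True

for word in ["a", "b", "c", "d"]:
    on_data(word)

# Collect the four results, waiting up to 5 seconds for each one
collected = sorted(results.get(timeout=5) for _ in range(4))
```

Because each call starts a new thread, a burst of tweets creates a burst of threads, which is the reason to consider a pool for heavier loads.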

If you are handling a large number of requests, you may also want to use a thread pool to manage the threads; see the documentation.
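
One possible way to get a thread pool is concurrent.futures.ThreadPoolExecutor from the standard library (Python 3.2 and later, so newer than the Python 2 documentation linked above). The following is a minimal sketch, not the original poster's code: fetch_and_process is a hypothetical stand-in for process, and the pool size of 8 is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fetch_and_process(tweet):
    # Hypothetical stand-in for process(): simulate slow I/O
    time.sleep(0.1)
    return len(tweet)

# A fixed-size pool caps the number of concurrent threads,
# no matter how fast tweets arrive (8 is an arbitrary choice)
pool = ThreadPoolExecutor(max_workers=8)

def on_data(data):
    # submit() queues the work and returns a Future immediately
    return pool.submit(fetch_and_process, data)

futures = [on_data(t) for t in ["one", "three", "fifteen"]]
lengths = [f.result() for f in futures]  # blocks until each task finishes
pool.shutdown(wait=True)
```

Inside a real listener, on_data would submit the work and return True instead of returning the future; the future is returned here only so the results can be inspected.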

How to parallelize I/O-bound operations in Python?
