How can I use threads with a generator while keeping the order?

Date: 2018-02-10 19:57:21

Tags: python multithreading python-2.7 generator python-multithreading

I have a simple piece of code that runs a GET request for each item produced by a generator, and I am trying to speed it up:

def stream(self, records):
    # type(records) = <type 'generator'>
    for record in records:
        # record = OrderedDict([('_time', '1518287568'), ('data', '5552267792')])
        output = rest_api_lookup(record[self.input_field])

        record.update(output)
        yield record

Right now this runs on a single thread, and since each REST call waits for the previous one to finish, it takes forever.

I have used multithreading over lists in Python before, following this great answer (https://stackoverflow.com/a/28463266/1150923), but I am not sure how to reuse the same strategy on a generator instead of a list.

A fellow developer suggested that I break the generator up into lists of 100 elements and then close the pool, but I don't know how to create those lists from a generator.

I also need to keep the original order, because I need to yield record in the right order.
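
For reference, splitting a generator into fixed-size lists as suggested above can be done with itertools.islice. This is only a minimal sketch, not part of the original question; the chunk size of 100 is taken from the suggestion above:

from itertools import islice

def chunked(records, size=100):
    # Yield successive lists of at most `size` items from any iterable,
    # preserving the original order.
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk

# Each chunk can then be handed to a thread pool; concatenating the
# per-chunk results in chunk order keeps the overall order intact.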

3 Answers:

Answer 0 (score: 2)

I assume you don't want to turn your generator records into a list first. One way to speed up the processing is to pass the records into a ThreadPoolExecutor chunk-wise. The executor will process all items of a chunk concurrently with your rest_api_lookup. Then you just have to "unchunk" your results. Here is some running sample code (which does not use a class, sorry, but I hope it shows the principle):

from concurrent.futures import ThreadPoolExecutor
from time import sleep

pool = ThreadPoolExecutor(8) # 8 threads, adjust to taste and # of cores

def records():
    # simulates records generator
    for i in range(100):
        yield {'a': i}


def rest_api_lookup(a):
    # simulates REST call :)
    sleep(0.1)
    return {'b': -a}


def stream(records):
    def update_fun(record):
        output = rest_api_lookup(record['a'])
        record.update(output)
        return record
    chunk = []
    for record in records:
        # submit update_fun(record) into pool, keep resulting Future
        chunk.append(pool.submit(update_fun, record))
        if len(chunk) == 8:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

def unchunk(chunk_gen):
    """Flattens a generator of Future chunks into a generator of Future results."""
    for chunk in chunk_gen:
        for f in chunk:
            yield f.result() # get result from Future

# Now iterate over all results in same order as generated by records()    
for result in unchunk(stream(records())):
    print(result)

HTH!

Update: I added a sleep to the simulated REST call to make it more realistic. This chunked version finishes in 1.5 seconds on my machine; the sequential version takes 10 seconds (as expected: 100 * 0.1s = 10s).
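
As a follow-up, here is a hedged sketch of how the chunked approach above might be folded back into the stream method from the question. The class name Command and the rest_api_lookup stub are placeholders, not taken from the original post; only input_field and the record shape come from the question:

from concurrent.futures import ThreadPoolExecutor

def rest_api_lookup(value):
    # stand-in for the real REST call from the question
    return {'looked_up': value}

class Command(object):  # hypothetical name for the OP's class
    input_field = 'data'

    def __init__(self, workers=8):
        self.pool = ThreadPoolExecutor(workers)
        self.chunk_size = workers  # submit one batch per set of worker threads

    def stream(self, records):
        def lookup(record):
            record.update(rest_api_lookup(record[self.input_field]))
            return record

        chunk = []
        for record in records:
            chunk.append(self.pool.submit(lookup, record))
            if len(chunk) == self.chunk_size:
                for future in chunk:
                    yield future.result()  # results come back in submission order
                chunk = []
        for future in chunk:  # drain the final, possibly smaller chunk
            yield future.result()

# usage: for record in Command().stream(records): ...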

Answer 1 (score: 1)

Here is an example of how you can do this using concurrent.futures:

from concurrent import futures
from concurrent.futures import ThreadPoolExecutor

class YourClass(object):

    def stream(self, records):
        for record in records:
            output = rest_api_lookup(record[self.input_field])
            record.update(output)
        # process your list and yield back result.
        yield {"result_key": "whatever the result is"}

    def run_parallel(self):
        """ Use this method to do the parallel processing """

        # The important part - concurrent futures 
        # - set number of workers as the number of jobs to process - suggest 4, but may differ 
        #   this will depend on how many threads you want to run in parallel
        with ThreadPoolExecutor(4) as executor:
            # Use list jobs for concurrent futures
            # Use list scraped_results for results
            jobs = []
            parallel_results = []

            # Pass some keyword arguments if needed - per job  
            record1 = {} # your values for record1 - if need more - create
            record2 = {} # your values for record2 - if need more - create
            record3 = {} # your values for record3 - if need more - create
            record4 = {} # your values for record4 - if need more - create

            list_of_records = [[record1, record2], [record3, record4],]


            for records in list_of_records:
                # Here we iterate 'number of records' times, could be different
                # We're adding stream, could be different function per call
                jobs.append(executor.submit(self.stream, records))

            # Once parallel processing is complete, iterate over results 
            # append results to final processing without any networking
            for job in futures.as_completed(jobs):
                # Read result from future
                result = job.result()
                # Append to the list of results
                parallel_results.append(result)
            # Use sorted to sort by key to preserve order
            parallel_results = sorted(parallel_results, key=lambda k: k['result_key']) 
            # Iterate over results streamed and do whatever is needed
            for result in parallel_results:
                print("Do something with me {}".format(result))

Answer 2 (score: 0)

The answer by dnswlt works well but can still be improved. If the requests to the REST API (or whatever else should be done with each record) take a variable amount of time, some CPUs may sit idle while the slowest request of each batch is still running.

The following solution takes a generator and a function as input and applies the function to each element produced by the generator while keeping a given number of threads running (each of which applies the function to one element). At the same time, it still returns the results in the order of the input.

from concurrent.futures import ThreadPoolExecutor
import os
import random
import time

def map_async(iterable, func, max_workers=os.cpu_count()):
    # Generator that applies func to the input using max_workers concurrent jobs
    def async_iterator():
        iterator = iter(iterable)
        pending_results = []
        has_input = True
        thread_pool = ThreadPoolExecutor(max_workers)
        while True:
            # Submit jobs for remaining input until max_worker jobs are running
            while has_input and \
                    len([e for e in pending_results if e.running()]) \
                        < max_workers:
                try:
                    e = next(iterator)
                    print('Submitting task...')
                    pending_results.append(thread_pool.submit(func, e))
                except StopIteration:
                    print('Submitted all task.')
                    has_input = False

            # If there are no pending results, the generator is done
            if not pending_results:
                return

            # If the oldest job is done, return its value
            if pending_results[0].done():
                yield pending_results.pop(0).result()
            # Otherwise, yield the CPU, then continue starting new jobs
            else:
                time.sleep(.01)

    return async_iterator()

def example_generator():
    for i in range(20):
        print('Creating task', i)
        yield i

def do_work(i):
    print('Starting to work on', i)
    time.sleep(random.uniform(0, 3))
    print('Done with', i)
    return i

random.seed(42)
for i in map_async(example_generator(), do_work):
    print('Got result:', i)

Annotated output of a possible run (on a machine with 8 logical CPUs):

Creating task 0
Submitting task...
Starting to work on 0
Creating task 1
Submitting task...
Starting to work on 1
Creating task 2
Submitting task...
Starting to work on 2
Creating task 3
Submitting task...
Starting to work on 3
Creating task 4
Submitting task...
Starting to work on 4
Creating task 5
Submitting task...
Starting to work on 5
Creating task 6
Submitting task...
Starting to work on 6
Creating task 7
Submitting task...
Starting to work on 7      # This point is reached quickly: 8 jobs are started before any of them finishes
Done with 1                # Job 1 is done, but since job 0 is not, the result is not returned yet
Creating task 8            # Job 1 finished, so a new job can be started
Submitting task...
Creating task 9
Starting to work on 8
Submitting task...
Done with 7
Starting to work on 9
Done with 9
Creating task 10
Submitting task...
Creating task 11
Starting to work on 10
Submitting task...
Done with 3
Starting to work on 11
Done with 2
Creating task 12
Submitting task...
Creating task 13
Starting to work on 12
Submitting task...
Done with 12
Starting to work on 13
Done with 10
Creating task 14
Submitting task...
Creating task 15
Starting to work on 14
Submitting task...
Done with 8
Starting to work on 15
Done with 13                  # Several other jobs are started and completed
Creating task 16
Submitting task...
Creating task 17
Starting to work on 16
Submitting task...
Done with 0                   # Finally, job 0 is completed
Starting to work on 17
Got result: 0
Got result: 1
Got result: 2
Got result: 3                 # The results of all completed jobs are returned in input order, up to the next job that is still running
Done with 5
Creating task 18
Submitting task...
Creating task 19
Starting to work on 18
Submitting task...
Done with 16
Starting to work on 19
Done with 11
Submitted all task.
Done with 19
Done with 4
Got result: 4
Got result: 5
Done with 6
Got result: 6                  # Job 6 must have been a very long job; now that it's done, its result and the result of many subsequent jobs can be returned
Got result: 7
Got result: 8
Got result: 9
Got result: 10
Got result: 11
Got result: 12
Got result: 13
Done with 14
Got result: 14
Done with 15
Got result: 15
Got result: 16
Done with 17
Got result: 17
Done with 18
Got result: 18
Got result: 19

The run above took about 4.7 seconds, while sequential execution (setting max_workers=1) took about 23.6 seconds. Without the optimization that avoids waiting for the slowest execution of each batch, it takes about 5.3 seconds. Depending on the variation of the individual job times and on max_workers, the effect of the optimization may be even larger.
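
One more note, not taken from any of the answers above: concurrent.futures.Executor.map also yields results in the order of its input, so it can replace the hand-rolled chunking when the generator is safe to consume eagerly; in CPython, map submits one task per element up front before yielding any result. A minimal sketch under that assumption:

from concurrent.futures import ThreadPoolExecutor

def rest_api_lookup(value):
    # stand-in for the real REST call
    return {'looked_up': value}

def stream(records, input_field='data', workers=8):
    def lookup(record):
        record.update(rest_api_lookup(record[input_field]))
        return record

    with ThreadPoolExecutor(workers) as pool:
        # map() preserves input order, but it consumes the whole generator
        # and submits every task before the first result is yielded here,
        # so it trades memory for simplicity.
        for record in pool.map(lookup, records):
            yield record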