How can I fix my multithreading/multiprocessing with a dictionary?

Asked: 2019-04-24 22:54:36

Tags: python multithreading dictionary multiprocessing

I am making over 100K calls using two functions: with the first I contact the API and get the sysinfo (a dictionary) for each host, and with the second I go through that sysinfo and pull out the IP addresses. I am looking for a way to speed this up, but have never used multiprocessing/threading before (right now it takes roughly 3 hours).

from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool

#pool = ThreadPool(4)
p = Pool(5)

#obviously I removed a lot of the code that generates some of these
#variables, but this is the part that slooooows everything down. 

def get_sys_info(self, host_id, appliance):
    sysinfo = self.hx_request(
        "https://{}:3000//hx/api/v3/hosts/{}/sysinfo".format(appliance, host_id))
    return sysinfo

def get_ips_from_sysinfo(self, sysinfo):
    sysinfo = sysinfo["data"]
    network_array = sysinfo.get("networkArray", {})
    network_info = network_array.get("networkInfo", [])
    ips = []
    for ni in network_info:
        ip_array = ni.get("ipArray", {})
        ip_info = ip_array.get("ipInfo", [])
        for i in ip_info:
            ips.append(i)
    return ips

if __name__ == "__main__":
    for i in ids:
        sysinfo = rr.get_sys_info(i, appliance)
        hostname = sysinfo.get("data", {}).get("hostname")
        try:
            # note: Pool.map expects (func, iterable); this calls the function
            # eagerly and passes only its return value, hence the TypeError
            ips = p.map(rr.get_ips_from_sysinfo(sysinfo))
        except Exception as e:
            rr.logger.error("Exception on {} -- {}".format(hostname, e))
            continue

#Tried calling it here
ips = p.map(rr.get_ips_from_sysinfo(sysinfo))

I have to get through over 100,000 API calls, and that is really the part that slows everything down.

I think I have tried everything and gotten every possible "not iterable" and "missing arguments" error.

I would really appreciate any kind of help. Thank you!

4 Answers:

Answer 0 (score: 2)

You can use threads and a queue to communicate: first start get_ips_from_sysinfo as a single monitor thread that consumes every finished sysinfo dict placed on an output queue, then fire off all the get_sys_info threads, taking care not to end up with 100K live threads at once.

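A minimal sketch of that approach, assuming the `rr` instance, `ids` list, and `appliance` from the question (the batch size of 50 is an arbitrary choice):

import threading
import queue

output_q = queue.Queue()

def monitor():
    # single consumer: drain finished sysinfo dicts and extract the IPs
    while True:
        sysinfo = output_q.get()
        if sysinfo is None:  # sentinel: all fetches are done
            break
        ips = rr.get_ips_from_sysinfo(sysinfo)
        # do something with `ips`

def fetch(host_id):
    output_q.put(rr.get_sys_info(host_id, appliance))

consumer = threading.Thread(target=monitor)
consumer.start()

# run the fetches in batches so 100K threads are never alive at once
for start in range(0, len(ids), 50):
    workers = [threading.Thread(target=fetch, args=(host_id,))
               for host_id in ids[start:start + 50]]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

output_q.put(None)  # stop the consumer
consumer.join()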

Answer 1 (score: 1)

As @wwii commented, concurrent.futures offers some conveniences that may help you, especially since this looks like a batch job.

The performance hit most likely comes from the network calls, so multithreading is probably the better fit for your use case (here is a comparison with multiprocessing). If not, you can switch the pool from threads to processes while keeping the same API.

from concurrent.futures import ThreadPoolExecutor, as_completed
# You can import ProcessPoolExecutor instead and use the same APIs

def thread_worker(instance, host_id, appliance):
    """Wrapper for your class's `get_sys_info` method"""
    sysinfo = instance.get_sys_info(host_id, appliance)
    return sysinfo, instance

# instantiate the class that contains the methods in your example code
# I will call it `RR`
instances = (RR(*your_args, **your_kwds) for your_args, your_kwds 
    in zip(iterable_of_args, iterable_of_kwds))
all_host_ids = another_iterable
all_appliances = still_another_iterable

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=50) as executor:  # assuming 10 threads per core; your example uses 5 processes
        pool = {executor.submit(thread_worker, instance, _id, _app): (_id, _app)
                for instance, _id, _app in zip(instances, all_host_ids, all_appliances)}

        # handle the `sysinfo` dicts as they arrive
        for future in as_completed(pool):
            try:
                _sysinfo, _instance = future.result()
            except Exception as exc:  # just one way of handling exceptions
                # do something
                print(f"{pool[future]} raised {exc!r}")
            else:
                # enqueue results for parallel processing in a separate stage, or
                # process the results serially
                ips = _instance.get_ips_from_sysinfo(_sysinfo)
                # do something with `ips`

You can simplify this example by refactoring the methods into functions, if they really don't use state the way your example code suggests.
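For instance, a hypothetical function version of the fetch step (`hx_request` here stands in for whatever the class method actually calls):

def get_sys_info(host_id, appliance):
    # a plain function pickles more cleanly than a bound method,
    # which matters if you swap in ProcessPoolExecutor
    return hx_request(
        "https://{}:3000//hx/api/v3/hosts/{}/sysinfo".format(appliance, host_id))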

If extracting the data from sysinfo is expensive, you could put the results on a queue and feed them to a ProcessPoolExecutor that calls get_ips_from_sysinfo on the queued dicts, roughly as sketched below.
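A sketch of that two-stage pipeline under the same assumptions (`rr`, `ids`, and `appliance` from the question); note that `rr` must be picklable for the process pool to run its method:

from concurrent.futures import (ThreadPoolExecutor, ProcessPoolExecutor,
                                as_completed)

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=50) as tpool, \
            ProcessPoolExecutor() as ppool:
        # stage 1: network-bound fetches on threads
        fetches = [tpool.submit(rr.get_sys_info, host_id, appliance)
                   for host_id in ids]
        # stage 2: hand each finished dict to the process pool
        parses = [ppool.submit(rr.get_ips_from_sysinfo, fut.result())
                  for fut in as_completed(fetches)]
        for fut in as_completed(parses):
            ips = fut.result()
            # do something with `ips`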

Answer 2 (score: 1)

For whatever reason I wasn't entirely comfortable calling instance methods in multiple threads, but it seems to work. I made this toy example using concurrent.futures, which hopefully mimics your actual situation well enough. It submits 4000 instance-method calls to a thread pool of (at most) 500 workers. Playing with the max_workers value, I found execution-time improvements were linear up to about 1000 workers, and then the improvement ratio started to fall off.

import concurrent.futures, time, random

a = [.001*n for n in range(1, 4001)]  # possible delays: 1 ms up to 4 s

class F:
    def __init__(self, name):
        self.name = f'{name}:{self.__class__.__name__}'
    def apicall(self, n):
        # stand-in for a network call: sleep for a random delay
        wait = random.choice(a)
        time.sleep(wait)
        return (n, wait, self.name)

f = F('foo')

if __name__ == '__main__':
    nworkers = 500
    with concurrent.futures.ThreadPoolExecutor(nworkers) as executor:
#        t = time.time()
        futures = [executor.submit(f.apicall, n) for n in range(4000)]
        results = [future.result() for future in concurrent.futures.as_completed(futures)]
#        t = time.time() - t
#    q = sum(r[1] for r in results)
#    print(f'# workers:{nworkers} - ratio:{q/t}')

I didn't account for exceptions that might be raised during the method calls, but the example in the docs is pretty clear about how to handle that.
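A sketch of that docs pattern, dropped in place of the single results list comprehension above:

        # collect results, surfacing any exception a call raised
        results = []
        for future in concurrent.futures.as_completed(futures):
            try:
                results.append(future.result())
            except Exception as exc:
                print(f'apicall raised {exc!r}')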

Answer 3 (score: 0)

So... after a few days of research (thank you all so much!) and some outside reading (Fluent Python ch. 17 and Effective Python: 59 Specific Ways..), here is roughly where I landed.

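A sketch of the shape that ended up working, built with concurrent.futures as the answers above suggest (again assuming the `rr` instance, `ids` list, and `appliance` from the question):

import concurrent.futures

if __name__ == "__main__":
    all_ips = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=100) as executor:
        # one get_sys_info call per host id; map each future back to its id
        futures = {executor.submit(rr.get_sys_info, host_id, appliance): host_id
                   for host_id in ids}
        for future in concurrent.futures.as_completed(futures):
            host_id = futures[future]
            try:
                sysinfo = future.result()
            except Exception as e:
                rr.logger.error("Exception on {} -- {}".format(host_id, e))
                continue
            hostname = sysinfo.get("data", {}).get("hostname")
            # walking the dict is cheap, so do it inline
            all_ips[hostname] = rr.get_ips_from_sysinfo(sysinfo)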

*Modified so it can be used right out of the box; hopefully it helps someone else.