Making requests to many URLs with multithreading

Date: 2019-08-30 14:22:24

Tags: python multithreading python-requests

I have a list of relative URLs, PostLink, and a base URL, baseurl. I make a request to each URL in a loop. It works fine, but takes about six minutes.

import requests

baseurl = 'http://www.aaronsw.com/weblog/'
# One blocking GET per relative link; the requests run strictly one after another.
bowls = [requests.get(baseurl + i) for i in PostLink]

Now, since this job is I/O-bound, I hoped to speed up the crawl with multithreading.

I tried:

from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(6)
res = []
for i in PostLink:
    future = pool.submit(requests.get, baseurl + i)
    res.append(future.result())

I think I'm doing something wrong.
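My best guess is that calling future.result() inside the submission loop blocks on each response before the next request is submitted, so the six workers never actually run concurrently. If that is the problem, a minimal sketch of the fix would be to submit every task first and only collect the results afterwards:

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=6) as pool:
    # Submit all the requests up front so the six workers stay busy...
    futures = [pool.submit(requests.get, baseurl + i) for i in PostLink]
    # ...and only then block on each result, in submission order.
    res = [f.result() for f in futures]

Is that the right pattern? Any help is appreciated.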

2 answers:

Answer 0 (score: 1)

Here is some code that multiprocesses a list of items, running your_function in parallel on each item in the list:

from multiprocessing import Pool, cpu_count

def multi_processor(function_name):

    # Test data: six strings, so your_function runs six times in parallel
    # (assuming your CPU has at least that many cores)
    file_list = ["test1", "test2", "test3", "test4", "test5", "test6"]

    # Use one process per CPU core, minus one (pool workers are daemonic by default)
    pool = Pool(processes=cpu_count() - 1)

    results = {}
    # For every file in the list, start a new worker process
    for each_file in file_list:
        results[each_file] = pool.apply_async(function_name, args=("arg1", "arg2"))

    # Wait for all processes to finish before proceeding
    pool.close()
    pool.join()

    # Results (and any errors caught inside your_function) are returned, keyed by input item
    return {each_file: result.get() for each_file, result in results.items()}


def your_function(arg1, arg2):
    try:
        print("put your stuff in this function")
        your_results = ""
        return your_results
    except Exception as e:
        return str(e)

if __name__ == "__main__":
    some_results = multi_processor(your_function)
    print(some_results)
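
Applied to the question, the same pattern might look like the sketch below (PostLink and baseurl are assumed from the question; requests.Response objects pickle cleanly, so workers can return them to the parent process):

import requests
from multiprocessing import Pool, cpu_count

baseurl = 'http://www.aaronsw.com/weblog/'

def fetch_post(link):
    # Worker: fetch one post given its relative link (an entry of PostLink).
    return requests.get(baseurl + link)

if __name__ == "__main__":
    # Guard the pool so this also works with the spawn start method.
    with Pool(processes=cpu_count() - 1) as pool:
        async_results = {link: pool.apply_async(fetch_post, (link,)) for link in PostLink}
        pool.close()  # no more tasks will be submitted
        pool.join()   # wait for every worker to finish
        bowls = {link: r.get() for link, r in async_results.items()}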

Answer 1 (score: 0)

Using multiprocessing, this takes about 54 seconds:

from multiprocessing import Pool

with Pool(6) as p:
    bowls = p.map(requests.get, [baseurl + i for i in PostLink])
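
For an I/O-bound job like this, a thread pool should perform comparably while avoiding the interpreter startup and pickling overhead of processes, and a requests.Session can reuse TCP connections to the same host. A sketch under those assumptions (sharing one Session across threads is common for simple GETs, though requests does not formally guarantee thread safety):

import requests
from concurrent.futures import ThreadPoolExecutor

baseurl = 'http://www.aaronsw.com/weblog/'
session = requests.Session()  # keeps connections to the host alive between requests

def fetch_post(link):
    # PostLink entries are relative links, as in the question.
    return session.get(baseurl + link)

with ThreadPoolExecutor(max_workers=6) as pool:
    bowls = list(pool.map(fetch_post, PostLink))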