Python: send 1000+ URL requests in parallel and get the content

Date: 2018-03-05 23:49:17

Tags: python multithreading parallel-processing callback

I have a function that fetches the title and content of a new article and appends them to lists.

I want to call 1000+ URLs and run that function on each of them.

Goal: in a single run I end up with 1000+ titles and contents in lists, without looping over every URL and calling the function sequentially.

My code so far:

Setup

import requests
from newspaper import Article
from langdetect import detect  # detect() is used below; assuming the langdetect package
import threading

1) This is the "do the work" function:

def get_url_info(url):
    title = text = test_url = None  # stay None if the URL fails or is filtered out
    try:
        r = requests.head(url)
        if r.status_code < 400:  # only proceed if the URL loads
            article = Article(url)
            article.download()
            article.parse()
            if detect(article.title) == 'en':  # English only
                if len(article.text) > 50:  # filter out short permission-request pages
                    title = article.title.encode('ascii', errors='ignore')
                    text = article.text.encode('ascii', errors='ignore')
                    test_url = url
    except Exception as e:
        print(e, url)  # log problem URLs
    return title, text, test_url

2) This is the function that actually appends to the lists:

def get_text_list():
    text_list = []   # article content list
    test_urls = []   # urls that work
    title_list = []  # article titles
    url_list = get_tier_3()[:8000]  # get the first 8000 URLs (English articles) for testing
    threads = [threading.Thread(target=get_url_info, args=(url,)) for url in url_list]
    for i, thread in enumerate(threads):
        # originally this was: for url in url_list
        thread.start()
        """
        title, text, test_url = call do work here
        title_list.append(title)
        text_list.append(text)
        test_urls.append(test_url)
        """
        print(i)  # counts number of urls from DB processed
    return text_list, test_urls, title_list

Question: after setting up the threads, I don't know how to proceed and actually collect the information from each thread.

1 Answer:

Answer 0 (score: 0)

I think the multiprocessing module may be better suited to this task. Because of the way CPython is implemented (the global interpreter lock), the threading module cannot achieve true parallelism for CPU-bound work. The HTTP requests themselves are I/O-bound, but the parsing that newspaper does on each article is CPU-bound, so a process pool is a reasonable fit here.

I would also advise against spawning a separate thread or process for every URL you want to process. You are very unlikely to see any performance gain that way, because all of the threads or processes will be competing for system resources.

In general, a better solution is to spawn a smaller number of threads/processes and delegate a batch of URLs to each of them. A simple way to achieve this with a multiprocessing pool looks like this:

from multiprocessing import Pool

NUM_PROCS = 4  # example number of processes to be used in multiprocessing

def get_url_info(url):
    ...

def get_text_list():

    # Get your list of URLs
    url_list = get_tier_3()[:8000]

    # Output lists
    title_list, text_list, test_urls = [], [], []

    # Initialize a multiprocessing pool that will close after finishing execution.
    with Pool(NUM_PROCS) as pool:
        results = pool.map(get_url_info, url_list)

    for title, text, test_url in results:
        title_list.append(title)
        text_list.append(text)
        test_urls.append(test_url)

    return title_list, text_list, test_urls
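
One practical note: on platforms that start worker processes with spawn (Windows, and macOS on recent Python versions), the pool should be created from inside an if __name__ == '__main__': guard, and get_url_info must be defined at module level so it can be pickled. A minimal usage sketch, assuming the functions above live in the same module:

if __name__ == '__main__':
    title_list, text_list, test_urls = get_text_list()
    # URLs that failed or were filtered out come back with None fields,
    # so the lists may still need a filtering pass.
    print(len(test_urls), "URLs processed")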

I hope this helps!
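
Since the downloads themselves are I/O-bound, a bounded thread pool from the standard-library concurrent.futures module is another option worth comparing against the process pool. A minimal sketch, assuming get_url_info returns (None, None, None) for URLs that fail or are filtered out:

from concurrent.futures import ThreadPoolExecutor

def get_text_list_threaded(url_list, max_workers=20):
    title_list, text_list, test_urls = [], [], []
    # Run get_url_info over all URLs on a fixed number of threads.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        for title, text, test_url in executor.map(get_url_info, url_list):
            if title is None:  # skip URLs that failed or were filtered out
                continue
            title_list.append(title)
            text_list.append(text)
            test_urls.append(test_url)
    return title_list, text_list, test_urls

Threads share memory, so there is no pickling overhead, but the CPU-heavy parsing step still runs one thread at a time under the GIL; it is worth timing both approaches on a small sample of URLs before committing to one.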