Python multiprocessing in a for loop (requests and BeautifulSoup)

Time: 2018-06-27 07:07:41

Tags: python multithreading beautifulsoup python-requests multiprocessing

I have a list with many links, and I want to use multiprocessing to speed up the processing. This is a simplified version; I need the output ordered like this:

[link to an image of the desired output]

I have tried many things: Process, pools, and so on, but I always get errors. I need to do the work with 4 or 8 threads and keep the results ordered like this. Thanks for all your help. Here is the code:

from bs4 import BeautifulSoup
import requests
import time

links = ["http://www.tennisexplorer.com/match-detail/?id=1672704", "http://www.tennisexplorer.com/match-detail/?id=1699387", "http://www.tennisexplorer.com/match-detail/?id=1698990" "http://www.tennisexplorer.com/match-detail/?id=1696623", "http://www.tennisexplorer.com/match-detail/?id=1688719", "http://www.tennisexplorer.com/match-detail/?id=1686305"]

data = []

def essa(match, omega):
    aaa = BeautifulSoup(requests.get(match).text, "lxml")
    center = aaa.find("div", id="center")
    p1_l = center.find_all("th", class_="plName")[0].find("a").get("href")
    p2_l = center.find_all("th", class_="plName")[1].find("a").get("href")
    return p1_l + " - " + p2_l + " - " + str(omega)

i = 1

start_time = time.perf_counter()

for link in links:
    data.append(essa(link, i))
    i += 1

for d in data:
    print(d)

print(time.perf_counter() - start_time, "seconds")

2 Answers:

Answer 0 (score: 0):

Spawn multiple threads running the function and join them:

from threading import Thread

# requests, BeautifulSoup and the links list come from the question's code
from bs4 import BeautifulSoup
import requests

def essa(match, omega):
    aaa = BeautifulSoup(requests.get(match).text, "lxml")
    center = aaa.find("div", id="center")
    p1_l = center.find_all("th", class_="plName")[0].find("a").get("href")
    p2_l = center.find_all("th", class_="plName")[1].find("a").get("href")
    print(p1_l + " - " + p2_l + " - " + str(omega))


if __name__ == '__main__':
    threadlist = []
    for index, url in enumerate(links):
        t= Thread(target=essa,args=(url, index))
        t.start()
        threadlist.append(t)
    for b in threadlist:
        b.join()

They will not print in order, simply because some HTTP responses take longer than others.
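One way to keep the output ordered with this approach is to pre-allocate a results list and let each thread write to its own slot; the main thread prints the list only after all joins. This is a minimal sketch, assuming the essa function and links list from the question; fetch_into is a helper introduced here for illustration:

from threading import Thread

results = [None] * len(links)

def fetch_into(index, url):
    # each thread stores its result at its original position,
    # so the final printout keeps the order of the links list
    results[index] = essa(url, index + 1)

threads = [Thread(target=fetch_into, args=(i, url)) for i, url in enumerate(links)]
for t in threads:
    t.start()
for t in threads:
    t.join()

for r in results:
    print(r)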

Answer 1 (score: 0):

As I understand it, you have a list of links and want to make the requests concurrently to speed up the process. Here is sample code for multithreading; I hope it helps. Read the documentation on concurrent.futures.

import concurrent.futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

# Retrieve a single page and report the URL and contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
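Applied to the original problem, executor.map is convenient because it yields results in the same order as the input iterable, no matter which request finishes first. A rough sketch, assuming the essa function and links list from the question and a pool of 4 workers (the question mentions 4 or 8 threads):

import concurrent.futures

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    # map() returns results in input order, even though the requests run concurrently
    for result in executor.map(essa, links, range(1, len(links) + 1)):
        print(result)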