Question

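I need to scrape a large number of links from a website. I have about 70 base links, and from each of those I need to scrape more than 70 further links. Without threading/async this takes roughly 2-3 hours, so to speed it up I decided to try threading/async.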

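My problem is that I need to render some JavaScript before I can get the links. I have been using requests-html for this because its html.render() method is very reliable. However, when I try to run it with threading or async, I run into a lot of problems. I tried AsyncHTMLSession because of this GitHub PR, but could not get it to work. I was wondering whether anyone has ideas, or links to point me to, that might help.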

Here is some example code:

from multiprocessing.pool import ThreadPool
from requests_html import AsyncHTMLSession

links = (tuple of links)
n = 5
batch = [links[i:i+n] for i in range(0, len(links), n)]


def link_processor(batch_link):
    session = AsyncHTMLSession()
    results = []

    for l in batch_link:
        print(l)
        r = session.get(l)
        r.html.arender()
        tmp_next = r.html.xpath('//a[contains(@href, "/matches/")]')

    return tmp_next


pool = ThreadPool(processes=2)

output = pool.map(link_processor, batch)
pool.close()
pool.join()
print(output)

Output:

RuntimeError: There is no current event loop in thread 'Thread-1'.
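
For context, this RuntimeError is the message asyncio raises when code running in a non-main thread asks for the current event loop and none has been set for that thread. Note also that arender() is a coroutine method, so it has to be awaited rather than called like render(). The sketch below uses only the standard library to show one way of giving each worker thread its own event loop through asyncio.run(). fetch_stub is a hypothetical stand-in for awaiting a page fetch and render. Whether AsyncHTMLSession behaves well under this pattern is not established here, so treat it as an illustration of the event-loop issue rather than a verified fix.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


async def fetch_stub(link):
    # Hypothetical stand-in for awaiting a fetch and a JavaScript render.
    await asyncio.sleep(0)
    return f"rendered:{link}"


async def process_batch(batch_link):
    # Await each link in turn inside a single event loop.
    return [await fetch_stub(link) for link in batch_link]


def link_processor(batch_link):
    # asyncio.run() creates a fresh event loop in the calling thread,
    # runs the coroutine to completion, and closes the loop afterwards.
    return asyncio.run(process_batch(batch_link))


def run_in_threads(batches, workers=2):
    # Each worker thread calls link_processor, so each gets its own loop.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(link_processor, batches))
```

Calling run_in_threads(batches) returns one list of results per batch, in the same order as the input batches.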

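I was able to solve this with some help from the Learnpython subreddit. It turns out that requests-html probably uses threading internally in some way, so running it inside threads causes problems; simply using a multiprocessing pool works.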

Fixed code:

from multiprocessing import Pool
from requests_html import HTMLSession

.....

pool = Pool(processes=3)
output = pool.map(link_processor, batch[:2])
pool.close()
pool.join()
print(output)
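
One detail worth checking if link_processor is kept as in the first listing: it creates a results list but returns only tmp_next from the last link in the batch, and pool.map then returns one list per batch. A minimal sketch of accumulating the links for every page and flattening the per-batch output follows. extract_links is a hypothetical stand-in for the fetch, render, and xpath steps, and make_batches and flatten are illustrative helper names that do not appear in the original code.

```python
from itertools import chain


def make_batches(links, n):
    # Same slicing as in the question: consecutive chunks of at most n links.
    return [links[i:i + n] for i in range(0, len(links), n)]


def extract_links(link):
    # Hypothetical stand-in for session.get(link), render() and the xpath query.
    return [f"{link}/matches/1", f"{link}/matches/2"]


def link_processor(batch_link):
    results = []
    for link in batch_link:
        # Keep the links found on every page, not only the last one.
        results.extend(extract_links(link))
    return results


def flatten(per_batch_output):
    # pool.map returns one list per batch; merge them into a single list.
    return list(chain.from_iterable(per_batch_output))
```

With a pool, this would be used as flatten(pool.map(link_processor, make_batches(links, 5))). The built-in map can be substituted for pool.map to try it out sequentially.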

Threading/Async in Requests-html

0 answers: