Question

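I'm trying to speed up my web scraper, and I have thousands of sites I need to get information from. I'm trying to get the ratings and number of ratings that Facebook and Yelp show for each site on Google search pages. I would normally just use an API, but my list of sites is huge and time is of the essence, and Facebook's small hourly request limit makes their Graph API infeasible (I've tried...). My sites all show up on Google search pages. What I have so far (I've provided 8 sample sites for reproducibility):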

from multiprocessing.dummy import Pool
import requests
from bs4 import BeautifulSoup

pools = Pool(8) #My computer has 8 cores
proxies = MY_PROXIES

#How I set up my urls for requests on Google searches. 
#Since each item has a "+" in between in a Google search, I have to format 
#my urls to copy it.

site_list = ['Golden Gate Bridge', 'Statue of Liberty', 'Empire State Building', 'Millennium Park', 'Gum Wall', 'The Alamo', 'National Art Gallery', 'The Bellagio Hotel']

urls = list(map(lambda x: "+".join(x.split(" ")), site_list))

def scrape_google(url_list):

    info = []

    for i in url_list:

        reviews = {'FB Rating': None,
                   'FB Reviews': None,
                   'Yelp Rating': None,
                   'Yelp Reviews': None}   

        request = requests.get(i, proxies=proxies, verify=False).text

        search = BeautifulSoup(request, 'lxml')
        results = search.find_all('div', {'class': 's'}) #Where the ratings roughly are

        for j in results: 

            if 'Rating' in str(j.findChildren()) and 'yelp' in str(j.findChildren()[1]):
                reviews['Yelp Rating'] = str(j.findChildren()).partition('Rating')[2].split()[1] #Had to brute-force get the ratings this way.
                reviews['Yelp Reviews'] = str(j.findChildren()).partition('Rating')[2].split()[3]

            elif 'Rating' in str(j.findChildren()) and 'facebook' in str(j.findChildren()[1]):
                reviews['FB Rating'] = str(j.findChildren()).partition('Rating')[2].split()[1]
                reviews['FB Reviews'] = str(j.findChildren()).partition('Rating')[2].split()[3]

        info.append(reviews)

    return info

results = pools.map(scrape_google, urls)

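I tried something like this, but I think I'm getting far too many duplicated results. Will multithreading make this run faster? I profiled my code to see which parts take the most time, and by far the rate-limiting step is getting the requests.

EDIT: I just tried this, and I get the following error: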

Invalid URL 'h': No schema supplied. Perhaps you meant http://h?

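I don't understand what the problem is, because if I run my scrape_google function without multithreading it works just fine (albeit very slowly), so the validity of the URLs should not be the issue.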

Answer 1

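Yes, multithreading will probably make it run faster.

As a very rough rule of thumb, you can usually make about 8-64 requests in parallel, as long as no more than 2-12 of them go to the same host. So one very simple way to apply that is to hand all of your requests to a concurrent.futures.ThreadPoolExecutor with, say, 8 workers.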

In fact, that is the main example for ThreadPoolExecutor in the docs.
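A minimal sketch of that pattern, modeled on the docs example rather than copied from it. The names `fetch_all` and `fake_fetch` are hypothetical; `fake_fetch` stands in for a real download such as `requests.get(url).text` so that the sketch runs without a network.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(urls, fetch, max_workers=8):
    """Call fetch(url) for every URL on a thread pool.

    Returns a dict mapping each URL to its result, or to the exception
    that fetch raised for that URL.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit everything up front and remember which future belongs to which URL.
        future_to_url = {executor.submit(fetch, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # one failed request should not stop the rest
                results[url] = exc
    return results


def fake_fetch(url):
    # Hypothetical stand-in for requests.get(url).text
    return "<html>" + url + "</html>"


pages = fetch_all(["http://a.example", "http://b.example"], fake_fetch)
```

Collecting exceptions per URL, instead of letting the first one propagate, is a design choice for long scraping runs; re-raising may suit shorter jobs better.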

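(By the way, the fact that your computer has 8 cores is irrelevant here. Your code is not CPU-bound, it is I/O-bound. Whether you run 12 requests in parallel or even 500, at almost any given moment nearly all of your threads are waiting somewhere in socket.recv or a similar call, blocked until the server responds, so they are not using your CPU.)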
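To illustrate the I/O-bound point, here is a small sketch in which `time.sleep` stands in for a blocking network wait (an assumption made so the example needs no network). The helper names `slow_io` and `timed_run` are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_io(_):
    # time.sleep stands in for a blocking socket.recv: the thread waits
    # without using the CPU, so other threads can run in the meantime.
    time.sleep(0.2)
    return True


def timed_run(n_tasks, n_workers):
    """Run n_tasks simulated requests on n_workers threads; return (seconds, results)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_workers) as executor:
        done = list(executor.map(slow_io, range(n_tasks)))
    return time.perf_counter() - start, done


# 32 waits of 0.2 s each would take about 6.4 s one after another.
# With 32 threads the waits overlap, regardless of how many cores there are.
elapsed, done = timed_run(n_tasks=32, n_workers=32)
print("32 simulated requests took %.2f s" % elapsed)
```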

But:

"I think I'm getting far too many duplicated results"

Fixing this may help more than threading will. Of course, you can do both.

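From the limited information you have provided, I can't tell where your problem is, but there is a pretty obvious workaround: keep a set of everything you have seen so far. Whenever you get a new URL, if it is already in the set, throw it away instead of queueing up a new request.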
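A minimal sketch of that workaround. The helper name `unique` is hypothetical, and it treats two URLs as duplicates only when the strings are exactly equal; any normalization (case, trailing slashes, and so on) is left out.

```python
def unique(urls):
    """Yield each URL the first time it appears and skip later repeats."""
    seen = set()
    for url in urls:
        if url in seen:
            continue  # already queued once, so drop it
        seen.add(url)
        yield url


queue = list(unique(["Golden+Gate+Bridge", "The+Alamo", "Golden+Gate+Bridge"]))
# queue == ["Golden+Gate+Bridge", "The+Alamo"]
```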

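Finally:

"I would normally just use an API, but my list of sites is huge and time is of the essence, and Facebook's small hourly request limit makes it infeasible"

If you are trying to get around the rate limits of a major site, (a) you are probably violating its T&C, and (b) you will almost certainly trigger some kind of detection and get yourself blocked.¹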

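In your edited question, you tried to do this with multiprocessing.dummy.Pool.map, which is fine, but you got the arguments wrong.

Your function takes a list of URLs and loops over them: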

def scrape_google(url_list):
    # ...
    for i in url_list:

But then you call it with a single URL at a time:

results = pools.map(scrape_google, urls)

This is similar to using the builtin map, or a list comprehension:

results = map(scrape_google, urls)
results = [scrape_google(url) for url in urls]

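What happens if you get a single URL instead of a list of them, but try to use it as a list? A string is a sequence of its characters, so you loop over the characters of the URL one by one and try to download each character as if it were a URL. That is where the error about the invalid URL 'h' comes from.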
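A quick way to see this for yourself. The URL below is a hypothetical example, not one taken from the question.

```python
url = "https://www.google.com/search?q=Golden+Gate+Bridge"  # hypothetical example URL

# Looping over a string yields one character at a time, so this is what
# "for i in url_list" sees when url_list is really a single URL string.
first_items = [i for i in url][:5]
print(first_items)  # ['h', 't', 't', 'p', 's']
```

The first item is 'h', which is exactly the value that requests complains about.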

So, you want to change your function, like this:

def scrape_google(url):
    reviews = # …
    request = requests.get(url, proxies=proxies, verify=False).text
    # …
    return reviews

Now it takes a single URL and returns one set of reviews for that URL. pools.map will call it with each URL and give you back an iterable of reviews, one per URL.
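Put together, the calling pattern might look like the sketch below. The body of `scrape_google` here is a stand-in that does no network access or parsing (an assumption made so the example is runnable anywhere); the real function would make the request and fill in the ratings.

```python
from multiprocessing.dummy import Pool  # thread-backed Pool with the multiprocessing API


def scrape_google(url):
    # Stand-in body: the real function would call requests.get(url, ...)
    # and parse the page. Here it only records which URL it was given.
    reviews = {'URL': url,
               'FB Rating': None,
               'FB Reviews': None,
               'Yelp Rating': None,
               'Yelp Reviews': None}
    return reviews


urls = ['Golden+Gate+Bridge', 'Statue+of+Liberty', 'The+Alamo']

with Pool(8) as pool:
    # map calls scrape_google once per URL and keeps the results in input order.
    results = pool.map(scrape_google, urls)
```

Because map preserves input order, `results[i]` always corresponds to `urls[i]`, even though the calls run concurrently.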

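¹ Or maybe something more creative. A few years ago someone posted a question about a site that appeared to send corrupted responses, which seemed to be specifically crafted to make a typical scraper's regex burn exponential CPU time…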
