Unable to use different proxies to make subsequent requests

Time: 2019-01-11 10:21:22

Tags: python python-3.x web-scraping proxy

I've written a script in Python that uses proxies to scrape the links of different posts spread across the different pages of a website. My goal here is to make two subsequent requests using different proxies from a list.

The script takes a random proxy from the list and sends a request through the make_requests() function, then picks another proxy from the list and uses the newly populated links to make another request through the make_ano_requests() function.

Finally, the make_ano_requests() function prints the results.

However, if any proxy doesn't work, it gets kicked out of the list by either of the two functions, make_requests() or make_ano_requests().

When I run the script it appears to be working, but somewhere along the way it gets stuck and never finishes the task. How can I get it to finish the task?

Here is what I've written so far (proxyVault contains fake proxies here):


2 Answers:

Answer 0 (score: 3)

Your requests.get calls are most likely what causes the "hanging", because they have no timeout. As the documentation says:

Nearly all production code should use this parameter in nearly all requests. Failure to do so can cause your program to hang indefinitely.

I therefore suggest changing it to res = requests.get(url, proxies=proxy, timeout=1) to prevent it from hanging.

However, it is really sloooooow. To speed things up, I suggest dropping the second request altogether: instead of collecting the links from request 1, grab the strings [item.string for item in soup.select(".summary .question-hyperlink")], which are usually identical to the titles.
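
For illustration, here is a minimal sketch of that single-request approach, reusing the proxy handling and the proxyVault list from the full script below (the function name scrape_titles is illustrative, not from the question):

import requests
import urllib3
from random import choice
from bs4 import BeautifulSoup

def scrape_titles(url):
    # One request per listing page; no per-question request is needed.
    proxy_url = choice(proxyVault)
    proxy = {'https': f'http://{proxy_url}'}
    try:
        res = requests.get(url, proxies=proxy, timeout=1)
        soup = BeautifulSoup(res.text, "lxml")
        # The .string of each listing entry is normally identical to the question title.
        for item in soup.select(".summary .question-hyperlink"):
            print(item.string)
    except (requests.exceptions.ProxyError,
            requests.exceptions.Timeout,
            requests.exceptions.ConnectionError,
            urllib3.exceptions.MaxRetryError):
        if proxy_url in proxyVault:
            proxyVault.remove(proxy_url)
        return scrape_titles(url)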

Edit: added code that catches the timeouts from requests.get:

import requests
from random import choice
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import urllib3

base_url = 'https://stackoverflow.com/questions/tagged/web-scraping'
lead_urls = [f'https://stackoverflow.com/questions/tagged/web-scraping?sort='
            f'newest&page={page}&pagesize=50' for page in range(1, 5)]

linkList = []

proxyVault = ['103.110.37.244:36022', '180.254.218.229:8080', '110.74.197.207:50632', '1.20.101.95:49001', '200.10.193.90:8080', '173.164.26.117:3128', '103.228.118.66:43002', '178.128.231.201:3128', '1.2.169.54:55312', '181.52.85.249:31487', '97.64.135.4:8080', '190.96.214.123:53251', '52.144.107.142:31923', '45.5.224.145:52035', '89.218.22.178:8080', '192.241.143.186:80', '113.53.29.218:38310', '36.78.131.182:39243']

def make_requests(url):
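    # Grab a listing page through a randomly chosen proxy and collect the question
    # links; on a proxy/timeout failure the proxy is removed from proxyVault and
    # the request is retried with another one.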
    proxy_url = choice(proxyVault)
    proxy = {'https': f'http://{proxy_url}'}
    try:
        res = requests.get(url, proxies=proxy, timeout=1)
        soup = BeautifulSoup(res.text, "lxml")
        linkList.extend([urljoin(base_url, item.get("href")) for item in soup.select(".summary .question-hyperlink")])
    except (requests.exceptions.ProxyError,
            requests.exceptions.Timeout,
            requests.exceptions.ConnectionError,
            urllib3.exceptions.MaxRetryError):
        if proxy_url in proxyVault:
            proxyVault.remove(proxy_url)
            print(f'kicked out bad proxy by first func: {proxy_url}')
        return make_requests(url)

def make_ano_requests(url):
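    # Fetch an individual question page through another randomly chosen proxy and
    # hand the HTML to get_title(); failing proxies are removed and the request retried.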
    proxy_url = choice(proxyVault)
    proxy = {'https': f'http://{proxy_url}'}
    try:
        res = requests.get(url, proxies=proxy, timeout=1)
        get_title(res.text)
    except (requests.exceptions.ProxyError,
            requests.exceptions.Timeout,
            requests.exceptions.ConnectionError,
            urllib3.exceptions.MaxRetryError):
        if proxy_url in proxyVault:
            proxyVault.remove(proxy_url)
            print(f'kicked out bad proxy by second func: {proxy_url}')
        return make_ano_requests(url)

def get_title(response):
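    # Parse a fetched question page and print its title.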
    soup = BeautifulSoup(response, "lxml")
    print(soup.select_one("h1[itemprop='name'] a").text)

if __name__ == '__main__':
    for lead_url in lead_urls:
        make_requests(lead_url)

    for single_link in linkList:
        make_ano_requests(single_link)
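
One caveat about the retry logic above: both functions recurse after removing a bad proxy, so if every proxy in proxyVault turns out to be dead, choice() is eventually called on an empty list and raises IndexError. A minimal guard, with pick_proxy as an illustrative helper name, could look like this:

def pick_proxy():
    # Fail with a clear message instead of letting choice() crash on an empty list.
    if not proxyVault:
        raise RuntimeError('proxyVault is empty: every proxy has been kicked out')
    return choice(proxyVault)

Calling pick_proxy() instead of choice(proxyVault) in both functions makes that failure mode explicit.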

Answer 1 (score: 0)

You can speed up the proxy-filtering process by using asyncio and aiohttp. Something like this:

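A minimal sketch of that idea, assuming the goal is to check every proxy in proxyVault concurrently and keep only the ones that respond (check_proxy, filter_proxies, and the test URL are illustrative choices, not taken from the answer):

import asyncio
import aiohttp

TEST_URL = 'https://stackoverflow.com/questions/tagged/web-scraping'  # assumed test target

async def check_proxy(session, proxy_url):
    # Return the proxy if a request through it succeeds, otherwise None.
    try:
        async with session.get(TEST_URL,
                               proxy=f'http://{proxy_url}',
                               timeout=aiohttp.ClientTimeout(total=5)) as res:
            if res.status == 200:
                return proxy_url
    except (aiohttp.ClientError, asyncio.TimeoutError):
        pass
    return None

async def filter_proxies(proxies):
    # Check all proxies at the same time instead of one after another.
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(check_proxy(session, p) for p in proxies))
    return [p for p in results if p]

working_proxies = asyncio.run(filter_proxies(proxyVault))
print(f'{len(working_proxies)} working proxies: {working_proxies}')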