How do I redirect to the results page and scrape it from there using Scrapy?

Asked: 2019-04-16 10:48:32

Tags: python scrapy web-crawler url-redirection

I'm trying to scrape some flight data from kayak.com, but whenever I enter the URL of a results page it keeps redirecting me to a bot CAPTCHA page.

I've tried using scrapy-user-agent and also scrapy-fake-useragent-fix, but somehow I still get the same result.

import scrapy

class FlightSpider(scrapy.Spider):
    name = 'kayak'
    # allowed_domains must be a list of bare domains (no scheme or path)
    # and has to match the domain of the start URL
    allowed_domains = ['www.kayak.com.au']

    start_urls = [
        'https://www.kayak.com.au/flights/PER-MEL/2019-05-01?sort=price_a'
    ]

    # pass 302 responses to the spider instead of following the redirect
    handle_httpstatus_list = [302]

    def parse(self, response):
        # test: save the result page as HTML
        filename = 'test-1.html'
        with open(filename, 'wb') as f:
            f.write(response.body)

        # extract the departure time of each result
        for flights_time in response.xpath("//div[@class='resultWrapper']"):
            yield {
                # text() extracts the time itself rather than the whole <span>
                'dep_time': flights_time.xpath(".//span[@class='depart-time base-time']/text()").extract_first()
            }

This is the error I get:

2019-04-16 18:28:48 [scrapy.core.engine] DEBUG: Crawled (302) <GET https://www.kayak.com.au/flights/PER-MEL/2019-05-01?sort=price_a> (referer: https://www.kayak.com)

2 Answers:

Answer 0 (score: 0)

  "It keeps redirecting me to a bot CAPTCHA page."

That's because they know you are a bot. There are three things you can do to prevent this:

1. Tune the following variables in settings.py to make your calls more random. This is the easiest way to improve your scraper, but I don't think it will be enough on its own to get past Kayak's security.

Example code:

DOWNLOAD_DELAY = 0.75
RANDOMIZE_DOWNLOAD_DELAY = True
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_MAX_DELAY = 4
AUTOTHROTTLE_START_DELAY = 1.25
AUTOTHROTTLE_DEBUG = False
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.5
# The download delay setting will honor only one of the following:
CONCURRENT_REQUESTS_PER_DOMAIN = 1
CONCURRENT_REQUESTS_PER_IP = 1  # Scrapy's default is 0 (per-IP limit disabled)

2. User agents, which you already mentioned. In settings.py you can set DOWNLOADER_MIDDLEWARES:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
    'scraper1.middlewares.randomproxy.RandomProxyMiddleware': 100,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scraper1.middlewares.randomuseragent.RandomUserAgentMiddleware': 400,
}

Here you can add custom middlewares for this purpose. Try to get hold of a simple list of 100-200 user agents, for example from https://developers.whatismybrowser.com/useragents/explore/
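
For illustration, a minimal sketch of what such a random user-agent middleware could look like. The module path is assumed to match the scraper1.middlewares.randomuseragent entry in the dict above, and the two agents listed are just stand-ins for the list you download:

import random

class RandomUserAgentMiddleware(object):
    # Stand-in list; swap in the 100-200 agents you collected.
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0.3 Safari/605.1.15',
    ]

    def process_request(self, request, spider):
        # Scrapy calls this for every outgoing request; overwrite the UA header.
        request.headers['User-Agent'] = random.choice(self.USER_AGENTS)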


3. The same goes for proxies. You'll need a pool of different proxies to make your requests as random as possible.

In the middleware, something like this:

import logging
import random

class RandomProxyMiddleware(object):
    def __init__(self, settings):
        super(RandomProxyMiddleware, self).__init__()

        # Get proxies from the DB (ProxyService is this answer's own service)
        proxy_list = []
        try:
            proxy_list = ProxyService.get_proxies()
        except Exception:
            logging.critical('Failed to get proxies')

        self.proxies = []

        for proxy in proxy_list:
            self.proxies.append('http://' + str(proxy[0]) + ':' + str(proxy[1]))
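
As written, the snippet above only builds the proxy list. To actually apply a proxy, the class also needs Scrapy's from_crawler and process_request hooks, roughly like this (a sketch continuing the same class):

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy instantiates the middleware through this hook,
        # which is how the settings reach __init__.
        return cls(crawler.settings)

    def process_request(self, request, spider):
        # Attach a randomly chosen proxy to every outgoing request.
        if self.proxies:
            request.meta['proxy'] = random.choice(self.proxies)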

Edit: an example of the proxy middleware with a hard-coded list instead.

class RandomProxyMiddleware(object):
    def __init__(self, settings):
        super(RandomProxyMiddleware, self).__init__()

        # placeholder proxies; request.meta['proxy'] expects a scheme prefix
        self.proxies = [
            'http://proxy1.com:8000',
            'http://proxy2.com:8031'
        ]

You can also have a look at this library: https://pypi.org/project/scrapy-rotating-proxies/

It actually does a few extra things for you, such as checking whether a proxy is still working.
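
For reference, a minimal sketch of how that library is wired up in settings.py according to its documentation (the proxy addresses are placeholders):

ROTATING_PROXY_LIST = [
    'proxy1.com:8000',
    'proxy2.com:8031',
]

DOWNLOADER_MIDDLEWARES = {
    # ... your existing middlewares ...
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}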

Answer 1 (score: 0)

Try the following settings in settings.py (Scrapy only):

USER_AGENT  = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/532.2 (KHTML, like Gecko) ChromePlus/4.0.222.3 Chrome/4.0.222.3 Safari/532.2'
DEFAULT_REQUEST_HEADERS = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'no-cache',
    'cookie': 'Apache=; kykprf=158; cluster=5; kayak=; p1.med.sid=; NSC_q5-tqbslmf=; xp-session-seg=; kayak.mc=; _pxhd=""; G_ENABLED_IDPS=; NSC_q5-lbqj=;',
    'pragma': 'no-cache',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
}

Later on, add throttling and proxy rotation to keep it stable.