I'm trying to scrape some flight data from kayak.com, but whenever I enter the results page URL it keeps redirecting me to a bot CAPTCHA page.
I tried using scrapy-user-agent and scrapy-fake-useragent-fix, but I still get the same result.
import scrapy

class FlightSpider(scrapy.Spider):
    name = 'kayak'
    # allowed_domains must be a list of domains, without scheme or path
    allowed_domains = ['kayak.com.au']
    start_urls = [
        'https://www.kayak.com.au/flights/PER-MEL/2019-05-01?sort=price_a'
    ]
    handle_httpstatus_list = [302]

    def parse(self, response):
        # Test: save the result page as HTML
        filename = 'test-1.html'
        with open(filename, 'wb') as f:
            f.write(response.body)

        # Extract the departure time
        for flights_time in response.xpath("//div[@class='resultWrapper']"):
            yield {
                'dep_time': flights_time.xpath(".//span[@class='depart-time base-time']/text()").extract_first()
            }
This is the error I get:
2019-04-16 18:28:48 [scrapy.core.engine] DEBUG: Crawled (302) <GET https://www.kayak.com.au/flights/PER-MEL/2019-05-01?sort=price_a> (referer: https://www.kayak.com)
Answer 0 (score: 0)
"It keeps redirecting me to a bot CAPTCHA page."
That's because they know you are a bot. There are three things you can do to prevent this: slow your requests down, rotate proxies, and rotate user agents.
Example code:
DOWNLOAD_DELAY = 0.75
RANDOMIZE_DOWNLOAD_DELAY = True

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.25
AUTOTHROTTLE_MAX_DELAY = 4
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.5
AUTOTHROTTLE_DEBUG = False

# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 1
CONCURRENT_REQUESTS_PER_IP = 1
Then disable Scrapy's default proxy and user-agent middlewares and plug in your own via DOWNLOADER_MIDDLEWARES:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
    'scraper1.middlewares.randomproxy.RandomProxyMiddleware': 100,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scraper1.middlewares.randomuseragent.RandomUserAgentMiddleware': 400,
}
Here you can add some custom middlewares for your purpose. Try to get a simple list of 100-200 user agents, e.g. from https://developers.whatismybrowser.com/useragents/explore/
In the middleware it looks something like this:
import logging

class RandomProxyMiddleware(object):
    def __init__(self, settings):
        super(RandomProxyMiddleware, self).__init__()
        self.proxies = []
        # Get proxies from the database
        try:
            proxy_list = ProxyService.get_proxies()
        except Exception:
            logging.critical('Failed to get proxies')
            proxy_list = []
        for proxy in proxy_list:
            self.proxies.append('http://' + str(proxy[0]) + ':' + str(proxy[1]))
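The RandomUserAgentMiddleware referenced in the DOWNLOADER_MIDDLEWARES above is not shown in the answer; a minimal sketch could look like the following. The USER_AGENT_LIST setting name and the from_crawler wiring are assumptions for illustration, not part of the original answer:

```python
import random

# Hypothetical rotating user-agent middleware (names are illustrative).
class RandomUserAgentMiddleware(object):
    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        # Load the pool from a custom USER_AGENT_LIST entry in settings.py
        return cls(crawler.settings.getlist('USER_AGENT_LIST'))

    def process_request(self, request, spider):
        # Overwrite the User-Agent header with a random pick per request
        request.headers['User-Agent'] = random.choice(self.user_agents)
```

With a pool of 100-200 real browser strings, every request goes out with a different User-Agent header.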
Edit: a simpler, hard-coded proxy middleware example:
class RandomProxyMiddleware(object):
    def __init__(self, settings):
        super(RandomProxyMiddleware, self).__init__()
        self.proxies = [
            'proxy1.com:8000',
            'proxy2.com:8031'
        ]
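Both __init__ examples above only build the proxy pool; the middleware also needs a process_request method that attaches a proxy to each outgoing request. Scrapy's downloader honors request.meta['proxy'], so a minimal sketch (the proxy hosts are placeholders) might be:

```python
import random

class RandomProxyMiddleware(object):
    def __init__(self, settings):
        super(RandomProxyMiddleware, self).__init__()
        # Placeholder proxies; replace with your own pool
        self.proxies = [
            'http://proxy1.com:8000',
            'http://proxy2.com:8031',
        ]

    def process_request(self, request, spider):
        # Scrapy routes this request through whatever is in meta['proxy']
        request.meta['proxy'] = random.choice(self.proxies)
```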
You can also take a look at this library: https://pypi.org/project/scrapy-rotating-proxies/
It actually does some extra things, such as checking whether a proxy is still alive.
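For reference, scrapy-rotating-proxies is enabled through settings.py roughly like this. This is a sketch based on that library's documentation; verify the middleware paths and priority numbers against the project page linked above:

```python
# settings.py fragment for scrapy-rotating-proxies (hosts are placeholders)
ROTATING_PROXY_LIST = [
    'proxy1.com:8000',
    'proxy2.com:8031',
]

DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}
```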
Answer 1 (score: 0)
Try the following settings in settings.py (Scrapy only):
USER_AGENT = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/532.2 (KHTML, like Gecko) ChromePlus/4.0.222.3 Chrome/4.0.222.3 Safari/532.2'

DEFAULT_REQUEST_HEADERS = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'no-cache',
    'cookie': 'Apache=; kykprf=158; cluster=5; kayak=; p1.med.sid=; NSC_q5-tqbslmf=; xp-session-seg=; kayak.mc=; _pxhd=""; G_ENABLED_IDPS=; NSC_q5-lbqj=;',
    'pragma': 'no-cache',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
}
Later, use throttling and proxy rotation to keep it stable.