I am getting a 302 response from the server while scraping a website:
2014-04-01 21:31:51+0200 [ahrefs-h] DEBUG: Redirecting (302) to <GET http://www.domain.com/Site_Abuse/DeadEnd.htm> from <GET http://domain.com/wps/showmodel.asp?Type=15&make=damc&a=664&b=51&c=0>
I want to send a request to the GET URL instead of being redirected. Now I found this middleware:
https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/downloadermiddleware/redirect.py#L31
I added this redirect code to my middleware.py file, and added it to settings.py:

DOWNLOADER_MIDDLEWARES = {
    'street.middlewares.RandomUserAgentMiddleware': 400,
    'street.middlewares.RedirectMiddleware': 100,
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
}
But I am still getting redirected. Is this all I have to do to get this middleware to work? Am I missing something?
Answer 0 (score: 11)
Forget the middleware in this case; this will do the trick:

meta = {'dont_redirect': True, 'handle_httpstatus_list': [302]}
That is to say, you need to include the meta parameter when you yield the request:

yield Request(item['link'], meta={
    'dont_redirect': True,
    'handle_httpstatus_list': [302]
}, callback=self.your_callback)
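With those meta keys set, the raw 302 response is delivered to your callback instead of being followed. A minimal sketch of such a callback (the body of your_callback here is illustrative, not part of the original answer):

def your_callback(self, response):
    # dont_redirect means the 302 itself lands here instead of being followed
    if response.status == 302:
        location = response.headers.get('Location')
        self.logger.info('Got 302 from %s, Location: %s', response.url, location)
    else:
        # normal parsing of non-redirect responses goes here
        pass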
Answer 1 (score: 2)
An unexplained 302 response, such as a redirect from a page that loads fine in a web browser to the home page or some fixed page, usually indicates a server-side measure against unwanted activity. You must either reduce your crawl rate, or use a smart proxy (e.g. Crawlera) or a proxy-rotation service, and retry your requests when you get such a response.
To retry such a response, add 'handle_httpstatus_list': [302] to the meta of the source request, and check whether response.status == 302 in the callback. If it is, retry the request by yielding response.request.replace(dont_filter=True).

When retrying, you should also make your code limit the maximum number of retries for any given URL. You could keep a dictionary to track the retries:
from scrapy import Request, Spider

class MySpider(Spider):
    name = 'my_spider'
    max_retries = 2

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.retries = {}

    def start_requests(self):
        yield Request(
            'https://example.com',
            callback=self.parse,
            meta={
                'handle_httpstatus_list': [302],
            },
        )

    def parse(self, response):
        if response.status == 302:
            retries = self.retries.setdefault(response.url, 0)
            if retries < self.max_retries:
                self.retries[response.url] += 1
                yield response.request.replace(dont_filter=True)
            else:
                self.logger.error('%s still returns 302 responses after %s retries',
                                  response.url, retries)
            return
Depending on the scenario, you might want to move this code to a downloader middleware.
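A minimal sketch of what that middleware might look like, assuming the originating requests still set 'handle_httpstatus_list': [302] in their meta so the built-in RedirectMiddleware leaves those responses alone (Retry302Middleware is a hypothetical name; enable it through DOWNLOADER_MIDDLEWARES in settings.py):

class Retry302Middleware:
    max_retries = 2

    def __init__(self):
        self.retries = {}

    def process_response(self, request, response, spider):
        if response.status != 302:
            return response
        retries = self.retries.setdefault(request.url, 0)
        if retries < self.max_retries:
            self.retries[request.url] += 1
            # Returning a request halts the middleware chain and re-schedules it
            return request.replace(dont_filter=True)
        spider.logger.error('%s still returns 302 responses after %s retries',
                            request.url, retries)
        return response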
Answer 2 (score: 1)
"I added this redirect code to my middleware.py file, and added it to settings.py:"

DOWNLOADER_MIDDLEWARES_BASE shows that RedirectMiddleware is already enabled by default, so what you did does not matter.
"I want to send a request to the GET URL instead of being redirected."

How? The server responds with a 302 to your GET request. If you GET the same URL again, you will be redirected again.

What are you trying to achieve?
If you do not want to be redirected, see these questions:
Answer 3 (score: 1)
I had an issue with an infinite redirect loop when using HTTPCACHE_ENABLED = True. I managed to avoid the problem by setting HTTPCACHE_IGNORE_HTTP_CODES = [301, 302].
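For reference, a minimal settings.py sketch of that combination (both are standard Scrapy settings):

# settings.py
HTTPCACHE_ENABLED = True
# Do not cache 301/302 responses, so redirects are re-fetched instead of replayed
HTTPCACHE_IGNORE_HTTP_CODES = [301, 302]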
Answer 4 (score: 1)
I figured out how to bypass redirects with the following approach:

1 - Check whether you got redirected in parse().

2 - If redirected, arrange to simulate escaping this redirection and get back to the URL you need to scrape. You may need to monitor the network behaviour in Google Chrome and simulate the POST of a request to get back to your page.

3 - Move to another process using a callback, then do all the scraping work there through a recursive loop calling itself, with a condition at the end to break the loop.

In the example below, I used this to bypass a disclaimer page and then get back to my main URL and start scraping.
import scrapy
import requests
from scrapy.http import FormRequest
# TerrascanItem is the author's item class; import it from your project's items module

class ScrapeClass(scrapy.Spider):
    name = 'terrascan'
    page_number = 0

    start_urls = [
        # Your main URL, or a list of your URLs, or URLs read from a file into a list
    ]

    def parse(self, response):
        ''' Here I killed the disclaimer page and continued in the proc below with follow !!! '''
        # Get the currently requested URL
        current_url = response.request.url

        # Get all followed redirect URLs
        redirect_url_list = response.request.meta.get('redirect_urls')
        # Get the first URL followed by the spider
        redirect_url_list = response.request.meta.get('redirect_urls')[0]

        # Handle redirection as below (check redirection!!, got it from redirect.py
        # in the \downloadermiddlewares folder)
        allowed_status = (301, 302, 303, 307, 308)
        if 'Location' in response.headers or response.status in allowed_status:  # <== this is the redirection condition
            print(current_url, '<========= am not redirected @@@@@@@@@@')
        else:
            print(current_url, '<====== kill that please %%%%%%%%%%%%%')
            session_requests = requests.session()
            # Got all the data below from monitoring the network behaviour in Google
            # Chrome while simulating a click on 'I Agree'
            headers_ = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
                        'ctl00$cphContent$btnAgree': 'I Agree'
                        }
            Post_ = session_requests.post(current_url, headers=headers_)
            print(response.url, '<========= check this please')
            return FormRequest.from_response(Post_, callback=self.parse_After_disclaimer)

    def parse_After_disclaimer(self, response):
        print(response.status)
        print(response.url)
        # Put your condition here to make sure that the current URL is what you
        # need; otherwise escape again until you kill the redirection
        if response.url not in YOUR_URLS:  # placeholder: your list of URLs
            print('I am here brother')
            yield scrapy.Request(YOUR_URL, callback=self.parse_After_disclaimer)  # placeholder URL
        else:
            # Here you are good to go with the scraping work
            items = TerrascanItem()

            all_td_tags = response.css('td')
            print(len(all_td_tags), 'all_td_results', response.url)

            parcel_No = all_td_tags.css('#ctl00_cphContent_ParcelOwnerInfo1_lbParcelNumber::text').extract()
            Owner_Name = all_td_tags.css('#ctl00_cphContent_ParcelOwnerInfo1_lbOwnerName::text').extract()

            if parcel_No:
                items['parcel_No'] = parcel_No
            else:
                items['parcel_No'] = ''

            yield items

            # Here you put the condition for the recursive call of this process again
            ScrapeClass.page_number += 1
            # next_page = 'http://terrascan.whitmancounty.net/Taxsifter/Search/results.aspx?q=[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]&page=' + str(terraScanSpider.page_number) + '&1=1#rslts'
            next_page = YOUR_URLS[ScrapeClass.page_number]  # placeholder: your list of URLs
            print('am in page #', ScrapeClass.page_number, '===', next_page)
            # start_urls_AfterDisclaimer must be defined on the class for this check
            if ScrapeClass.page_number < len(ScrapeClass.start_urls_AfterDisclaimer) - 1:  # 20
                yield response.follow(next_page, callback=self.parse_After_disclaimer)
Answer 5 (score: 0)
You can disable the RedirectMiddleware by setting REDIRECT_ENABLED to False in settings.py.
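For example, in settings.py:

# settings.py (REDIRECT_ENABLED is a standard Scrapy setting)
REDIRECT_ENABLED = False
# Note: HttpErrorMiddleware still filters out non-2xx responses unless you
# allow the status codes via handle_httpstatus_list or HTTPERROR_ALLOWED_CODES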