I'm trying to get locations from the Valero store locator. Using Chrome dev tools, I can see the following requests related to the data I ultimately need (store locations). I'm not sure whether I need all of them, but I'm including them all for reference.
GET requests in Doc
https://www.valero.com/en-us/ProductsAndServices/Consumers/StoreLocator
If we open dev tools and enter a zip code, we can watch the requests that follow and how the data comes back.

POST requests in Doc

GET requests in XHR (do we need these?)
https://www.valero.com/en-us/_api/web/lists/getbytitle('Headerlinks')/items?$orderby=valeroOrder
https://www.valero.com/en-us/_api/web/lists/getbytitle('FooterLinks')/items?$orderby=Group,Order0
https://www.valero.com/en-us/_api/web/lists/getbytitle('Alerts')/items?$filter=Begins%20lt%20datetime%272018-09-19T16:51:52.140Z%27%20and%20Expires%20gt%20datetime%272018-09-19T16:51:52.140Z%27
POST requests in XHR
https://valeromaps.valero.com/Home/GetDetailMaster?SPHostUrl=https%3A%2F%2Fwww.valero.com%2Fen-us
https://valeromaps.valero.com/Home/Search?SPHostUrl=https%3A%2F%2Fwww.valero.com%2Fen-us
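(For reference, a standalone probe of that last Search endpoint, outside Scrapy, might look like the sketch below. The form fields are the ones captured in DevTools and reused in parse_page3 further down; whether the endpoint answers without the site's cookies is exactly what is in question.)

# Standalone probe of the Search endpoint with the requests library.
# Assumption: the same bounding-box form fields captured in DevTools.
import requests

url = 'https://valeromaps.valero.com/Home/Search?SPHostUrl=https%3A%2F%2Fwww.valero.com%2Fen-us'
form = {
    'NEBound_Lat': '31.943824833980116',
    'NEBound_Long': '-94.08231139453125',
    'SWBound_Lat': '27.167727791447785',
    'SWBound_Long': '-103.12955260546875',
    'center_Lat': '29.555776312713952',
    'center_Long': '-98.605932',
}
resp = requests.post(url, data=form, headers={'User-Agent': 'Mozilla/5.0'})
print(resp.status_code)
print(resp.text[:500])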
Scrapy configuration notes

• COOKIES_ENABLED = True
• Override Scrapy's default user agent (a settings sketch for both notes follows below)
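A minimal sketch, assuming these live in settings.py (the user-agent string is the one from the captured request headers further down):

# settings.py -- sketch of the two configuration notes above
COOKIES_ENABLED = True  # explicit, though this is already Scrapy's default
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'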
The code below shows what I'm trying to do: establish cookies in the start request, then have the callbacks mimic the subsequent requests happening on the site. The problem I'm running into is that the cookies set after the start request are passed to the parse_page1 request but not to any of the later ones. I believe this is why the last request returns the message "The requested URL was rejected." instead of locations (even though the response is a 200). Note: parse_request() just prints out the response rather than actually parsing out locations; I'll update it once I know locations can be returned.
import json
import re

import scrapy


class MySpider(scrapy.Spider):
    name = 'valero'

    def start_requests(self):
        url = 'https://www.valero.com/en-us/ProductsAndServices/Consumers/StoreLocator'
        headers = {
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
            'DNT': '1',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'en-US,en;q=0.9',
            'If-Modified-Since': 'Tue, 18 Sep 2018 19:11:50 GMT',
        }
        yield scrapy.Request(url=url, headers=headers, callback=self.parse_page1)

    def parse_page1(self, response):
        r = response.text
        # The sender id is embedded in an inline script on the store locator page.
        senderid = re.findall(r'spAppIFrameSenderInfo\[0\] = new Array\("(.*?)"', r)[0]
        url = 'https://www.valero.com/en-us/_layouts/15/appredirect.aspx?redirect_uri=https%3A%2F%2Fvaleromaps%2Evalero%2Ecom%2Fhome%3FSPHostUrl%3Dhttps%253A%252F%252Fwww%252Evalero%252Ecom%252Fen%252Dus%26SPHostTitle%3DValero%2520%252D%26SPAppWebUrl%3D%22%22%26SPLanguage%3Den%252DUS%26SPClientTag%3D1%26SPProductNumber%3D15%252E0%252E5047%252E1000%26SenderId%3D{0}&client_id=i%3A0i%2Et%7Cms%2Esp%2Eext%7Cb238ea69%2D5f91%2D445c%2D8a7d%2Df55c52f4d807%408bf952c5%2Def34%2D4ac6%2D822d%2D099871ec78da&anon=1'.format(senderid)
        yield scrapy.Request(url=url, method='POST', callback=self.parse_page2,
                             meta={'senderid': senderid})

    def parse_page2(self, response):
        senderid = response.meta['senderid']
        url = 'https://valeromaps.valero.com/home?SPHostUrl=https%3A%2F%2Fwww%2Evalero%2Ecom%2Fen%2Dus&SPHostTitle=Valero%20%2D&SPAppWebUrl=%22%22&SPLanguage=en%2DUS&SPClientTag=1&SPProductNumber=15%2E0%2E5047%2E1000&SenderId={0}'.format(senderid)
        # Mimic the SharePoint app-redirect form post seen in DevTools.
        form = {
            'SPAppToken': '',
            'SPSiteUrl': 'https://www.valero.com/en-us',
            'SPSiteTitle': 'Valero -',
            'SPSiteLogoUrl': '',
            'SPSiteLanguage': 'en-US',
            'SPSiteCulture': 'en-US',
            'SPRedirectMessage': 'EndpointAuthorityMatches',
            'SPErrorCorrelationId': '',
            'SPErrorInfo': ''
        }
        yield scrapy.Request(url=url, method='POST', body=json.dumps(form),
                             callback=self.parse_page3, meta={'senderid': senderid})

    def parse_page3(self, response):
        url = 'https://valeromaps.valero.com/Home/Search?SPHostUrl=https%3A%2F%2Fwww.valero.com%2Fen-us'
        # Bounding box and center coordinates captured from the DevTools search.
        form = {
            'NEBound_Lat': '31.943824833980116',
            'NEBound_Long': '-94.08231139453125',
            'SWBound_Lat': '27.167727791447785',
            'SWBound_Long': '-103.12955260546875',
            'center_Lat': '29.555776312713952',
            'center_Long': '-98.605932',
        }
        yield scrapy.Request(url=url, method='POST', body=json.dumps(form),
                             callback=self.parse_request)

    def parse_request(self, response):
        print(response.text)
Is there a way to make sure the cookies persist in every callback? I believe that would help solve the problem, but if anyone sees a better way to obtain the cookies that need to be sent with the request that returns the data (#8 above), I'd greatly appreciate the feedback.
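(For what it's worth, one way to verify whether cookies actually persist between callbacks is Scrapy's COOKIES_DEBUG setting, which logs every Cookie/Set-Cookie header the cookie middleware handles; a minimal sketch:)

# Sketch: turn on cookie logging so cookie persistence between callbacks
# is visible in the crawl log (COOKIES_DEBUG is a standard Scrapy setting).
custom_settings = {
    'COOKIES_ENABLED': True,
    'COOKIES_DEBUG': True,
}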
Answer 0 (score: 0)
Actually, the problem isn't the cookies but the headers: you only use the custom headers in the first request.
yield scrapy.Request(
    url=url,
    headers=headers,  # <-- custom headers passed here
    callback=self.parse_page1)
# ...
yield scrapy.Request(  # <-- but none of the later requests get them
    url=url,
    method='POST',
    callback=self.parse_page2,
    meta={'senderid': senderid})
You can define them in custom_settings so that they apply to every request your spider makes:
class MySpider(scrapy.Spider):
    custom_settings = {
        'DEFAULT_REQUEST_HEADERS': {
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
            'DNT': '1',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'en-US,en;q=0.9',
            'If-Modified-Since': 'Tue, 18 Sep 2018 19:11:50 GMT',
        },
    }
More on custom_settings: https://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Spider.custom_settings
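If some requests eventually need headers that differ from the defaults, an alternative is to keep the header dict on the spider and pass it explicitly to every Request. A minimal sketch under that assumption (class name and callbacks here are illustrative, not from the original post):

import scrapy


class ValeroHeadersSpider(scrapy.Spider):
    name = 'valero_headers'
    # Shared header dict, passed explicitly instead of relying on
    # DEFAULT_REQUEST_HEADERS.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
        'Accept-Language': 'en-US,en;q=0.9',
    }

    def start_requests(self):
        url = 'https://www.valero.com/en-us/ProductsAndServices/Consumers/StoreLocator'
        yield scrapy.Request(url=url, headers=self.headers, callback=self.parse_page1)

    def parse_page1(self, response):
        # Each follow-up request has to pass the headers again explicitly.
        url = 'https://valeromaps.valero.com/Home/Search?SPHostUrl=https%3A%2F%2Fwww.valero.com%2Fen-us'
        yield scrapy.Request(url=url, method='POST', headers=self.headers,
                             callback=self.parse_result)

    def parse_result(self, response):
        print(response.status)

custom_settings is the simpler choice when every request should look the same; explicit per-request headers only earn their keep once individual requests need to differ.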