How can I scrape stock information from the <selfridges.com> website without Selenium?

Time: 2019-10-23 04:52:26

Tags: python web-crawler

Site URL: https://www.selfridges.com/GB/en/cat/giorgio-armani-lip-mastero-mattr-6-6ml_317-77011643-LB014200/

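Using Chrome -> F12 -> Network -> XHR, I can see the requests that return the price information and the stock information.

Price API URL: https://www.selfridges.com/api/cms/ecom/v1/GB/en/price/byId/317-77011643-LB014200

Stock API URL: https://www.selfridges.com/api/cms/ecom/v1/GB/en/stock/byId/317-77011643-LB014200

I can get the response content by requesting the price API URL directly, like this: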

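import requests
price_api_url = 'https://www.selfridges.com/api/cms/ecom/v1/GB/en/price/byId/317-77011643-LB014200'
headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder; the headers actually used were not shown
s = requests.Session()
response = s.get(price_api_url, headers=headers)
print(response.text)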

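However, this approach does not work for the stock URL: the request returns a 403 Forbidden status code.

I tried sending cookies as well, but the result was the same.

The result is the same even when I open the stock URL directly in the Chrome browser.

Possibly useful information:

I found the page source that contains the API templates, but I could not find where the values for {variantName} and {variantValue} come from.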

"@data_api":"
    {"apiKeyValue":"xjut2p34999bad9dx7y868ng",
     "apiKey":"Api-Key",
     "withCredentials":true,
     "priceApi":"/api/cms/ecom/v1/GB/en/price/byId/{partNumber}",
     "stockApi":"/api/cms/ecom/v1/GB/en/stock/byId/{partNumber}?option\u003d{variantName}\u0026optionValue\u003d{variantValue}",
     "cacheControl":"no-cache",
     "addToWishListApiUrl":"/api/cms/ecom/v1/GB/en/wishlist",
     "addToBagApiUrl":"/api/cms/ecom/v1/GB/en/cart"
}"
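
The \u003d and \u0026 sequences in stockApi are JSON escapes for = and &. Below is a minimal sketch of how the template could be expanded once the variant values are known. build_stock_url is a hypothetical helper, and 'Colour' / 'Red' are placeholder values for illustration only, not values taken from the site.

```python
import json
from urllib.parse import quote

# The stockApi entry as it appears in the page source, JSON escapes included.
RAW_TEMPLATE = r'"/api/cms/ecom/v1/GB/en/stock/byId/{partNumber}?option\u003d{variantName}\u0026optionValue\u003d{variantValue}"'

# json.loads decodes \u003d to "=" and \u0026 to "&".
TEMPLATE = json.loads(RAW_TEMPLATE)


def build_stock_url(part_number, variant_name, variant_value,
                    base='https://www.selfridges.com'):
    """Fill the template; each value is percent-encoded before insertion."""
    path = TEMPLATE.format(
        partNumber=quote(part_number, safe=''),
        variantName=quote(variant_name, safe=''),
        variantValue=quote(variant_value, safe=''),
    )
    return base + path


# 'Colour' and 'Red' are hypothetical placeholders.
print(build_stock_url('317-77011643-LB014200', 'Colour', 'Red'))
```

Whether the server requires the option and optionValue parameters is not known from the page source alone; the request shown in the answer below omits them.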

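1 Answer:

Answer 0 (score: 0)

In Chrome/Firefox you should check what else the browser sends with the request. The API may need special headers, for example the header that marks an XHR request ('X-Requested-With': 'XMLHttpRequest'). Or you may first have to GET the main page to obtain fresh cookies.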
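
A minimal sketch of that idea, assuming the server only needs the cookies set by the product page plus the XHR-style headers (this is not confirmed). make_session, build_api_headers and fetch_stock are hypothetical helper names; the Api-Key value is the one from the page source in the question.

```python
import requests


def make_session(user_agent):
    """Create a session that sends a browser-like User-Agent on every request."""
    session = requests.Session()
    session.headers['User-Agent'] = user_agent
    return session


def build_api_headers(referer, api_key):
    """Headers that make the request look like the page's own XHR call."""
    return {
        'Api-Key': api_key,
        'X-Requested-With': 'XMLHttpRequest',
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Referer': referer,
    }


def fetch_stock(session, product_url, stock_url, api_key):
    # Step 1: load the product page so the session stores any cookies it sets.
    session.get(product_url, timeout=10)
    # Step 2: call the stock API; the session sends the stored cookies itself.
    return session.get(stock_url,
                       headers=build_api_headers(product_url, api_key),
                       timeout=10)


# Usage (performs real network requests):
# s = make_session('Mozilla/5.0 (X11; Linux x86_64; rv:71.0) Gecko/20100101 Firefox/71.0')
# r = fetch_stock(
#     s,
#     'https://www.selfridges.com/GB/en/cat/giorgio-armani-lip-mastero-mattr-6-6ml_317-77011643-LB014200/',
#     'https://www.selfridges.com/api/cms/ecom/v1/GB/en/stock/byId/317-77011643-LB014200',
#     'xjut2p34999bad9dx7y868ng',
# )
# print(r.status_code, r.text)
```

Using one Session for both requests means the cookies never have to be copied by hand, so they stay fresh on every run.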

Firefox has developer tools similar to Chrome's, including an option to copy a request as a cURL command. Running that command in the console returns the stock data.

curl 'https://www.selfridges.com/api/cms/ecom/v1/GB/en/stock/byId/317-77011643-LB014200' -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:71.0) Gecko/20100101 Firefox/71.0' -H 'Accept: application/json, text/javascript, */*; q=0.01' -H 'Accept-Language: pl,en-US;q=0.7,en;q=0.3' --compressed -H 'Content-Type: application/json; charset=utf-8' -H 'Api-Key: xjut2p34999bad9dx7y868ng' -H 'cache-control: no-cache' -H 'X-Requested-With: XMLHttpRequest' -H 'Connection: keep-alive' -H 'Referer: https://www.selfridges.com/GB/en/cat/giorgio-armani-lip-mastero-mattr-6-6ml_317-77011643-LB014200/' -H 'Cookie: AWSELB=85FF15BB10593ECE847219C9B214EEC5BBD393B7301D90E17B625C66620D7473C3FCE779E5EA1D351A2192C6C975C128815AC60F1118B8968E03001896493C045071A25E98; SF_COUNTRY_LANG=GB_en; COOKIE_NOTICE_SEEN=seen; utag_main=v_id:016df6fcb41700231568089828b001044006200900c48$_sn:1$_ss:0$_pn:2%3Bexp-session$_st:1571808694713$ses_id:1571806819351%3Bexp-session; utag_chan={"channel":"","channel_set":"","channel_converted":false,"awc":""}; Apache=10.77.3.197.1571806819436981; JSESSIONID=0000FBk5q2nb8WGtpDUjLBiNvha:17re3pp2r; WC_PERSISTENT=EBTewrGMk86bvcN%2fwqrCZtv%2bnXk%3d%0a%3b2019%2d10%2d23+05%3a00%3a22%2e442%5f1571806819438%2d1407831%5f10052%5f1480243004%2c%2d1%2cGBP%5f10052; WC_SESSION_ESTABLISHED=true; WC_ACTIVEPOINTER=%2d1%2c10052; WC_AUTHENTICATION_1480243004=1480243004%2cQbYKoQJpwYcMM6iznWYL1ludFS8%3d; WC_USERACTIVITY_1480243004=1480243004%2c10052%2cnull%2cnull%2cnull%2cnull%2cnull%2cnull%2cnull%2cnull%2cpfXMuSmw4%2b86xW7eYpU03lFrlirAydf27cytgnreiETU0zdlaTYkdIvAFHFrHmqcOVjtNhcyBowU%0ah%2bD2jUFBMXetfiZdIXQuaegcWHNNUqlIHSvMQrpghGvwCVdLsi%2bVK5UuT9NrO2L6RLVuf2ROuIXl%0avrgeD6slXh2C9RTk%2fKYkbRFJrqWGbiO5BZCmcHU14xftVA%3d%3d; cmTPSet=Y; CoreID6=87385145971315718068242&ci=90262645; 90262645_clogin=v=7&l=62021491571806824206&e=1571808675410; SIGNUP_POPUP_SEEN=seen' -H 'DNT: 1'

Using https://curl.trillworks.com/ I converted the cURL command to Python requests code, and it also returns the stock data.

import requests

cookies = {
    'AWSELB': '85FF15BB10593ECE847219C9B214EEC5BBD393B7301D90E17B625C66620D7473C3FCE779E5EA1D351A2192C6C975C128815AC60F1118B8968E03001896493C045071A25E98',
    'SF_COUNTRY_LANG': 'GB_en',
    'COOKIE_NOTICE_SEEN': 'seen',
    'utag_main': 'v_id:016df6fcb41700231568089828b001044006200900c48$_sn:1$_ss:0$_pn:2%3Bexp-session$_st:1571808694713$ses_id:1571806819351%3Bexp-session',
    'utag_chan': '{"channel":"","channel_set":"","channel_converted":false,"awc":""}',
    'Apache': '10.77.3.197.1571806819436981',
    'JSESSIONID': '0000FBk5q2nb8WGtpDUjLBiNvha:17re3pp2r',
    'WC_PERSISTENT': 'EBTewrGMk86bvcN%2fwqrCZtv%2bnXk%3d%0a%3b2019%2d10%2d23+05%3a00%3a22%2e442%5f1571806819438%2d1407831%5f10052%5f1480243004%2c%2d1%2cGBP%5f10052',
    'WC_SESSION_ESTABLISHED': 'true',
    'WC_ACTIVEPOINTER': '%2d1%2c10052',
    'WC_AUTHENTICATION_1480243004': '1480243004%2cQbYKoQJpwYcMM6iznWYL1ludFS8%3d',
    'WC_USERACTIVITY_1480243004': '1480243004%2c10052%2cnull%2cnull%2cnull%2cnull%2cnull%2cnull%2cnull%2cnull%2cpfXMuSmw4%2b86xW7eYpU03lFrlirAydf27cytgnreiETU0zdlaTYkdIvAFHFrHmqcOVjtNhcyBowU%0ah%2bD2jUFBMXetfiZdIXQuaegcWHNNUqlIHSvMQrpghGvwCVdLsi%2bVK5UuT9NrO2L6RLVuf2ROuIXl%0avrgeD6slXh2C9RTk%2fKYkbRFJrqWGbiO5BZCmcHU14xftVA%3d%3d',
    'cmTPSet': 'Y',
    'CoreID6': '87385145971315718068242&ci=90262645',
    '90262645_clogin': 'v=7&l=62021491571806824206&e=1571808675410',
    'SIGNUP_POPUP_SEEN': 'seen',
}

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:71.0) Gecko/20100101 Firefox/71.0',
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Accept-Language': 'pl,en-US;q=0.7,en;q=0.3',
    'Content-Type': 'application/json; charset=utf-8',
    'Api-Key': 'xjut2p34999bad9dx7y868ng',
    'cache-control': 'no-cache',
    'X-Requested-With': 'XMLHttpRequest',
    'Connection': 'keep-alive',
    'Referer': 'https://www.selfridges.com/GB/en/cat/giorgio-armani-lip-mastero-mattr-6-6ml_317-77011643-LB014200/',
    'DNT': '1',
}

response = requests.get('https://www.selfridges.com/api/cms/ecom/v1/GB/en/stock/byId/317-77011643-LB014200', headers=headers, cookies=cookies)
print(response.text)

However, I don't know how long the server will keep accepting this code with these cookies. It may need fresh cookies when I run it later.


EDIT: A few hours later, the same code still gives me the data. Sometimes I can even get the result with nothing but the Api-Key header:

import requests

headers = { 'Api-Key': 'xjut2p34999bad9dx7y868ng' }

response = requests.get('https://www.selfridges.com/api/cms/ecom/v1/GB/en/stock/byId/317-77011643-LB014200', headers=headers)
print(response.text)

But sometimes it gives me <h1>Developer Inactive</h1>, so I'm not sure whether that is just a temporary problem on the server.
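
One way to cope with the occasional error page is to retry after a short pause. This is only a sketch, assuming the error is temporary and that the error page always contains the text "Developer Inactive". is_developer_inactive and fetch_with_retry are hypothetical helpers, and the attempt count and delay are arbitrary.

```python
import time


def is_developer_inactive(body):
    """True if the response body is the error page instead of stock data."""
    return 'Developer Inactive' in body


def fetch_with_retry(fetch, attempts=3, delay=2.0, sleep=time.sleep):
    """Call fetch() until it returns something other than the error page.

    fetch is any zero-argument callable that returns the response text.
    The last body is returned even if every attempt hit the error page.
    """
    body = ''
    for attempt in range(attempts):
        body = fetch()
        if not is_developer_inactive(body):
            return body
        if attempt < attempts - 1:
            sleep(delay)  # wait before the next attempt
    return body


# Usage with the request above (performs real network requests):
# text = fetch_with_retry(lambda: requests.get(url, headers=headers).text)
```

If the error page keeps coming back after several attempts, the problem is probably not temporary, and sending the full set of headers and cookies is the next thing to try.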