Selecting a store location when web scraping

Time: 2021-04-10 02:34:24

Tags: python-3.x web-scraping beautifulsoup python-requests

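I am scraping a grocery website (https://www.paknsaveonline.co.nz) to do some meal planning before I shop. Product prices vary by store location. I want to extract the prices for my local store (Albany).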

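I'm new to web scraping, but I assume my code must:

  1. Change the default store to my local store (Albany), using this URL: https://www.paknsaveonline.co.nz/CommonApi/Store/ChangeStore?storeId=65defcf2-bc15-490e-a84f-1f13b769cd22
  2. Maintain a requests "session" so that all my products are scraped from the same store's site.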

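My scraping code successfully scrapes a price for broccoli, but it doesn't match the price at my local store. At the time of posting, the scraped broccoli price is $1.99, but when I check the Albany store manually, the price is $0.99. I assume my code for switching to the correct store isn't working as intended.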

Can anyone point out what I'm doing wrong and suggest a solution?

Environment details:

  • requests==2.23.0
  • beautifulsoup4==4.6.3
  • Python 3.7.10

Code below, with an associated link to the Google Colab file.

import requests
from bs4 import BeautifulSoup as bs
import re

dollars_pattern = '>([0-9][0-9]?)'
cents_pattern = '>([0-9][0-9])'
url = 'https://www.paknsaveonline.co.nz/CommonApi/Store/ChangeStore?storeId=65defcf2-bc15-490e-a84f-1f13b769cd22'
header = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36", "X-Requested-With": "XMLHttpRequest"}

with requests.session() as s:
  #I assume this url changes the store (200 response)
  s.get(url)
  #use the same session to return broccoli price
  r = s.get('https://www.paknsaveonline.co.nz/product/5039956_ea_000pns?name=broccoli')
  soup = bs(r.content,'html.parser')
  cents =  str(soup.find_all('span', {'class': "fs-price-lockup__cents"}))
  dollars =  str(soup.find_all('span', {'class': "fs-price-lockup__dollars"}))
  centsprice =re.findall(cents_pattern, cents)
  dollarsprice = re.findall(dollars_pattern, dollars)
  print(dollarsprice, centsprice)

Google Colab file

1 answer:

Answer 0 (score: 1)

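From looking at the actual requests, you first need to get some cookies from the base URL; only then can you change the store for that session. You can't change the store by calling the ChangeStore URL directly. So call the base URL first, then the change-store URL, then the base URL again to get the $0.99 price.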

import requests
from bs4 import BeautifulSoup as bs
import re

dollars_pattern = '>([0-9][0-9]?)'
cents_pattern = '>([0-9][0-9])'


url = 'https://www.paknsaveonline.co.nz/CommonApi/Store/ChangeStore?storeId=65defcf2-bc15-490e-a84f-1f13b769cd22'
baseurl="https://www.paknsaveonline.co.nz/product/5039956_ea_000pns?name=broccoli"
header = {
  "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
  "X-Requested-With": "XMLHttpRequest"
}

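with requests.session() as s:
  # first visit the product page so the session picks up its cookies
  s.get(baseurl)
  # now the change-store URL takes effect for this session
  s.get(url)
  # use the same session to return the broccoli price for the selected store
  r = s.get(baseurl)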
  soup = bs(r.content,'html.parser')
  cents =  str(soup.find_all('span', {'class': "fs-price-lockup__cents"}))
  dollars =  str(soup.find_all('span', {'class': "fs-price-lockup__dollars"}))
  centsprice =re.findall(cents_pattern, cents)
  dollarsprice = re.findall(dollars_pattern, dollars)
  print(dollarsprice, centsprice)
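
The same three-step order can be wrapped in a small helper. This is only a sketch: the names `fetch_product_page` and `parse_price` are hypothetical, the regular expressions assume the price sits directly inside the `fs-price-lockup__dollars` and `fs-price-lockup__cents` spans used above, and the `raise_for_status()` checks are an addition that is not in the original code. The helper accepts any object with a `get(url)` method, so a `requests` session can be passed in.

```python
import re

# Class names are taken from the code above; the exact markup is an assumption.
DOLLARS_RE = re.compile(r'fs-price-lockup__dollars[^>]*>\s*(\d+)')
CENTS_RE = re.compile(r'fs-price-lockup__cents[^>]*>\s*(\d{2})')


def fetch_product_page(session, product_url, change_store_url):
    """Visit the product page, switch store, then fetch the product page again."""
    session.get(product_url).raise_for_status()       # 1. pick up the initial cookies
    session.get(change_store_url).raise_for_status()  # 2. switch store within that session
    response = session.get(product_url)               # 3. re-fetch the page for the chosen store
    response.raise_for_status()
    return response.text


def parse_price(html):
    """Return the first price found in the HTML as a float, or None if absent."""
    dollars = DOLLARS_RE.search(html)
    cents = CENTS_RE.search(html)
    if dollars is None or cents is None:
        return None
    return round(int(dollars.group(1)) + int(cents.group(1)) / 100, 2)


# Usage with a real session (needs the third-party requests package):
#   import requests
#   with requests.Session() as s:
#       html = fetch_product_page(s, baseurl, url)
#       print(parse_price(html))
```

Returning a single number instead of two lists of strings makes the result easier to compare across stores, and the status checks surface a failed store switch instead of silently falling back to the default store's price.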

Let me know if you have any issues :)
