Question

I'm trying to set up a web scraper for the following page: https://www.autozone.com/external-engine/oil-filter?pageNumber=1

from urllib.request import urlopen
from bs4 import BeautifulSoup as bs

# connect and download the HTML
data = 'https://www.autozone.com/motor-oil-and-transmission-fluid/engine-oil?pageNumber=1'
uclient = urlopen(data)
pagehtml = uclient.read()
uclient.close()
articles = bs(pagehtml, 'html.parser')

# split the page into one container per shelf item
containers = articles.find_all('div', {'class': 'shelfItem'})

However, when I try to get the price, nothing is found:

containers[0].find_all('div',{'class':'price'})

...even though inspecting the site in my browser shows the following:

<div class="price" id="retailpricediv_663653_0" style="height: 85px;">Price: <strong>$8.99</strong><br>

How can I grab that $8.99?

Thanks

Answer 1

You can get the price data you need by calling the API directly:

import requests

url = 'https://www.autozone.com/rest/bean/autozone/diy/commerce/pricing/PricingServices/retrievePriceAndAvailability?atg-rest-depth=2'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0'}
data = {'arg1': 6997, 'arg2':'', 'arg3': '663653,663636,663650,5531,663637,663639,644036,663658,663641,835241,663645,663642', 'arg4': ''}
response = requests.post(url, headers=headers, data=data).json()

for item in response['atgResponse']:
    print(item['retailPrice'])

This prints the retailPrice value of each item in the response.

To build the data dictionary, pass the store number as arg1 and the item IDs, joined into a comma-separated list, as arg3.
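As a minimal sketch of that step, the payload could be assembled by a small helper. The name `build_payload` and the sample IDs are illustrative assumptions; only the `arg1` to `arg4` keys come from the request shown above.

```python
def build_payload(store_number, item_ids):
    """Build the form data for the pricing request.

    Hypothetical helper: arg1 is the store number, arg3 is a
    comma-separated string of item IDs; arg2 and arg4 stay empty,
    mirroring the request shown above.
    """
    return {
        'arg1': str(store_number),
        'arg2': '',
        'arg3': ','.join(str(item_id) for item_id in item_ids),
        'arg4': '',
    }


payload = build_payload(6997, [663653, 663636, 663650])
print(payload['arg3'])
```

The resulting dictionary can be passed as `data=` to `requests.post` in place of the hard-coded one.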

You only need to get the value of arg1 once, but arg3 has to be extracted from each page:

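from bs4 import BeautifulSoup as bs

page_url = 'https://www.autozone.com/external-engine/oil-filter?pageNumber=1'
r = requests.get(page_url, headers=headers)
source = bs(r.text, 'html.parser')
# store number as shown on the page
arg1 = source.find('div', {'id': 'myStoreNum'}).text
# item IDs: each shelf item's id attribute with the 'azid' characters stripped
arg3 = ",".join([_id['id'].strip('azid') for _id in source.find_all('div', {'class': 'categorizedShelfItem'})])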

So you can now define data without hard-coding the values:

data = {'arg1': arg1, 'arg2':'', 'arg3': arg3, 'arg4': ''}
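One caveat about the `strip('azid')` call above: `str.strip` removes any of the characters a, z, i, d from both ends of the string, not the literal prefix. That is fine for purely numeric IDs, but a sketch like the following removes the prefix explicitly. It assumes the element ids look like `azid663653`, which is inferred from the snippet and not confirmed against the page; `item_id_from` is a hypothetical helper name.

```python
def item_id_from(element_id, prefix='azid'):
    """Return the item ID with a leading prefix removed, if present."""
    if element_id.startswith(prefix):
        return element_id[len(prefix):]
    return element_id


# str.strip treats its argument as a set of characters, not a prefix:
print('azid12ad'.strip('azid'))    # -> 12
print(item_id_from('azid12ad'))    # -> 12ad
print(item_id_from('azid663653'))  # -> 663653
```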

To get the values from the next page, just change pageNumber=1 to pageNumber=2 in page_url; the rest of the code stays the same...
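The page-number change can be sketched with the standard library. `page_url_for` is a hypothetical helper, and the page range below is an arbitrary assumption, since nothing here says how many pages exist.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def page_url_for(url, page_number):
    """Return url with its pageNumber query parameter set to page_number."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    query['pageNumber'] = [str(page_number)]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))


base_url = 'https://www.autozone.com/external-engine/oil-filter?pageNumber=1'
for page_number in range(1, 4):  # assumed range; adjust to the real page count
    print(page_url_for(base_url, page_number))
```

Each generated URL would then be fetched with `requests.get` as above, and arg3 extracted from that page.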

Answer 2

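I think the prices are loaded by JavaScript, so you need something like Selenium to make sure the values are present (or the API call shown in the other answer!).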

from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd

driver = webdriver.Chrome()
driver.get("https://www.autozone.com/motor-oil-and-transmission-fluid/engine-oil?pageNumber=1")
# find_elements_by_css_selector was removed in Selenium 4.3; use find_elements with By
products = driver.find_elements(By.CSS_SELECTOR, '.prodName')
prices = driver.find_elements(By.CSS_SELECTOR, '.price[id*=retailpricediv]')

productList = []
priceList = []
for product, price in zip(products, prices):
    productList.append(product.text)
    # keep the first line of the price text and drop the "Price: " label
    priceList.append(price.text.split('\n')[0].replace('Price: ', ''))

df = pd.DataFrame({'Product': productList, 'Price': priceList})
print(df)

driver.quit()
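The scraped prices are still text such as `$8.99`. If numeric values are needed, a small parser along these lines could convert them. `parse_price` is a hypothetical helper, and it assumes US-style prices with a dollar sign and optional thousands separators.

```python
import re
from decimal import Decimal

PRICE_PATTERN = re.compile(r'\$\s*(\d[\d,]*(?:\.\d+)?)')


def parse_price(text):
    """Extract the first dollar amount from text as a Decimal, or None."""
    match = PRICE_PATTERN.search(text)
    if match is None:
        return None
    return Decimal(match.group(1).replace(',', ''))


print(parse_price('Price: $8.99'))   # -> 8.99
print(parse_price('no price here'))  # -> None
```

Decimal is used here instead of float so that currency amounts are not subject to binary rounding.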

Answer 3

There is more than one way to do this. Here is another approach using Selenium:

from selenium import webdriver
from selenium.webdriver.common.by import By
from contextlib import closing

with closing(webdriver.Chrome()) as driver:
    driver.get("https://www.autozone.com/external-engine/oil-filter?pageNumber=1")
    # find_elements_by_css_selector was removed in Selenium 4.3; use find_elements with By
    for items in driver.find_elements(By.CSS_SELECTOR, "[typeof='Product']"):
        price = items.find_element(By.CSS_SELECTOR, '.price > strong').text
        print(price)

Output:

$8.99
$8.99
$10.99
$8.99
$8.99

And so on...

BeautifulSoup does not pick up individual tags
