I am trying to set up a web scraper for the following page: https://www.autozone.com/external-engine/oil-filter?pageNumber=1
from urllib.request import urlopen
from bs4 import BeautifulSoup as bs

# connect and download the HTML
data = 'https://www.autozone.com/motor-oil-and-transmission-fluid/engine-oil?pageNumber=1'
uclient = urlopen(data)
pagehtml = uclient.read()
uclient.close()
articles = bs(pagehtml, 'html.parser')
# separate the data by shop item
containers = articles.find_all('div', {'class': 'shelfItem'})
However, when I try to get the price, nothing is found:
containers[0].find_all('div',{'class':'price'})
...yet when I inspect the site in my browser, it shows the following:
<div class="price" id="retailpricediv_663653_0" style="height: 85px;">Price: <strong>$8.99</strong><br>
How can I grab that $8.99?
Thanks
Answer 0: (score: 2)
You can get the price data you need by calling the API directly:
import requests
url = 'https://www.autozone.com/rest/bean/autozone/diy/commerce/pricing/PricingServices/retrievePriceAndAvailability?atg-rest-depth=2'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0'}
data = {'arg1': 6997, 'arg2':'', 'arg3': '663653,663636,663650,5531,663637,663639,644036,663658,663641,835241,663645,663642', 'arg4': ''}
response = requests.post(url, headers=headers, data=data).json()
for item in response['atgResponse']:
    print(item['retailPrice'])
Output:
8.99
8.99
10.99
8.99
8.99
8.99
8.99
8.99
8.99
8.99
8.99
8.99
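The loop above raises a KeyError if an item comes back without a price. Here is a minimal defensive sketch, assuming the response keeps the shape used above (a list under 'atgResponse' whose items carry 'retailPrice'). The helper name extract_prices and the sample payload are made up for illustration and are not part of the site's API:

```python
def extract_prices(payload):
    """Return the 'retailPrice' of every item that has one."""
    items = payload.get("atgResponse") or []
    return [
        item["retailPrice"]
        for item in items
        if isinstance(item, dict) and item.get("retailPrice") is not None
    ]


# Hypothetical payload, shaped like the response used above.
sample = {"atgResponse": [{"retailPrice": 8.99}, {}, {"retailPrice": 10.99}]}
print(extract_prices(sample))
```

Items without a price are skipped rather than reported, which may or may not be what you want when matching prices back to product IDs.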
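To create the data dictionary, you need to pass the store number as arg1 and a comma-separated list of the item IDs as arg3.
You only need to fetch the value of arg1 once, but arg3 should be extracted on every page: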
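from bs4 import BeautifulSoup as bs

page_url = 'https://www.autozone.com/external-engine/oil-filter?pageNumber=1'
r = requests.get(page_url, headers=headers)
source = bs(r.text, 'html.parser')
arg1 = source.find('div', {'id': 'myStoreNum'}).text
# strip('azid') removes the characters a, z, i, d from both ends of each id, leaving the numeric part
arg3 = ",".join([_id['id'].strip('azid') for _id in source.find_all('div', {'class': 'categorizedShelfItem'})])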
So you can now define data without hard-coding the values:
data = {'arg1': arg1, 'arg2':'', 'arg3': arg3, 'arg4': ''}
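One caveat about strip('azid') above: it removes any of the characters a, z, i, d from both ends, which is fine for purely numeric IDs but is not the same as removing a prefix. Below is a small sketch that builds the same dictionary with an explicit prefix check. It assumes the div ids look like "azid663653", which is only implied by the code above. The helper names and sample ids are hypothetical:

```python
def item_id_from_div_id(div_id, prefix="azid"):
    """Drop a leading prefix; unlike str.strip, other characters are never touched."""
    return div_id[len(prefix):] if div_id.startswith(prefix) else div_id


def build_payload(store_number, div_ids):
    """Build the form data: store number as arg1, comma-joined item IDs as arg3."""
    item_ids = [item_id_from_div_id(d) for d in div_ids]
    return {
        "arg1": str(store_number).strip(),  # tolerate whitespace around the number
        "arg2": "",
        "arg3": ",".join(item_ids),
        "arg4": "",
    }


# Hypothetical div ids, assuming the "azid<number>" pattern.
print(build_payload(" 6997 ", ["azid663653", "azid5531"]))
```

The result can be passed as data= to requests.post exactly like the hand-written dictionary.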
To get the values from the next page, just change pageNumber=1 to pageNumber=2 in page_url. The rest of the code stays the same...
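If you want to walk several pages in a row, the page URLs can be generated instead of edited by hand. This is a rough sketch: the page range is an assumption (the number of pages is not stated anywhere here), and page_urls is a made-up helper name:

```python
from urllib.parse import urlencode

BASE_URL = "https://www.autozone.com/external-engine/oil-filter"


def page_urls(first, last):
    """Yield the listing URL for each page number from first to last, inclusive."""
    for number in range(first, last + 1):
        yield f"{BASE_URL}?{urlencode({'pageNumber': number})}"


# Assumed range; check the site for the real number of pages.
for url in page_urls(1, 3):
    print(url)
```

Each generated URL would then be used as page_url, with arg3 extracted again for that page.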
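Answer 1: (score: 2)
I think the prices are loaded by JavaScript, so you need something like selenium to make sure the values are present (or the API call shown in the other answer!)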
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd

driver = webdriver.Chrome()
driver.get("https://www.autozone.com/motor-oil-and-transmission-fluid/engine-oil?pageNumber=1")
# find_elements(By.CSS_SELECTOR, ...) replaces find_elements_by_css_selector, which newer Selenium releases no longer provide
products = driver.find_elements(By.CSS_SELECTOR, '.prodName')
prices = driver.find_elements(By.CSS_SELECTOR, '.price[id*=retailpricediv]')
productList = []
priceList = []
for product, price in zip(products, prices):
    productList.append(product.text)
    # keep only the first line of the price text and drop the "Price: " label
    priceList.append(price.text.split('\n')[0].replace('Price: ', ''))
df = pd.DataFrame({'Product': productList, 'Price': priceList})
print(df)
driver.quit()
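Before reaching for a browser, you can test the "loaded by JavaScript" guess by checking what the downloaded HTML actually contains. The sketch below uses only the standard library and collects the text inside every div with class price. The "rendered" snippet is the one from the question with a closing tag added; the "static" snippet is a hypothetical example of what the raw download might look like. PriceDivText and price_div_texts are illustrative names:

```python
from html.parser import HTMLParser


class PriceDivText(HTMLParser):
    """Collect the text found inside every <div class="price"> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # > 0 while inside a price div (counts nested divs)
        self.texts = []  # one string per price div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1
        elif "price" in classes:
            self.depth = 1
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts[-1] += data


def price_div_texts(html):
    """Return the stripped text of each price div found in html."""
    parser = PriceDivText()
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.texts]


rendered = ('<div class="price" id="retailpricediv_663653_0" style="height: 85px;">'
            'Price: <strong>$8.99</strong><br></div>')
# Hypothetical raw download: the price div exists but has not been filled in yet.
static = '<div class="shelfItem"><div class="price" id="retailpricediv_663653_0"></div></div>'

print(price_div_texts(rendered))
print(price_div_texts(static))
```

An empty list or empty strings for the downloaded page, next to a filled-in value in the browser, would point to the price being added after the page loads.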
Answer 2: (score: 1)
There is more than one way to go about this. Here is another approach using selenium:
from contextlib import closing
from selenium import webdriver
from selenium.webdriver.common.by import By

# closing() calls driver.close() when the block exits
with closing(webdriver.Chrome()) as driver:
    driver.get("https://www.autozone.com/external-engine/oil-filter?pageNumber=1")
    for items in driver.find_elements(By.CSS_SELECTOR, "[typeof='Product']"):
        price = items.find_element(By.CSS_SELECTOR, '.price > strong').text
        print(price)
Output:
$8.99
$8.99
$10.99
$8.99
$8.99
And so on...
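The values printed above are strings such as $8.99. If you need numbers for sorting or arithmetic, here is one possible way to convert them, assuming plain dollar amounts without thousands separators. parse_price is a made-up helper name:

```python
import re
from decimal import Decimal

# A dollar sign followed by digits and an optional one- or two-digit fraction.
PRICE_RE = re.compile(r"\$\s*(\d+(?:\.\d{1,2})?)")


def parse_price(text):
    """Return the first dollar amount in text as a Decimal, or None if absent."""
    match = PRICE_RE.search(text)
    return Decimal(match.group(1)) if match else None


print(parse_price("$8.99"))
print(parse_price("Price: $10.99\nsome other text"))
```

Decimal is used instead of float so that amounts like 8.99 stay exact.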