Question

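I'm writing a Python program that uses web scraping to check whether an item is in stock. It's a Python 3.9 script that uses Beautiful Soup 4 and requests to scrape the item's availability. Eventually I'd like the program to search multiple websites, and multiple links within each site, so that I don't have to run a bunch of scripts at once. The expected output of the program is: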
200
0
In Stock
But I'm getting:
200
[]
Out Of Stock

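The "200" indicates whether the code can reach the server; 200 is the expected result. The "0" is a boolean showing whether the item is in stock, and the expected response for an in-stock item is "0". I've tried both an in-stock item and an out-of-stock item, and both give the same response: 200 [] Out Of Stock. I suspect something is wrong with out_of_stock_divs in def check_item_in_stock, because that's where the [] comes from, and it's the part that finds the item's availability.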

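I had the code working earlier yesterday, and I kept adding features (such as scraping multiple links and different websites), but then it broke and I can't get it back to a working state.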

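Here is the program code. (I did base this code on Mr. Arya Boudaie's code from his website, https://aryaboudaie.com/. I got rid of his text notifications, though, because I plan to just run this on a spare computer next to me and have it play a loud sound, which I'll implement later.)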

import time  # needed for time.sleep() in the polling loop

from bs4 import BeautifulSoup
import requests

def get_page_html(url):
    headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36"}
    page = requests.get(url, headers=headers)
    print(page.status_code)
    return page.content


def check_item_in_stock(page_html):
    soup = BeautifulSoup(page_html, 'html.parser')
    out_of_stock_divs = soup.findAll("text", {"class": "product-inventory"})
    print(out_of_stock_divs)
    return len(out_of_stock_divs) != 0

def check_inventory():
    url = "https://www.newegg.com/hp-prodesk-400-g5-nettop-computer/p/N82E16883997492?Item=9SIA7ABC996974"
    page_html = get_page_html(url)
    if check_item_in_stock(page_html):
        print("In stock")
    else:
        print("Out of stock")

while True:
    check_inventory()
    time.sleep(60)

Answer 1

The product's stock status is inside a <div> tag, not a <text> tag:

import requests
from bs4 import BeautifulSoup


def get_page_html(url):
    headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36"}
    page = requests.get(url, headers=headers)
    print(page.status_code)
    return page.content


def check_item_in_stock(page_html):
    soup = BeautifulSoup(page_html, 'html.parser')
    out_of_stock_divs = soup.findAll("div", {"class": "product-inventory"})  # <--- change "text" to div
    print(out_of_stock_divs)
    return len(out_of_stock_divs) != 0

def check_inventory():
    url = "https://www.newegg.com/hp-prodesk-400-g5-nettop-computer/p/N82E16883997492?Item=9SIA7ABC996974"
    page_html = get_page_html(url)
    if check_item_in_stock(page_html):
        print("In stock")
    else:
        print("Out of stock")

check_inventory()

Prints:

200
[<div class="product-inventory"><strong>In stock.</strong></div>]
In stock

Note: the site's HTML markup may have changed in the past. I would modify the check_item_in_stock function:

def check_item_in_stock(page_html):
    soup = BeautifulSoup(page_html, 'html.parser')
    out_of_stock_div = soup.find("div", {"class": "product-inventory"})
    return out_of_stock_div.text == "In stock."
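
One caveat with the version above: soup.find() returns None when no matching div exists, so .text would raise an AttributeError if the markup changes again. Below is a minimal sketch of a more defensive variant, tried against static HTML snippets rather than the live site. The snippet constants and the "OUT OF STOCK." wording are assumptions made up for illustration; only the "In stock." markup comes from the output shown earlier.

```python
from bs4 import BeautifulSoup


def check_item_in_stock(page_html):
    """Return True only if the inventory div exists and its text starts with "In stock"."""
    soup = BeautifulSoup(page_html, "html.parser")
    inventory_div = soup.find("div", {"class": "product-inventory"})
    if inventory_div is None:
        # Element missing (for example, the site changed its markup):
        # report "not in stock" instead of raising AttributeError.
        return False
    # Normalise whitespace and case before comparing.
    text = inventory_div.get_text(strip=True).lower()
    return text.startswith("in stock")


# Hypothetical snippets standing in for downloaded pages.
IN_STOCK_HTML = '<div class="product-inventory"><strong>In stock.</strong></div>'
OUT_OF_STOCK_HTML = '<div class="product-inventory"><strong>OUT OF STOCK.</strong></div>'
NO_DIV_HTML = "<p>Nothing here</p>"

if __name__ == "__main__":
    for sample in (IN_STOCK_HTML, OUT_OF_STOCK_HTML, NO_DIV_HTML):
        print(check_item_in_stock(sample))
```

Testing against saved snippets like this also makes it easier to notice when a change to the parsing logic breaks something, without hitting the server every time.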

Answer 2

You can probably do the legwork in a very readable and slightly more elegant way with the lxml library:

import config  # local module (not shown), assumed to define upstream_url and user_agent
import requests
from lxml import html

def in_stock(url: str = config.upstream_url) -> tuple:
    """ Check the website for stock status """
    page = requests.get(url, headers={'User-agent': config.user_agent})
    proc_html = html.fromstring(page.text)
    checkout_button = proc_html.get_element_by_id('addToCart')
    # requests exposes the HTTP status as `status_code`, not `status`
    return (page.status_code, 'disabled' not in checkout_button.get('class', ''))

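I'd suggest using XPath to identify the element on the page that you want to check. That makes it easy to change when the upstream website updates (which is outside your control), because you only need to adjust the XPath string to reflect the upstream change: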

# change me, if upstream web content changes
# (note: two slashes; a triple slash is not a valid XPath expression)
xpath_selector = r'''//button[@id='addToCart']'''
checkout_button = proc_html.xpath(xpath_selector)[0]
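
To make the XPath idea concrete, here is a small self-contained sketch that runs against static HTML instead of a live page. The sample pages and the button_enabled name are hypothetical; the addToCart id and the "disabled" class check follow the code above. Unlike the substring test above, this version splits the class attribute into separate class names, which is a design choice of the sketch rather than something the site requires.

```python
from lxml import html

# Change me if the upstream web content changes.
XPATH_SELECTOR = "//button[@id='addToCart']"


def button_enabled(page_text, xpath_selector=XPATH_SELECTOR):
    """Return True if the selected button exists and lacks the 'disabled' class."""
    tree = html.fromstring(page_text)
    matches = tree.xpath(xpath_selector)
    if not matches:
        # No such element: treat the item as unavailable rather than raising IndexError.
        return False
    css_classes = matches[0].get("class", "").split()
    return "disabled" not in css_classes


# Hypothetical pages used only for illustration.
ENABLED_PAGE = '<html><body><button id="addToCart" class="btn btn-primary">Add to cart</button></body></html>'
DISABLED_PAGE = '<html><body><button id="addToCart" class="btn disabled">Sold out</button></body></html>'
MISSING_PAGE = "<html><body><p>No button here</p></body></html>"

if __name__ == "__main__":
    for page in (ENABLED_PAGE, DISABLED_PAGE, MISSING_PAGE):
        print(button_enabled(page))
```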

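As an aside, stylistically speaking, some purists would suggest avoiding side effects when writing functions (i.e. using print() inside a function). You can return a tuple with the status code and the result instead. That's a really nice feature of Python.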
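
As a rough illustration of that style, the sketch below keeps the check free of print() calls and returns a named tuple, leaving all printing to the caller. Every name here (StockResult, check_stock, fake_fetch, parse_in_stock, report) is made up for the example, and fake_fetch merely stands in for a real requests.get call so the sketch runs offline.

```python
from typing import NamedTuple


class StockResult(NamedTuple):
    status_code: int
    in_stock: bool


def check_stock(url, fetch, parse):
    """Fetch a page and return (status_code, in_stock) without printing anything."""
    status_code, body = fetch(url)
    return StockResult(status_code, status_code == 200 and parse(body))


def fake_fetch(url):
    # Stand-in for a real HTTP request; always "succeeds" with an in-stock body.
    return 200, "In stock."


def parse_in_stock(body):
    return body.strip().lower().startswith("in stock")


def report(result):
    """Format a result for display; the only place that decides on wording."""
    label = "In stock" if result.in_stock else "Out of stock"
    return f"{result.status_code} {label}"


if __name__ == "__main__":
    result = check_stock("https://example.com/item", fake_fetch, parse_in_stock)
    status_code, in_stock = result  # unpacks like a plain tuple
    print(report(result))
```

Because the check only returns data, the same function can be reused for several URLs or sites, and the caller decides whether to print, log, or play a sound.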

Using web scraping to check whether an item is in stock
