Determine whether a website is an online shop

Time: 2019-03-28 19:34:24

Tags: python python-3.x selenium web-scraping beautifulsoup

I am trying to determine, for a list of websites, whether each site is an online shop.

It seems that most online shops have:

  • an a tag with the word "cart" in its href attribute
  • an li tag assigned to a class whose name contains the word "cart"

I think I need to use a regular expression and then tell BeautifulSoup's find method to search the HTML for that regex in the a and li tags. How do I do this?
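
What I have in mind is roughly the following, although I am not sure this is the right way to pass a regex for the class name (the class_=cart part is just my guess):

import re
from bs4 import BeautifulSoup

# Tiny example document with a cart link and a cart list item
html = '<a href="/cart">Cart</a><li class="mini-cart-item">1 item</li>'
soup = BeautifulSoup(html, 'html.parser')

cart = re.compile('cart', re.IGNORECASE)
# Look for an a tag whose href matches the regex,
# or an li tag with a class name that matches it
is_shop = bool(soup.find('a', href=cart) or soup.find('li', class_=cart))
print(is_shop)  # True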

So far, the following code searches the HTML for an a tag whose href contains "cart".

Code

import re
from bs4 import BeautifulSoup
from selenium import webdriver

websites = [
    'https://www.nike.com/',
    'https://www.youtube.com/',
    'https://www.google.com/',
    'https://www.amazon.com/',
    'https://www.gamestop.com/'
]
shops = []

# Configure headless Chrome and suppress console logging noise
options = webdriver.ChromeOptions()
options.headless = True
options.add_argument('log-level=3')
with webdriver.Chrome(options=options) as driver:
    for url in websites:
        driver.get(url)
        # Regex matching any href that contains "cart" (case-insensitive)
        cart = re.compile('.*cart.*', re.IGNORECASE)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        if soup.find('a', href=cart):
            shops.append(url)

print('\nSHOPS FOUND:')
for shop in shops:
    print(shop)

Output:

SHOPS FOUND:
https://www.nike.com/
https://www.amazon.com/

1 Answer:

Answer 1 (score: 0)

You can use the contains operator (*=) with CSS attribute selectors to specify that either the class attribute or the href attribute contains the substring "cart". Combine the two (class and href) with "Or" syntax, i.e. a comma-separated list of selectors. TODO: you might consider adding a wait condition to ensure the li and a tag elements are present first.

from bs4 import BeautifulSoup
from selenium import webdriver

websites = [
    'https://www.nike.com/',
    'https://www.youtube.com/',
    'https://www.google.com/',
    'https://www.amazon.com/',
    'https://www.gamestop.com/'
]
shops = []

# Configure headless Chrome and suppress console logging noise
options = webdriver.ChromeOptions()
options.headless = True
options.add_argument('log-level=3')

with webdriver.Chrome(options=options) as driver:
    for url in websites:
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        items = soup.select('a[href*=cart], li[class*=cart]')
        if len(items) > 0:
            shops.append(url)
print('\nSHOPS FOUND:')
for shop in shops:
    print(shop)
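
As a rough sketch of the TODO above, reusing the setup from the snippet, a wait before grabbing page_source might look like this (the 15-second timeout and waiting for any a or li element are arbitrary choices):

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

with webdriver.Chrome(options=options) as driver:
    for url in websites:
        driver.get(url)
        try:
            # Wait up to 15 seconds for at least one a or li element to be present
            WebDriverWait(driver, 15).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, 'a, li'))
            )
        except TimeoutException:
            pass  # parse whatever has loaded so far
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        if soup.select('a[href*=cart], li[class*=cart]'):
            shops.append(url)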