Determine whether a website is an online shop

Time: 2019-03-28 19:34:24

Tags: python python-3.x selenium web-scraping beautifulsoup

I am trying to determine, for a list of websites, whether each site is an online shop.

It seems that most online shops have:

  • an a tag with the word "cart" in its href attribute
  • an li tag assigned to a class whose name contains the word "cart"

I think I need to use a regular expression and then tell BeautifulSoup's find method to search the HTML for that regex in the a and li tags. How do I do this?
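
What I have in mind is roughly the following, although I am not sure this is the right way to pass a regex for the class name (the class_=cart part is just my guess):

import re
from bs4 import BeautifulSoup

# Tiny example document with a cart link and a cart list item
html = '<a href="/cart">Cart</a><li class="mini-cart-item">1 item</li>'
soup = BeautifulSoup(html, 'html.parser')

cart = re.compile('cart', re.IGNORECASE)
# Look for an a tag whose href matches the regex,
# or an li tag with a class name that matches it
is_shop = bool(soup.find('a', href=cart) or soup.find('li', class_=cart))
print(is_shop)  # True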

So far, the following code searches the HTML for an a tag whose href contains "cart".

Code

import re
from bs4 import BeautifulSoup
from selenium import webdriver

websites = [
    'https://www.nike.com/',
    'https://www.youtube.com/',
    'https://www.google.com/',
    'https://www.amazon.com/',
    'https://www.gamestop.com/'
]
shops = []

# Configure headless Chrome and suppress console logging noise
options = webdriver.ChromeOptions()
options.headless = True
options.add_argument('log-level=3')
with webdriver.Chrome(options=options) as driver:
    for url in websites:
        driver.get(url)
        # Regex matching any href that contains "cart" (case-insensitive)
        cart = re.compile('.*cart.*', re.IGNORECASE)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        if soup.find('a', href=cart):
            shops.append(url)

print('\nSHOPS FOUND:')
for shop in shops:
    print(shop)

Output:

SHOPS FOUND:
https://www.nike.com/
https://www.amazon.com/

1 Answer:

Answer 1 (score: 0)

You can use the contains operator (*=) with CSS attribute selectors to specify that either the class attribute or the href attribute contains the substring "cart". Combine the two (class and href) with "Or" syntax, i.e. a comma-separated list of selectors. TODO: you might consider adding a wait condition to ensure the li and a tag elements are present first.

from bs4 import BeautifulSoup
from selenium import webdriver

websites = [
    'https://www.nike.com/',
    'https://www.youtube.com/',
    'https://www.google.com/',
    'https://www.amazon.com/',
    'https://www.gamestop.com/'
]
shops = []

# Configure headless Chrome and suppress console logging noise
options = webdriver.ChromeOptions()
options.headless = True
options.add_argument('log-level=3')

with webdriver.Chrome(options=options) as driver:
    for url in websites:
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        items = soup.select('a[href*=cart], li[class*=cart]')
        if len(items) > 0:
            shops.append(url)
print('\nSHOPS FOUND:')
for shop in shops:
    print(shop)
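
As a rough sketch of the TODO above, reusing the setup from the snippet, a wait before grabbing page_source might look like this (the 15-second timeout and waiting for any a or li element are arbitrary choices):

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

with webdriver.Chrome(options=options) as driver:
    for url in websites:
        driver.get(url)
        try:
            # Wait up to 15 seconds for at least one a or li element to be present
            WebDriverWait(driver, 15).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, 'a, li'))
            )
        except TimeoutException:
            pass  # parse whatever has loaded so far
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        if soup.select('a[href*=cart], li[class*=cart]'):
            shops.append(url)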