Python - Web scraping category names from Walmart

Date: 2021-04-28 20:17:24

Tags: python selenium web-scraping

I am trying to get the department names from this Walmart link. As you can see, the Departments panel on the left initially lists 7 departments (Chocolate Cookies, Cookies, Butter Cookies, ...). When I click See All Departments, 9 more categories are added, so the count becomes 16. I am trying to fetch all 16 departments automatically. I wrote this code:

from selenium import webdriver

n_links = []

driver = webdriver.Chrome(executable_path='D:/Desktop/demo/chromedriver.exe')
url = "https://www.walmart.com/browse/snacks-cookies-chips/cookies/976759_976787_1001391" 
driver.get(url)

search = driver.find_element_by_xpath("//*[@id='Departments']/div/div/ul").text
driver.find_element_by_xpath("//*[@id='Departments']/div/div/button/span").click()
search2 = driver.find_element_by_xpath("//*[@id='Departments']/div/div/div/div").text

sep = search.split('\n')
sep2 = search2.split('\n')

lngth = len(sep)
lngth2 = len(sep2)

for i in range (1,lngth):
    path = "//*[@id='Departments']/div/div/ul/li"+"["+ str(i) + "]/a"
    nav_links = driver.find_element_by_xpath(path).get_attribute('href')
    n_links.append(nav_links)
    
for i in range (1,lngth2):
    path = "//*[@id='Departments']/div/div/div/div/ul/li"+"["+ str(i) + "]/a"
    nav_links2 = driver.find_element_by_xpath(path).get_attribute('href')
    n_links.append(nav_links2)   
    
print(n_links)
print(len(n_links))

When I run the code, I can see the links in the n_links list at the end. The problem is that sometimes it contains 13 links and sometimes 14. It should be 16, but I have never seen 16, only 13 or 14. I tried adding time.sleep(3) before the search2 line, but it didn't help. Can you help me?
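The fluctuating 13/14 count suggests the expanded list is still rendering when the second XPath is read, so a fixed time.sleep races the page. As a sketch (the wait_until helper below is hypothetical, not part of the question's code; Selenium's own WebDriverWait, used in answer 3 below, is the idiomatic equivalent), one could poll until the expected number of elements is present before reading them:

```python
import time

def wait_until(predicate, timeout=10.0, interval=0.5):
    """Poll predicate() until it returns True or the timeout expires.

    Returns the final truth value of predicate().
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return bool(predicate())

# Hypothetical usage against the page (driver is your webdriver instance):
# wait_until(lambda: len(driver.find_elements_by_xpath(
#     "//*[@id='Departments']/div/div/div/div/ul/li")) >= 9)
```

The point is to wait on a condition (element count) rather than on wall-clock time, which is exactly what WebDriverWait does internally.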

4 answers:

Answer 0: (score: 0)

I think you are making this more complicated than it needs to be. You are right that after clicking the button you may need to wait before you can fetch the departments.

# Get all the departments currently shown
departments = driver.find_elements_by_xpath("//li[contains(@class,'department')]")

# Click on the "show all departments" button
driver.find_element_by_xpath("//button[@data-automation-id='button']//span[contains(text(),'all Departments')]").click()

# Get the full list of departments now shown
departments = driver.find_elements_by_xpath("//li[contains(@class,'department')]")

# Iterate through the departments and print each name
for d in departments:
    print(d.text)

Answer 1: (score: 0)

To print all 16 departments, you can try selecting them with this CSS selector: .collapsible-content > ul a, .sometimes-shown a

Applied to your example:

from selenium import webdriver

driver = webdriver.Chrome()
url = (
    "https://www.walmart.com/browse/snacks-cookies-chips/cookies/976759_976787_1001391"
)
driver.get(url)

# expand the department list
driver.find_element_by_xpath("//*[@id='Departments']/div/div/button/span").click()

all_departments = [
    link.get_attribute("href")
    for link in driver.find_elements_by_css_selector(
        ".collapsible-content > ul a, .sometimes-shown a"
    )
]

print(len(all_departments))
print(all_departments)

Output:

16
['https://www.walmart.com/browse/food/chocolate-cookies/976759_976787_1001391_4007138', 'https://www.walmart.com/browse/food/cookies/976759_976787_1001391_8331066', 'https://www.walmart.com/browse/food/butter-cookies/976759_976787_1001391_7803640', 'https://www.walmart.com/browse/food/shortbread-cookies/976759_976787_1001391_8026949', 'https://www.walmart.com/browse/food/coconut-cookies/976759_976787_1001391_6970757', 'https://www.walmart.com/browse/food/healthy-cookies/976759_976787_1001391_7466302', 'https://www.walmart.com/browse/food/keebler-cookies/976759_976787_1001391_3596825', 'https://www.walmart.com/browse/food/biscotti/976759_976787_1001391_2224095', 'https://www.walmart.com/browse/food/gluten-free-cookies/976759_976787_1001391_4362193', 'https://www.walmart.com/browse/food/molasses-cookies/976759_976787_1001391_3338971', 'https://www.walmart.com/browse/food/peanut-butter-cookies/976759_976787_1001391_6460174', 'https://www.walmart.com/browse/food/pepperidge-farm-cookies/976759_976787_1001391_2410932', 'https://www.walmart.com/browse/food/snickerdoodle-cookies/976759_976787_1001391_8926167', 'https://www.walmart.com/browse/food/sugar-free-cookies/976759_976787_1001391_5314659', 'https://www.walmart.com/browse/food/tate-s-cookies/976759_976787_1001391_9480535', 'https://www.walmart.com/browse/food/vegan-cookies/976759_976787_1001391_8007359']

Answer 2: (score: 0)

Using only BeautifulSoup:

import json
import requests
from bs4 import BeautifulSoup

url = "https://www.walmart.com/browse/snacks-cookies-chips/cookies/976759_976787_1001391"

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0",
    "Accept-Language": "en-US,en;q=0.5",
}

soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
data = json.loads(soup.select_one("#searchContent").contents[0])

# uncomment to see all data:
# print(json.dumps(data, indent=4))


def find_departments(data):
    if isinstance(data, dict):
        if "name" in data and data["name"] == "Departments":
            yield data
        else:
            for v in data.values():
                yield from find_departments(v)
    elif isinstance(data, list):
        for v in data:
            yield from find_departments(v)


departments = next(find_departments(data), {})

for d in departments.get("values", []):
    print(
        "{:<30} {}".format(
            d["name"], "https://www.walmart.com" + d["baseSeoURL"]
        )
    )

Prints:

Chocolate Cookies              https://www.walmart.com/browse/food/chocolate-cookies/976759_976787_1001391_4007138
Cookies                        https://www.walmart.com/browse/food/cookies/976759_976787_1001391_8331066
Butter Cookies                 https://www.walmart.com/browse/food/butter-cookies/976759_976787_1001391_7803640
Shortbread Cookies             https://www.walmart.com/browse/food/shortbread-cookies/976759_976787_1001391_8026949
Coconut Cookies                https://www.walmart.com/browse/food/coconut-cookies/976759_976787_1001391_6970757
Healthy Cookies                https://www.walmart.com/browse/food/healthy-cookies/976759_976787_1001391_7466302
Keebler Cookies                https://www.walmart.com/browse/food/keebler-cookies/976759_976787_1001391_3596825
Biscotti                       https://www.walmart.com/browse/food/biscotti/976759_976787_1001391_2224095
Gluten-Free Cookies            https://www.walmart.com/browse/food/gluten-free-cookies/976759_976787_1001391_4362193
Molasses Cookies               https://www.walmart.com/browse/food/molasses-cookies/976759_976787_1001391_3338971
Peanut Butter Cookies          https://www.walmart.com/browse/food/peanut-butter-cookies/976759_976787_1001391_6460174
Pepperidge Farm Cookies        https://www.walmart.com/browse/food/pepperidge-farm-cookies/976759_976787_1001391_2410932
Snickerdoodle Cookies          https://www.walmart.com/browse/food/snickerdoodle-cookies/976759_976787_1001391_8926167
Sugar-Free Cookies             https://www.walmart.com/browse/food/sugar-free-cookies/976759_976787_1001391_5314659
Tate's Cookies                 https://www.walmart.com/browse/food/tate-s-cookies/976759_976787_1001391_9480535
Vegan Cookies                  https://www.walmart.com/browse/food/vegan-cookies/976759_976787_1001391_8007359
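The find_departments generator above does not depend on Walmart's exact payload shape; it recursively walks any nested JSON-like structure of dicts and lists. A self-contained check on a toy dict (the data below is made up for illustration, not real Walmart output):

```python
def find_departments(data):
    """Recursively yield every dict whose "name" key equals "Departments"."""
    if isinstance(data, dict):
        if "name" in data and data["name"] == "Departments":
            yield data
        else:
            for v in data.values():
                yield from find_departments(v)
    elif isinstance(data, list):
        for v in data:
            yield from find_departments(v)

toy = {
    "modules": [
        {"name": "Brands", "values": []},
        {"facets": {"name": "Departments", "values": [{"name": "Cookies"}]}},
    ]
}
departments = next(find_departments(toy), {})
print([d["name"] for d in departments.get("values", [])])  # ['Cookies']
```

This recursive search is what makes the approach resilient: if Walmart moves the "Departments" facet deeper into the JSON, the generator still finds it.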

Answer 3: (score: 0)

Why not use .visibility_of_all_elements_located?

texts = []
links = []

driver.get('https://www.walmart.com/browse/snacks-cookies-chips/cookies/976759_976787_1001391')
wait = WebDriverWait(driver, 60)
wait.until(EC.element_to_be_clickable((By.XPATH, "//span[text()='See all Departments']/parent::button"))).click()
elements = wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "li.department-single-level a")))
for element in elements:
    # get the text
    texts.append(element.text)
    # get the link via the href attribute
    links.append(element.get_attribute('href'))
    
print(texts)
print(links)

Console output:

[u'Chocolate Cookies', u'Cookies', u'Butter Cookies', u'Shortbread Cookies', u'Coconut Cookies', u'Healthy Cookies', u'Keebler Cookies', u'Biscotti', u'Gluten-Free Cookies', u'Molasses Cookies', u'Peanut Butter Cookies', u'Pepperidge Farm Cookies', u'Snickerdoodle Cookies', u'Sugar-Free Cookies', u"Tate's Cookies", u'Vegan Cookies']
[u'https://www.walmart.com/browse/food/chocolate-cookies/976759_976787_1001391_4007138', u'https://www.walmart.com/browse/food/cookies/976759_976787_1001391_8331066', u'https://www.walmart.com/browse/food/butter-cookies/976759_976787_1001391_7803640', u'https://www.walmart.com/browse/food/shortbread-cookies/976759_976787_1001391_8026949', u'https://www.walmart.com/browse/food/coconut-cookies/976759_976787_1001391_6970757', u'https://www.walmart.com/browse/food/healthy-cookies/976759_976787_1001391_7466302', u'https://www.walmart.com/browse/food/keebler-cookies/976759_976787_1001391_3596825', u'https://www.walmart.com/browse/food/biscotti/976759_976787_1001391_2224095', u'https://www.walmart.com/browse/food/gluten-free-cookies/976759_976787_1001391_4362193', u'https://www.walmart.com/browse/food/molasses-cookies/976759_976787_1001391_3338971', u'https://www.walmart.com/browse/food/peanut-butter-cookies/976759_976787_1001391_6460174', u'https://www.walmart.com/browse/food/pepperidge-farm-cookies/976759_976787_1001391_2410932', u'https://www.walmart.com/browse/food/snickerdoodle-cookies/976759_976787_1001391_8926167', u'https://www.walmart.com/browse/food/sugar-free-cookies/976759_976787_1001391_5314659', u'https://www.walmart.com/browse/food/tate-s-cookies/976759_976787_1001391_9480535', u'https://www.walmart.com/browse/food/vegan-cookies/976759_976787_1001391_8007359']

The following imports are required:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC