Beautiful Soup / Selenium web scraping

Time: 2020-12-20 13:27:16

Tags: python selenium selenium-webdriver web-scraping beautifulsoup

I am trying to get product names and their prices from a local website.

The website is loaded dynamically, so requests doesn't work for it. I am using Selenium and Beautiful Soup.

But it counts every product twice (I get two links for the same product). Is there any solution for this?

Also, after getting the product links, I need to get the product information (e.g. name and price), but again it double-counts the products, and it doesn't get the name and price.

My code:

import pandas as pd        
from time import sleep
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
    
productlinks = []
baseurl = "https://www.technodom.kz/"
options = Options()
options.headless = True

driver = webdriver.Chrome(r"C:\path\to\chromedriver.exe", options=options)


for x in range(1, 5):
    driver.get(
        f"https://www.technodom.kz/bytovaja-tehnika/uhod-za-odezhdoj/stiral-nye-mashiny/f/brands/lg/brands/samsung?page={x}"
    )
    # Wait for the page to fully render
    sleep(3)
    soup = BeautifulSoup(driver.page_source, "lxml")
    product_list = soup.find_all("li", class_="ProductCard")
    for item in product_list:
        for link in item.find_all("a", href=True):
            productlinks.append(baseurl + link["href"])
    print(productlinks)

wmlist = []
for link in productlinks:
    driver.get(link)
    soup = BeautifulSoup(driver.page_source, "lxml")
    print(link)
    name = soup.find('h1', class_='ProductHeader-Title').text.strip()
    price = soup.find('p', class_='ProductPrice ProductInformation-Price').text.strip()

    wm = {
        'Model':name,
        'Price': price
    }
    wmlist.append(wm)
    print('Saving:', wm['Model'])
df = pd.DataFrame(wmlist)

df.to_excel("TD pricesTEST.xlsx", sheet_name='TEW', index=False)

1 Answer:

Answer 0: (score: 1)

Those nested loops are what doubles your output. Besides, all you need is a single <a> tag with the class ProductCard-Content.
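To see why the counts double: each <li class="ProductCard"> contains two anchors pointing at the same product (the image link and the content link), so looping over every <a> inside every card appends each URL twice. A minimal sketch of the difference, using hypothetical markup that mirrors the page's card structure:

from bs4 import BeautifulSoup

# Hypothetical markup mirroring a product card: two <a> tags
# (image link + content link) that point at the same product.
html = """
<li class="ProductCard">
  <a href="/p/123"><img src="x.jpg"/></a>
  <a class="ProductCard-Content" href="/p/123"><h4>Washer</h4></a>
</li>
"""
soup = BeautifulSoup(html, "lxml")

# Nested find_all over every anchor collects both -> duplicate links.
card = soup.find("li", class_="ProductCard")
print([a["href"] for a in card.find_all("a", href=True)])
# ['/p/123', '/p/123']

# Selecting only the ProductCard-Content anchor yields each product once.
print([a["href"] for a in soup.find_all("a", class_="ProductCard-Content")])
# ['/p/123']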

I've simplified your code a bit. Here's how you can get the product names, prices, and links, and finally dump them to an Excel file:

from time import sleep

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True
driver = webdriver.Chrome(options=options)

final_output = []
pages = list(range(1, 5))

for page_number in pages:
    print(f"Scraping page: {page_number} / {len(pages)}")
    driver.get(
        f"https://www.technodom.kz/bytovaja-tehnika/uhod-za-odezhdoj/"
        f"stiral-nye-mashiny/f/brands/lg/brands/samsung?page={page_number}"
    )
    sleep(5)
    # Each product is represented by a single ProductCard-Content anchor;
    # its <h4> child holds the name and its <data value="..."> the price.
    soup = BeautifulSoup(
        driver.page_source,
        "lxml",
    ).find_all("a", class_="ProductCard-Content")

    links = [f"https://www.technodom.kz/{anchor['href']}" for anchor in soup]
    names = [name.find("h4").getText() for name in soup]
    prices = [price.find("data")["value"] for price in soup]

    final_output.append(
        [
            [name, price, link] for name, price, link
            in zip(names, prices, links)
        ]
    )

df = pd.DataFrame(
    [data for sub_list in final_output for data in sub_list],
    columns=["NAME", "PRICE", "LINK"],
)
df.to_excel("test.xlsx", sheet_name='TEW', index=False)
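A note on the fixed sleep(5): it wastes time on fast pages and can still be too short on slow ones. An explicit wait for the product cards is a more robust alternative; a minimal sketch, assuming the cards keep the ProductCard-Content class:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Block for up to 15 seconds until at least one product-card anchor
# is present in the DOM, then continue as soon as it appears.
WebDriverWait(driver, 15).until(
    EC.presence_of_all_elements_located((By.CLASS_NAME, "ProductCard-Content"))
)

This would replace the sleep(5) call inside the page loop; the rest of the scraping logic stays unchanged.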

Output:

[screenshot of the resulting Excel sheet]