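I'm trying to scrape product names and their prices from a local website.
The site loads its content dynamically, so requests doesn't work for it. I'm using Selenium and BeautifulSoup.
However, it counts every product twice (I get 2 links for the same product). Is there a fix for this?
Also, after collecting the product links I need to fetch the product details (such as name and price), but it again double-counts the products and doesn't retrieve the name or price.
My code: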
import pandas as pd
from time import sleep
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

productlinks = []
baseurl = "https://www.technodom.kz/"
options = Options()
options.headless = True
driver = webdriver.Chrome(r"C:\path\to\chromedriver.exe", options=options)

for x in range(1, 5):
    driver.get(
        f"https://www.technodom.kz/bytovaja-tehnika/uhod-za-odezhdoj/stiral-nye-mashiny/f/brands/lg/brands/samsung?page={x}"
    )
    # Wait for the page to fully render
    sleep(3)
    soup = BeautifulSoup(driver.page_source, "lxml")
    product_list = soup.find_all("li", class_="ProductCard")
    for item in product_list:
        for link in item.find_all("a", href=True):
            productlinks.append(baseurl + link["href"])

print(productlinks)

wmlist = []
for link in productlinks:
    driver.get(link)
    soup = BeautifulSoup(driver.page_source, "lxml")
    print(link)
    name = soup.find('h1', class_='ProductHeader-Title').text.strip()
    price = soup.find('p', class_='ProductPrice ProductInformation-Price').text.strip()
    wm = {
        'Model': name,
        'Price': price
    }
    wmlist.append(wm)
    print('Saving:', wm['Model'])

df = pd.DataFrame(wmlist)
df.to_excel("TD pricesTEST.xlsx", sheet_name='TEW', index=False)
Answer 0 (score: 1)
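Those nested loops are what doubles your output: the inner loop appends a link for every <a> it finds inside each product card. Besides, all you need is a single <a> tag with the class ProductCard-Content.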
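If you would rather keep your original loop structure, a fallback is to deduplicate the collected links before visiting them. The snippet below is only a minimal sketch of that idea: the helper name unique_absolute_links and the sample hrefs are hypothetical, and it assumes the repeated anchors in a card carry exactly the same href. It also uses urljoin instead of string concatenation, so a leading slash in the href does not produce a double slash after the domain.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.technodom.kz/"


def unique_absolute_links(hrefs, base_url=BASE_URL):
    """Resolve each href against base_url and drop repeats, keeping first-seen order."""
    absolute = (urljoin(base_url, href) for href in hrefs)
    # dict keys keep insertion order, so fromkeys() works as an ordered set
    return list(dict.fromkeys(absolute))


# Hypothetical hrefs: every product shows up twice, as in the question
hrefs = [
    "/p/washer-a-111",
    "/p/washer-a-111",
    "/p/washer-b-222",
    "/p/washer-b-222",
]
links = unique_absolute_links(hrefs)
print(links)
```

This only hides the symptom, though. Selecting the one anchor you actually need avoids collecting the duplicates in the first place.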
I've simplified your code a bit. Here's how to get the product names, prices, and links and finally dump them to an Excel file:
from time import sleep

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True
driver = webdriver.Chrome(options=options)

final_output = []
pages = list(range(1, 5))

for page_number in pages:
    print(f"Scraping page: {page_number} / {len(pages)}")
    driver.get(
        f"https://www.technodom.kz/bytovaja-tehnika/uhod-za-odezhdoj/"
        f"stiral-nye-mashiny/f/brands/lg/brands/samsung?page={page_number}"
    )
    sleep(5)
    soup = BeautifulSoup(
        driver.page_source,
        "lxml",
    ).find_all("a", class_="ProductCard-Content")

    links = [f"https://www.technodom.kz/{anchor['href']}" for anchor in soup]
    names = [name.find("h4").getText() for name in soup]
    prices = [price.find("data")["value"] for price in soup]

    final_output.append(
        [
            [name, price, link] for name, price, link
            in zip(names, prices, links)
        ]
    )

df = pd.DataFrame(
    [data for sub_list in final_output for data in sub_list],
    columns=["NAME", "PRICE", "LINK"],
)
df.to_excel("test.xlsx", sheet_name='TEW', index=False)
Output: