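I am trying to parse a table using Selenium and Beautiful Soup, but I am having trouble locating the cells by class and pulling their values. Every column seems to share the same class name, which makes it harder. Here is part of the HTML I am trying to parse:
This is how the table looks: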
My code so far:
driver = webdriver.Chrome()
driver.get(base_url)
driver.implicitly_wait(100)
driver.find_elements_by_class_name("plp-pod__image")[0].click()
first = driver.find_elements_by_class_name("col-6 specs__cell specs__cell--label")[0].getText()
first
So basically, I open the Chrome browser, load the page for the item being looked up, then find all elements with the class "col-6 specs__cell specs__cell--label" and try to get the text from the first one that appears. I am trying to get all 5 dimensions and their values.
I get this error when executing the code:
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-27-2e124acf6be5> in <module>
      3 driver.implicitly_wait(100)
      4 driver.find_elements_by_class_name("plp-pod__image")[0].click()
----> 5 first = driver.find_elements_by_class_name("col-6 specs__cell specs__cell--label")[0].getText()
IndexError: list index out of range
Do you know how I can parse these elements to get all 5 dimensions and their values into a pandas DataFrame?
I tried combining both of your suggestions. I then went to the web page used for testing and opened the test item, but the code does not seem to read the correct class, so it raises an error. The combined code:
import pandas as pd
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException, NoSuchFrameException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
i = "Marshalltown PT164BR"
base_url = "https://www.homedepot.com/s/" + i + "?NCNI-5"
driver = webdriver.Chrome()
driver.get(base_url)
# Open the first search result once it is clickable.
WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".plp-pod__image"))).click()
#%%
groups = driver.find_elements_by_class_name("specs__group")
data = {}
for group in groups:
    if "placeholder" not in group.get_attribute("class"):
        specs = group.find_elements_by_class_name("specs__cell")
        dimension = specs[0].text.strip()
        value = float(specs[1].text.replace("in","").strip())
        #print(dimension,":",value)
        if dimension not in data:
            data[dimension] = []
        data[dimension].append(value)
print(data)
data_frame = pd.DataFrame(data=data)
print(data_frame)
Answer 0 (score: 2)
Following up on the previous post, if I use this HTML:
<html>
<head></head>
<body>
  <div class="specs__group col-12 col-lg-6" style="min-height: 39px;">
    <div class="col-6 specs__cell specs__cell--label">Blade Length (in.)</div>
    <div class="col-6 specs__cell">16 in</div>
  </div>
  <div class="specs__group col-12 col-lg-6" style="min-height: 39px;">
    <div class="col-6 specs__cell specs__cell--label">Blade Width (in.)</div>
    <div class="col-6 specs__cell">4.5</div>
  </div>
  <div class="specs__group col-12 col-lg-6" style="min-height: 39px;">
    <div class="col-6 specs__cell specs__cell--label">Product Height (in.)</div>
    <div class="col-6 specs__cell">3.63 in</div>
  </div>
  <div class="specs__group col-12 col-lg-6" style="min-height: 39px;">
    <div class="col-6 specs__cell specs__cell--label">Product Length (in.)</div>
    <div class="col-6 specs__cell">16 in</div>
  </div>
  <div class="specs__group col-12 col-lg-6" style="min-height: 39px;">
    <div class="col-6 specs__cell specs__cell--label">Product Width (in.)</div>
    <div class="col-6 specs__cell">4.5 in</div>
  </div>
  <div class="specs__group placeholder" style="min-height: 39px;">
    ??
  </div>
</body>
</html>
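you can build a dictionary or a DataFrame:
from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException, NoSuchFrameException
from selenium.webdriver.common.by import By
base_url = "file:///C:/Users/.../blade.html"
driver = webdriver.Chrome()
driver.get(base_url)
# find_elements(By...) works in Selenium 3 and 4; the find_elements_by_* helpers were removed in Selenium 4.3.
# Each specs__group div holds one label cell followed by one value cell.
groups = driver.find_elements(By.CLASS_NAME, "specs__group")
data = {}
for group in groups:
    # Skip the empty placeholder group at the end of the table.
    if "placeholder" not in group.get_attribute("class"):
        specs = group.find_elements(By.CLASS_NAME, "specs__cell")
        dimension = specs[0].text.strip()
        # Drop the "in" unit suffix so the value parses as a float.
        value = float(specs[1].text.replace("in", "").strip())
        #print(dimension,":",value)
        if dimension not in data:
            data[dimension] = []
        data[dimension].append(value)
print(data)
data_frame = pd.DataFrame(data=data)
print(data_frame)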
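A side note on the IndexError in the question: the class-name locator expects a single class name, so a space-separated string such as "col-6 specs__cell specs__cell--label" matched nothing there, and indexing the empty list with [0] failed. The loop above avoids this by passing one class at a time. If you want to match on several classes at once, a CSS selector that chains them is one option. The snippet below is only a sketch; `classes_to_css` is a hypothetical helper written for illustration, not part of Selenium.

```python
def classes_to_css(class_attr: str) -> str:
    """Turn a space-separated class attribute into a chained CSS class selector."""
    names = class_attr.split()
    if not names:
        raise ValueError("expected at least one class name")
    # ".a.b.c" matches elements that carry all of the classes a, b and c.
    return "".join("." + name for name in names)


selector = classes_to_css("col-6 specs__cell specs__cell--label")
print(selector)  # .col-6.specs__cell.specs__cell--label
```

With Selenium, the resulting string would be passed as `driver.find_elements(By.CSS_SELECTOR, selector)`. Note also that Python `WebElement` objects expose the text through the `.text` property; `getText()` belongs to the Java bindings.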
Answer 1 (score: 1)
The code below gets the product's dimensions.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
import pandas as pd
driver = webdriver.Chrome()
i = "Marshalltown PT164BR"
base_url = "https://www.homedepot.com/s/" + i + "?NCNI-5"
driver.get(base_url)
# Open the first search result once it is clickable.
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".plp-pod__image"))).click()
dimension_types = []
dimension_sizes = []
# Wait for the rows of the first specs table that follows the "Dimensions" heading.
elements = WebDriverWait(driver, 20).until(EC.presence_of_all_elements_located((By.XPATH, "(//h4[text()='Dimensions']/following::div[contains(@class,'specs__table')])[1]/div")))
for ele in elements:
    # Skip placeholder rows, which carry no label/value pair.
    if "placeholder" not in ele.get_attribute("class"):
        # The label cell, then the sibling div right after it, which holds the value.
        dimension_type = ele.find_element(By.XPATH, ".//div[@class='col-6 specs__cell specs__cell--label']").get_attribute("textContent")
        dimension_size = ele.find_element(By.XPATH, ".//div[@class='col-6 specs__cell specs__cell--label']/following-sibling::div[1]").get_attribute("textContent")
        dimension_types.append(dimension_type)
        dimension_sizes.append(dimension_size)
df = pd.DataFrame({"DimensionSize": dimension_sizes, "DimensionType": dimension_types})
print(df)
Output on the console:
  DimensionSize         DimensionType
0         16 in    Blade Length (in.)
1           4.5     Blade Width (in.)
2       3.63 in  Product Height (in.)
3         16 in  Product Length (in.)
4        4.5 in   Product Width (in.)
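The DimensionSize column above holds strings, some with an "in" suffix and some without. If numeric values are needed, one possible approach is to pull the first number out of each string. This is a rough sketch that assumes every value is in inches and contains a single number; `parse_inches` is a hypothetical helper, not something from pandas or Selenium.

```python
import re
from typing import Optional

# Optional sign, optional integer part, optional decimal point, then digits.
_NUMBER = re.compile(r"[-+]?\d*\.?\d+")


def parse_inches(text: str) -> Optional[float]:
    """Return the first number found in `text`, or None if there is none."""
    match = _NUMBER.search(text)
    return float(match.group()) if match else None


sizes = ["16 in", "4.5", "3.63 in", "16 in", "4.5 in"]
values = [parse_inches(s) for s in sizes]
print(values)  # [16.0, 4.5, 3.63, 16.0, 4.5]
```

Applied to the DataFrame above, something like `df["DimensionSize"].map(parse_inches)` should give a numeric column, with None for cells where no number was found.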