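I'm trying to extract price information and the h1 tag text from cars.com, which has drop-down lists for searching the site.
I want to select different models and look up their prices, but the choices under "Model" depend on the selected "Make". Using Selenium, I already have every combination of the drop-downs. For each combination, how do I get the H1 information, e.g. with soup.find("h1")?
The code is below: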
from selenium import webdriver
from selenium.webdriver.support.ui import Select
import time
driver = webdriver.Chrome('C:/Users/chromedriver.exe')
driver.get('https://www.cars.com/')
time.sleep(4)
selectMake = Select(driver.find_element_by_name("makeId"))
time.sleep(2)
selectModel = Select(driver.find_element_by_name("modelId"))
data = []
for makesOption in selectMake.options:
    makesText = makesOption.text
    selectMake.select_by_visible_text(makesText)
    time.sleep(1)
    # re-locate the model drop-down, since its options change with the make
    selectModel = Select(driver.find_element_by_name("modelId"))
    for modelOption in selectModel.options:
        modelText = modelOption.text
        selectModel.select_by_visible_text(modelText)
        data.append([makesText, modelText])
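Answer 0 (score: 1)
You initialize Select but never select anything; see the documentation for Select for details on how to use it.
With WebDriverWait you can wait for a specific condition on an element. In the code below I use wait.until(EC.element_to_be_clickable((By.NAME, "makeId"))) instead of sleep: Selenium checks every 0.5 seconds whether the element is clickable, times out after 10 seconds, and moves on as soon as the clickable condition is met.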
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
driver = webdriver.Chrome('C:/Users/chromedriver.exe')
wait = WebDriverWait(driver, 10)
driver.get('https://www.cars.com/')
select_make = Select(wait.until(EC.element_to_be_clickable((By.NAME, "makeId"))))
select_make.select_by_visible_text("BMW")
select_model = Select(wait.until(EC.element_to_be_clickable((By.NAME, "modelId"))))
select_model.select_by_visible_text("- M850 Gran Coupe")
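To make the polling behaviour described above concrete, here is a minimal sketch of the idea in plain Python. It is not Selenium's implementation: `wait_until` and `ready_on_third_call` are hypothetical names, and it raises the built-in `TimeoutError` where Selenium raises its own `TimeoutException`.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() until it returns a truthy value or the timeout elapses.

    Simplified stand-in for the behaviour of WebDriverWait.until:
    the truthy result is returned as soon as it appears.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll_interval)


# Usage: a fake condition that only becomes truthy on the third check,
# standing in for "the element is now clickable".
calls = {"n": 0}


def ready_on_third_call():
    calls["n"] += 1
    return "ready" if calls["n"] >= 3 else False


outcome = wait_until(ready_on_third_call, timeout=2.0, poll_interval=0.01)
```

Compared with a fixed sleep, the wait ends as soon as the condition holds, so a fast page costs almost nothing and a slow page still gets the full timeout.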
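Answer 1 (score: 1)
All the information you need is contained in json objects in the source of each page. Fortunately, javascript isn't needed to retrieve them, so you don't need selenium, which is inherently slow; you can simply use requests to retrieve the json object and convert it into a python object, i.e.:
import requests, re, json
x = requests.get("https://cars.com")
if x.status_code == 200:
    js_obj = re.findall("REDUX_STATE = (.*?)</script>", x.text, re.IGNORECASE | re.MULTILINE)
    if js_obj:
        j_obj = json.loads(js_obj[0]) # see the tree view of the object in the notes
The json object on the cars.com home page contains all makes and models, represented by IDs, i.e.:
mkId = Make ID
mdId = Model ID
With this information we can construct search queries for all makes and models:
https://www.cars.com/for-sale/searchresults.action/?dealerType=all&mdId=20773&mkId=20001&page=1&perPage=100
cars.com.py prints (v_make, name, v_price, v_mileage) for each vehicle found: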
import requests, re, json
x = requests.get("https://cars.com")
if x.status_code == 200:
    js_obj = re.findall("REDUX_STATE = (.*?)</script>", x.text, re.IGNORECASE | re.MULTILINE)
    if js_obj:
        j_obj = json.loads(js_obj[0])
        for model in j_obj['home']['makeModels']['models'][:1]: # remove [:1] to parse all makes and models
            mkId = model['makeId']
            mdId = model['id']
            label = model['label']
            name = model['name']
            #print(mkId, mdId, label, name)
            # 20001 20773 CL CL
            s_url = f"https://www.cars.com/for-sale/searchresults.action/?dealerType=all&mdId={mdId}&mkId={mkId}&page=1&perPage=100"
            s_page = requests.get(s_url)
            if s_page.status_code == 200:
                s_html = re.findall(r"CARS\.digitalData = (.*?);\s+</script>", s_page.text, re.IGNORECASE | re.MULTILINE)
                if s_html:
                    s_obj = json.loads(s_html[0])
                    if "page" in s_obj:
                        if "vehicle" in s_obj['page']:
                            for v in s_obj["page"]['vehicle']:
                                v_price = v['price']
                                v_make = v['make']
                                v_mileage = v['mileage']
                                #...
                                print(v_make, name, v_price, v_mileage)
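The steps in this answer can also be split into small helpers that are easy to try offline. The sketch below is only illustrative: `extract_json`, `search_url` and `first_h1` are hypothetical names, the two sample HTML strings are made up to mimic the structure the script expects, and the real pages may look different or may have changed. Since the question asks about the H1, `first_h1` shows one way to read it from fetched page source using the standard library's `html.parser` in place of `soup.find("h1")`.

```python
import json
import re
from html.parser import HTMLParser
from urllib.parse import urlencode

SEARCH_BASE = "https://www.cars.com/for-sale/searchresults.action/"


def extract_json(html, pattern):
    """Return the first JSON value captured by group 1 of pattern, or None."""
    match = re.search(pattern, html, re.IGNORECASE | re.DOTALL)
    if not match:
        return None
    return json.loads(match.group(1))


def search_url(mk_id, md_id, page=1, per_page=100):
    """Build a search URL with the same parameters as the script above."""
    query = urlencode({"dealerType": "all", "mdId": md_id, "mkId": mk_id,
                       "page": page, "perPage": per_page})
    return f"{SEARCH_BASE}?{query}"


class _H1Parser(HTMLParser):
    """Collects the text of the first <h1> element only."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and not self.done:
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1" and self.in_h1:
            self.in_h1 = False
            self.done = True

    def handle_data(self, data):
        if self.in_h1:
            self.parts.append(data)


def first_h1(html):
    """Return the stripped text of the first <h1>, or None if there is none."""
    parser = _H1Parser()
    parser.feed(html)
    text = "".join(parser.parts).strip()
    return text or None


# Made-up sample pages (assumptions, not real cars.com responses).
sample_home = ('<script>window.REDUX_STATE = {"home": {"makeModels": {"models": '
               '[{"makeId": 20001, "id": 20773, "label": "CL", "name": "CL"}]}}}</script>')
sample_results = "<html><body><h1> Used Acura CL for Sale </h1></body></html>"

state = extract_json(sample_home, r"REDUX_STATE = (.*?)</script>")
urls = [search_url(m["makeId"], m["id"]) for m in state["home"]["makeModels"]["models"]]
heading = first_h1(sample_results)
```

In a live run, `sample_home` and `sample_results` would be replaced by the `.text` of the corresponding `requests.get(...)` responses, with the same status-code checks as in the script above.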
Notes: