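Web scraping: extracting values for all combinations of dropdown lists

Time: 2020-02-21 20:05:20

Tags: python selenium web-scraping

I am trying to extract price information and the h1 tag from cars.com, which has dropdown lists for searching the site.

I want to select different models and search for their prices, but the options in the "model" dropdown depend on the selected "make". I already have all combinations of the dropdowns using Selenium. For each dropdown combination, how do I get the H1 information, for example with soup.find("h1")?

The code is as follows: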

from selenium import webdriver
from selenium.webdriver.support.ui import Select
import time

driver = webdriver.Chrome('C:/Users/chromedriver.exe')
driver.get('https://www.cars.com/')
time.sleep(4)

selectMake = Select(driver.find_element_by_name("makeId"))


time.sleep(2)


selectModel = Select(driver.find_element_by_name("modelId"))

data = []
for makesOption in selectMake.options:
    makesText = makesOption.text
    selectMake.select_by_visible_text(makesText)
    time.sleep(1)
    selectModel = Select(driver.find_element_by_name("modelId"))
    for modelOption in selectModel.options:
        modelText = modelOption.text
        selectModel.select_by_visible_text(modelText)
        data.append([makesText,modelText])

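2 answers:

Answer 0 (score: 1)

You initialize Select but never select anything; see the Selenium documentation on Select for details on how to use it.

With WebDriverWait you can wait for a specific condition on an element. In the code below I use wait.until(EC.element_to_be_clickable((By.NAME, "makeId"))) instead of sleep. Selenium checks every 0.5 seconds whether the element is clickable and times out after 10 seconds; as soon as the clickable condition is met, it moves on.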

from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

driver = webdriver.Chrome('C:/Users/chromedriver.exe')
wait = WebDriverWait(driver, 10)

driver.get('https://www.cars.com/')

select_make = Select(wait.until(EC.element_to_be_clickable((By.NAME, "makeId"))))
select_make.select_by_visible_text("BMW")

select_model = Select(wait.until(EC.element_to_be_clickable((By.NAME, "modelId"))))
select_model.select_by_visible_text("- M850 Gran Coupe")
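
To make the polling behaviour described in this answer concrete, here is a rough pure-Python sketch of a wait loop of the same kind. It is not Selenium's implementation: wait_until and its parameters are hypothetical names chosen for illustration, and the defaults of 10 seconds and 0.5 seconds simply mirror the figures quoted above.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed. (Hypothetical helper, not part of Selenium.)
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Usage example: a stand-in condition that becomes true on the third check.
attempts = []


def ready_on_third_check():
    attempts.append(1)
    return "ready" if len(attempts) >= 3 else None


outcome = wait_until(ready_on_third_check, timeout=2.0, poll_interval=0.01)
print(outcome, len(attempts))
```

The point of the design is that the loop returns as soon as the condition holds, so a fast page costs almost nothing, whereas a fixed time.sleep(4) always costs the full four seconds.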

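Answer 1 (score: 1)

All the information you need is contained in json objects in the source of each page. Luckily, javascript is not needed to retrieve them, so you don't need selenium, which is inherently slow. You can simply use requests to retrieve the json object and convert it to a python object, i.e.:

x = requests.get("https://cars.com")
if x.status_code == 200:
    js_obj = re.findall("REDUX_STATE = (.*?)</script>", x.text, re.IGNORECASE | re.MULTILINE)
    if js_obj:
        j_obj = json.loads(js_obj[0])
        # check the tree view of the object on notes

The json object on the cars.com homepage contains all makes and models, represented by IDs, i.e.:

  • mkId = make ID
  • mdId = model ID

With this information, we can construct search queries for all makes and models:

https://www.cars.com/for-sale/searchresults.action/?dealerType=all&mdId=20773&mkId=20001&page=1&perPage=100

cars.com.py, which prints (v_make, name, v_price, v_mileage) for each vehicle: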

import requests, re, json

x = requests.get("https://cars.com")
if x.status_code == 200:
    js_obj = re.findall("REDUX_STATE = (.*?)</script>", x.text, re.IGNORECASE | re.MULTILINE)
    if js_obj:
        j_obj = json.loads(js_obj[0])
        for model in j_obj['home']['makeModels']['models'][:1]: # remove [:1] to parse all makes and models
            mkId = model['makeId']
            mdId = model['id']
            label = model['label']
            name = model['name']

            #print(mkId, mdId, label, name)
            # 20001 20773 CL CL

            s_url = f"https://www.cars.com/for-sale/searchresults.action/?dealerType=all&mdId={mdId}&mkId={mkId}&page=1&perPage=100"

            s_page = requests.get(s_url)
            if s_page.status_code == 200:

                s_html = re.findall(r"CARS\.digitalData = (.*?);\s+</script>", s_page.text, re.IGNORECASE | re.MULTILINE)
                if s_html:
                    s_obj = json.loads(s_html[0])
                    if "page" in s_obj:
                        if "vehicle" in s_obj['page']:
                            for v in s_obj["page"]['vehicle']:
                                v_price = v['price']
                                v_make = v['make']
                                v_mileage = v['mileage']
                                #...
                                print(v_make, name, v_price, v_mileage)
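
The two steps this answer relies on, pulling a JSON object out of an inline script and building the search URL from the IDs, can also be sketched as small functions that run without network access. This is only an illustration under assumptions: extract_json, build_search_url, and SEARCH_BASE are hypothetical names, and sample_home is a made-up snippet that merely mirrors the key names used in the script above, not real cars.com markup.

```python
import json
import re
from urllib.parse import urlencode

# Base of the search URL quoted in the answer.
SEARCH_BASE = "https://www.cars.com/for-sale/searchresults.action/"


def extract_json(html, marker):
    """Return the JSON value assigned to `marker` in an inline script, or None."""
    # Capture everything between "<marker> =" and the closing script tag,
    # tolerating an optional trailing semicolon and whitespace.
    pattern = re.escape(marker) + r"\s*=\s*(.*?);?\s*</script>"
    match = re.search(pattern, html, re.DOTALL)
    if not match:
        return None
    return json.loads(match.group(1))


def build_search_url(make_id, model_id, page=1, per_page=100):
    """Build a search URL with the same query parameters as the answer's URL."""
    query = urlencode({
        "dealerType": "all",
        "mdId": model_id,
        "mkId": make_id,
        "page": page,
        "perPage": per_page,
    })
    return f"{SEARCH_BASE}?{query}"


# Made-up stand-in for the homepage source (assumption, for illustration only).
sample_home = (
    '<script>window.REDUX_STATE = {"home": {"makeModels": {"models": '
    '[{"makeId": 20001, "id": 20773, "label": "CL", "name": "CL"}]}}}</script>'
)

state = extract_json(sample_home, "REDUX_STATE")
urls = [
    build_search_url(model["makeId"], model["id"])
    for model in state["home"]["makeModels"]["models"]
]
print(urls[0])
```

Returning None when the marker is missing keeps the caller's check as simple as the `if js_obj:` test in the script above, and urlencode avoids hand-assembling the query string.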

Notes: