Retrieve table data from 2 dropdown menus

Date: 2020-06-08 12:41:08

Tags: python selenium beautifulsoup web-crawler

I have this website: https://www.adbc.gov.ae/BusinessActivityInfo/BusinessActivity.aspx?culture=en-US


This website has 2 dropdown menus: Category and SubCategory. After a Category and a SubCategory are selected, a table is displayed, and different Category/SubCategory combinations display different tables. How can I scrape that table for every Category and SubCategory?

Here is what I have tried so far:

import requests
from bs4 import BeautifulSoup

url = 'https://www.adbc.gov.ae/BusinessActivityInfo/BusinessActivity.aspx?culture=en-US'

req = requests.get(url)
soup = BeautifulSoup(req.text, "lxml")

# find the Category select box and collect the text of each option
content = soup.find("select", {"name": "ddlNatureId"})
options = content.find_all("option")
options1 = [y.text for y in options]
options1

Output:

['',
 'ADVOCATE OFFICES',
 'AGENCIES',
 'AGRICULTURE',
 'AGRICULTURE, LIVESTOCK AND FISHERIES ACTIVITIES',
 'ANIMAL HUSBANDRY',
 'ANIMAL SHELTERING SERVICES',
 'ART GALLERY',
 'AUDITING OFFICES',
 'BAKERIES AND SWEETS',
...
]
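As an aside, each `<option>` also carries a `value` attribute, which is what Selenium's `select_by_value` matches against; on ASP.NET pages these values are often internal IDs rather than the visible text, so it is worth checking them before selecting. A minimal sketch of reading both (the HTML snippet and its values below are made up for illustration, not taken from the actual site):

```python
from bs4 import BeautifulSoup

# illustrative markup standing in for the real Category dropdown;
# the real option values may be IDs rather than the visible text
html = """
<select name="ddlNatureId">
  <option value=""></option>
  <option value="12">ADVOCATE OFFICES</option>
  <option value="15">AGENCIES</option>
</select>
"""

soup = BeautifulSoup(html, "html.parser")
# collect (value, text) pairs for every option in the select box
pairs = [(o.get("value"), o.text)
         for o in soup.select("select[name=ddlNatureId] option")]
print(pairs)
```

If the values turn out to differ from the text, `select_by_visible_text` is the safer Selenium call.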

Update

This is how far I have got. I found out how to select a value from the dropdown using Selenium. Here is my code:

Some libraries:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.expected_conditions import presence_of_element_located
from selenium.webdriver.support.ui import Select
import time
import sys
from bs4 import BeautifulSoup
import requests

Set up the webdriver:

url = 'https://www.adbc.gov.ae/BusinessActivityInfo/BusinessActivity.aspx?culture=en-US'
chrome_driver_path = 'D:\\work\\crawl data\\selenium_project\\chromedriver.exe'

chrome_options = Options()
chrome_options.add_argument('--headless')

webdriver = webdriver.Chrome(
  executable_path=chrome_driver_path, options=chrome_options
)

Code to load the website and scrape the data:

with webdriver as driver:
    # set timeout
    wait = WebDriverWait(driver, 10)

    # retrieve url in headless browser
    driver.get(url)

    # find the Category select box and pick a value
    search = Select(driver.find_element_by_id("ddlNatureId"))
    search.select_by_value('ADVOCATE OFFICES')

    req = requests.get(url)
    soup = BeautifulSoup(req.text, "lxml")

    price = soup.find("select", {"name": "ddlSubCategId"})
    options = price.find_all("option")
    options1 = [y.text for y in options]

    driver.close()

print(options1)

Output:

[]

Expected output (it should be the list of SubCategory entries for the Category 'ADVOCATE OFFICES'):

['', 'Advertising Agent', 'Advocate Offices', 'Agricultural Equipment And Tools Rental', 'Air Transport', 'Agents', ... ]

My problem now is that I cannot get the SubCategory data after selecting 'ADVOCATE OFFICES'. How can I fix this?
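One likely cause, assuming the rest of the setup works: `requests.get(url)` downloads a fresh copy of the page on which no Category has been selected yet, so the SubCategory list in that copy is empty. The HTML the browser is actually showing after the selection is `driver.page_source`, and that is what should be parsed instead. A minimal sketch of the parsing step, shown against a static snippet (the snippet and its values are invented for illustration; in the real run you would pass `driver.page_source`):

```python
from bs4 import BeautifulSoup

def subcategory_options(page_html):
    """Extract SubCategory option texts from rendered page HTML
    (e.g. driver.page_source after a Category has been selected)."""
    soup = BeautifulSoup(page_html, "html.parser")
    select = soup.find("select", {"name": "ddlSubCategId"})
    if select is None:
        return []
    return [o.text for o in select.find_all("option")]

# illustrative markup standing in for driver.page_source
sample = """
<select name="ddlSubCategId">
  <option value=""></option>
  <option value="3">Advertising Agent</option>
  <option value="7">Advocate Offices</option>
</select>
"""
print(subcategory_options(sample))
```

Since selecting a Category triggers an ASP.NET postback that reloads the SubCategory list, it may also be necessary to wait (e.g. with `WebDriverWait`) until the new options are present before reading `driver.page_source`.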

0 Answers