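I have this website: https://www.adbc.gov.ae/BusinessActivityInfo/BusinessActivity.aspx?culture=en-US
The site has two dropdown menus: Category and SubCategory. After a Category and a SubCategory are selected, the page displays a table, and each Category/SubCategory combination displays a different table. How can I scrape that table for every Category and SubCategory combination?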
This is what I have tried so far:
import requests
from bs4 import BeautifulSoup

url = 'https://www.adbc.gov.ae/BusinessActivityInfo/BusinessActivity.aspx?culture=en-US'
req = requests.get(url)
soup = BeautifulSoup(req.text, "lxml")
# The Category dropdown is the <select> named ddlNatureId
content = soup.find("select",{"name":"ddlNatureId"})
options = content.find_all("option")
options1 = [y.text for y in options]
options1
Output:
['',
'ADVOCATE OFFICES',
'AGENCIES',
'AGRICULTURE',
'AGRICULTURE, LIVESTOCK AND FISHERIES ACTIVITIES',
'ANIMAL HUSBANDRY',
'ANIMAL SHELTERING SERVICES',
'ART GALLERY',
'AUDITING OFFICES',
'BAKERIES AND SWEETS',
...
]
Update:
This is what I have so far. I found that Selenium can select a value from the dropdown. Here is my code:
Libraries used:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.expected_conditions import presence_of_element_located
from selenium.webdriver.support.ui import Select
import time
import sys
from bs4 import BeautifulSoup
import requests
Setting up the webdriver:
url = 'https://www.adbc.gov.ae/BusinessActivityInfo/BusinessActivity.aspx?culture=en-US'
chrome_driver_path = 'D:\\work\\crawl data\\selenium_project\\chromedriver.exe'
chrome_options = Options()
chrome_options.add_argument('--headless')
webdriver = webdriver.Chrome(
    executable_path=chrome_driver_path, options=chrome_options
)
Code that loads the site and scrapes the data:
with webdriver as driver:
    # Set timeout time
    wait = WebDriverWait(driver, 10)
    # retrieve url in headless browser
    driver.get(url)
    # find select box
    search = Select(driver.find_element_by_id("ddlNatureId"))
    search.select_by_value('ADVOCATE OFFICES')
    req = requests.get(url)
    soup = BeautifulSoup(req.text, "lxml")
    price=soup.find("select",{"name":"ddlSubCategId"})
    options = price.find_all("option")
    options1 = [y.text for y in options]
    driver.close()
print(options1)
Output:
[]
Expected output (this should be the list of SubCategory options for the Category 'ADVOCATE OFFICES'):
['',
'Advertising Agent',
'Advocate Offices',
'Agricultural Equipment And Tools Rental',
'Air Transport',
'Agents',
...
]
My problem now is that I cannot get the SubCategory data after selecting 'ADVOCATE OFFICES'. How can I solve this?
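One possible direction, sketched under assumptions rather than verified against the live site: the second requests.get(url) in the code above downloads a fresh copy of the page that knows nothing about the selection made in the Selenium session, so the SubCategory dropdown in that copy is still empty. Reading the options from driver.page_source after the selection would avoid that. In the sketch below, option_texts and subcategory_options are hypothetical helper names. It assumes that the page fills ddlSubCategId after a Category is chosen, that the filled list has more than the blank entry, and that the option's visible text (not necessarily its value attribute) is 'ADVOCATE OFFICES'. The standard-library HTML parser stands in for BeautifulSoup only to keep the parsing part free of third-party dependencies.

```python
from html.parser import HTMLParser


class SelectOptionParser(HTMLParser):
    """Collect the text of every <option> inside the <select> with a given name."""

    def __init__(self, select_name):
        super().__init__()
        self.select_name = select_name
        self.in_select = False   # inside the target <select>?
        self.in_option = False   # inside one of its <option> tags?
        self.options = []

    def handle_starttag(self, tag, attrs):
        if tag == "select":
            self.in_select = dict(attrs).get("name") == self.select_name
            self.in_option = False
        elif tag == "option" and self.in_select:
            self.in_option = True
            self.options.append("")

    def handle_endtag(self, tag):
        if tag == "select":
            self.in_select = False
            self.in_option = False
        elif tag == "option":
            self.in_option = False

    def handle_data(self, data):
        if self.in_select and self.in_option:
            self.options[-1] += data


def option_texts(html, select_name):
    """Return the stripped option texts of the named <select> in an HTML string."""
    parser = SelectOptionParser(select_name)
    parser.feed(html)
    return [text.strip() for text in parser.options]


def subcategory_options(driver, category_text, timeout=10):
    """Select a Category in an already loaded page and read the SubCategory options.

    Assumption: the page repopulates ddlSubCategId after the selection.
    """
    # Imported lazily so the parsing helper above runs without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import Select, WebDriverWait

    Select(driver.find_element(By.ID, "ddlNatureId")).select_by_visible_text(category_text)
    # Wait until the second dropdown holds more than the blank entry (assumption).
    WebDriverWait(driver, timeout).until(
        lambda d: len(option_texts(d.page_source, "ddlSubCategId")) > 1
    )
    # Read from the browser's current DOM, not from a new requests.get(url).
    return option_texts(driver.page_source, "ddlSubCategId")
```

With a driver that has already called driver.get(url), subcategory_options(driver, 'ADVOCATE OFFICES') would return the list to compare against the expected output. The same pattern (select, wait, read driver.page_source) could then be repeated in a loop over every Category and SubCategory to reach the table.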