I'm trying to scrape data with Python (the Requests and BeautifulSoup4 libraries, plus Selenium).
When I try to fetch data from certain sites where the data only loads after a delay, I get an empty value back. I understand that to handle this I have to use WebDriverWait.
import requests
from bs4 import BeautifulSoup
# selenium imports
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
# Initialize a Chrome webdriver
driver = webdriver.Chrome()
# Grab the web page
driver.get("http://")
# use selenium.webdriver.support.ui.Select
# that we imported above to grab the Select element called
# lmStatType, then select the first value
# We use By.NAME here because we know the element's name
dropdown = Select(driver.find_element(By.NAME, "lmStatType"))
dropdown.select_by_value("1")
# select the year 2560
dropdown = Select(driver.find_element_by_name("lmYear"))
dropdown.select_by_value("60")
# Now we can grab the search button and click it
search_button = driver.find_elements(By.XPATH, "//*[contains(text(), 'ตกลง')]")[0]
search_button.click()
# the rendered page is available via driver.page_source
# We can feed that into Beautiful Soup
doc = BeautifulSoup(driver.page_source, "html.parser")
# It's a tricky table, also tried with class names
rows = doc.find('table', id='datatable')
print(rows) # returns empty
In the example above, even though I tried several workarounds, I couldn't figure out how to work the WebDriverWait and timeout statements into the flow step by step.
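For reference, an explicit wait usually looks like the sketch below; the locator (By.ID, "datatable") is an assumption based on the table id used above:
# Wait up to 10 seconds for the results table to be present before parsing.
# The locator (By.ID, "datatable") is an assumption based on the id used above.
try:
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "datatable"))
    )
except TimeoutException:
    print("Table did not load within 10 seconds")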
I also tried fetching the district-level data separately (but couldn't figure out the exact class/id):
url = 'http://'
res = requests.get(url)
soup = BeautifulSoup(res.text,"lxml")
for tr in soup.find(class_="display").find_all("tr"):
    data = [item.get_text(strip=True) for item in tr.find_all(["th", "td"])]
    print(data)
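One quick way to see which tables, ids, and classes the static html actually exposes (a small debugging sketch, reusing soup from the snippet above):
# list every table in the static html with its id and class attributes
for table in soup.find_all("table"):
    print(table.get("id"), table.get("class"))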
Any help is appreciated. Thanks in advance, and apologies if this is a duplicate.
Answer 0 (score: 0)
As I stated in the comments, the html actually gives you the endpoint for fetching the data. Getting the data with requests is actually quite easy from the start.
In your html it shows as: "sAjaxSource": "../datasource/showStatProvince.php?statType=1&year=60". This is the endpoint the site uses. So you just need to go one level up in the site's url structure and use "/datasource/..." instead.
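For instance, a quick check of that province endpoint might look like this (a minimal sketch; the base url is assumed from the district endpoint used in the full example below):
import requests

# hypothetical quick check of the sAjaxSource endpoint mentioned above;
# the base url is assumed from the district endpoint used below
base = "http://stat.bora.dopa.go.th/stat/statnew/statTDD/datasource/showStatProvince.php"
r = requests.get(base, params={"statType": 1, "year": 60})
print(r.json().keys())  # returns json; e.g. an 'aaData' key, as with the district endpoint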
Take a look at the full example:
import requests
from bs4 import BeautifulSoup
import re
url = "http://stat.bora.dopa.go.th/stat/statnew/statTDD/datasource/showStatDistrict.php?statType=1&year=60&rcode=10"
r = requests.get(url)
# endpoint returns json
data = r.json()
aaData = data['aaData']
# this is the base url for viewing the details of each hit
view_url = "http://stat.bora.dopa.go.th/stat/statnew/statTDD/views/showZoneData.php"
# the first line in the dataset is actually html
# we convert this to plain text
html_header = aaData[0]
html_header_stripped = [BeautifulSoup(e, "html.parser").get_text() for e in html_header]
# and then insert html_header_stripped back into the aaData list
aaData[0] = html_header_stripped
for rcode, region_html, male, female, total, home in aaData:
    # first element is "<font color='red'><b>กรุงเทพมหานคร</b></font>"
    # second is "<a href=javascript:openWindow('?rcode=1001&statType=1&year=60')>ท้องถิ่นเขตพระนคร</a>"
    # we need to extract the link that opens in a new window to be able to iterate further
    soup = BeautifulSoup(region_html, "html.parser")
    region = soup.get_text()
    try:
        link_element = soup.find('a').get('href')
        rel_link = re.search(r"\('(.+)'\)", link_element).group(1)
        abs_link = view_url + rel_link
    except AttributeError:
        # no link in this row, or the href did not match the pattern
        abs_link = None
    print("{};{};{};{};{};{};{}".format(rcode, region, abs_link, male, female, total, home))
Here I'm just printing the results, but if you want to follow the links and fetch the data, you could store the results in a list of dicts and then iterate over it, or do it inside the for loop.
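Continuing from the snippet above, a minimal sketch of that approach (the dict keys are illustrative assumptions):
# collect the rows into a list of dicts instead of printing them
results = []
for rcode, region_html, male, female, total, home in aaData:
    soup = BeautifulSoup(region_html, "html.parser")
    try:
        rel_link = re.search(r"\('(.+)'\)", soup.find('a').get('href')).group(1)
        abs_link = view_url + rel_link
    except AttributeError:
        abs_link = None
    results.append({"rcode": rcode, "region": soup.get_text(), "link": abs_link,
                    "male": male, "female": female, "total": total, "home": home})

# later, follow each stored link and parse the detail page
for row in results:
    if row["link"]:
        detail = requests.get(row["link"])
        # parse detail.text with BeautifulSoup as needed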