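Can anyone suggest a way to scrape the data held inside a <script> tag? In this case it is the 30-minute table from AEMO (https://www.aemo.com.au/aemo/apps/visualisations/elec-nem-priceanddemand.html).
To get the data table, I need to click either the button that displays the table on the site or the download button. The obstacle is that when I try to scrape the page with Selenium, the text of the buttons and the table is hidden behind a <script> tag.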
Here is my code so far:
# import libraries
import urllib.request
from bs4 import BeautifulSoup
from selenium import webdriver
import time
import pandas as pd
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

url = "https://www.aemo.com.au/aemo/apps/visualisations/elec-nem-priceanddemand.html"
browser = webdriver.Safari(executable_path='/usr/bin/safaridriver')
browser.get(url)
try:
    # Dump the HTML as Selenium sees it
    print(browser.page_source)
except Exception:
    print("not found")
finally:
    # Always close the browser session
    browser.quit()
The result contains the <script> tag rather than the text of the buttons and the table.
Answer 0 (score: 0)
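Selenium has its own ways of locating elements, such as find_element_by_css_selector (written as find_element(By.CSS_SELECTOR, ...) in Selenium 4). The browser usually needs some time to render elements, so you may need to use WebDriverWait.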
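To make the waiting idea concrete, here is a minimal sketch of the polling that an explicit wait performs: call a condition repeatedly until it returns something truthy or a timeout expires. This is not Selenium's implementation; `wait_until`, `WaitTimeout`, and `element_text` are hypothetical names used only for illustration, and the "page" is a stand-in that becomes ready on the third check.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never becomes truthy (hypothetical name)."""


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises WaitTimeout once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} s")
        time.sleep(poll)


# Stand-in for a page element that only appears on the third check.
attempts = {"count": 0}


def element_text():
    attempts["count"] += 1
    return "rendered" if attempts["count"] >= 3 else None


result = wait_until(element_text, timeout=2.0, poll=0.01)
print(result, attempts["count"])
```

With a real browser, the condition would be a lookup of the element, and WebDriverWait handles the loop for you.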
Here is an example that extracts the spot price from the page:
from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

url = 'https://www.aemo.com.au/aemo/apps/visualisations/elec-nem-priceanddemand.html'
browser = webdriver.Chrome()
browser.get(url)

# CSS selector for the element that shows the spot price
sel = 'body > div > compose > div > compose.fill-height.flex-container.au-target > compose > div > div:nth-child(1) > div'

try:
    # Wait up to 10 seconds for the element to be present in the DOM
    element = WebDriverWait(browser, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, sel))
    )
    print(element.text)
finally:
    # Close the browser even if the wait times out
    browser.quit()
Result:
$92.02/MWh
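The text comes back as a string with the currency symbol and unit attached. If you need the value as a number, for example to load it into pandas, a small parser along these lines may help. This is only a sketch: `parse_spot_price` is a hypothetical helper, and the accepted formats (thousands separators, a minus sign before or after the `$`) are assumptions about how the page might render other values, not something confirmed from the site.

```python
import re

# Assumed format: optional minus, "$", digits with optional commas and
# decimals, then "/MWh". Only "$92.02/MWh" has actually been observed.
PRICE_PATTERN = re.compile(
    r"^\s*(-?)\$\s*(-?)([\d,]+(?:\.\d+)?)\s*/\s*MWh\s*$"
)


def parse_spot_price(text):
    """Convert text such as '$92.02/MWh' to a float in $/MWh."""
    match = PRICE_PATTERN.match(text)
    if match is None:
        raise ValueError(f"unrecognised price text: {text!r}")
    sign_before, sign_after, digits = match.groups()
    value = float(digits.replace(",", ""))
    return -value if (sign_before or sign_after) else value


price = parse_spot_price("$92.02/MWh")
print(price)  # 92.02
```

In the Selenium example, you would pass `element.text` to the function instead of the literal string.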