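I want to use a Python script to scrape the contents of a DIV that is created by a JavaScript function. I tried BS4 (BeautifulSoup), but it does not return the dynamic data; it only shows the page source.
Sample code: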
import requests
from bs4 import BeautifulSoup

URL = "https://rawgit.com/skysoft999/tableauJS/master/example.html"
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html5lib')

for row in soup.findAll('div', attrs = {'class':'quote'}):
    print(row)

print(soup.prettify())
The sample HTML source code is on Pastebin.
Sample data to extract:
Answer 0 (score: 1)
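The initial HTML does not contain the data you want to scrape, because the data is inserted by JavaScript after the page loads. That is why BeautifulSoup alone is not enough. You can use Selenium to load the page in a real browser and then scrape the rendered content.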
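To illustrate why a static parser sees no data, here is a minimal sketch. It uses only the standard-library `html.parser` so that it runs without extra packages; BeautifulSoup behaves the same way, since it also parses markup without executing scripts. The HTML string and the `DivTextCollector` class are made up for this illustration and are not taken from the page in the question.

```python
from html.parser import HTMLParser

# Hypothetical page: the div is empty in the markup and only filled by a script.
STATIC_HTML = """
<html><body>
<div id="dataTarget"></div>
<script>
  document.getElementById('dataTarget').innerHTML = '<div>loaded later</div>';
</script>
</body></html>
"""


class DivTextCollector(HTMLParser):
    """Collects the text found inside <div id="dataTarget">."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # nesting depth inside the target div
        self.chunks = []  # text fragments seen inside the target div

    def handle_starttag(self, tag, attrs):
        # Enter the target div, or track a div nested inside it.
        if tag == "div" and (self.depth or dict(attrs).get("id") == "dataTarget"):
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


collector = DivTextCollector()
collector.feed(STATIC_HTML)
div_text = "".join(collector.chunks).strip()

# The script body is treated as plain text and never executed,
# so the target div is still empty.
print(repr(div_text))
```

The div stays empty because nothing runs the script; a browser driven by Selenium does run it, which is why the rendered page source contains the data.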
Code:
import json
import pprint

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

html = None
url = 'http://demo-tableau.bitballoon.com/'
selector = '#dataTarget > div'
delay = 10  # seconds

browser = webdriver.Chrome()
browser.get(url)

try:
    # wait for the button to be enabled; until() returns the located element
    button = WebDriverWait(browser, delay).until(
        EC.element_to_be_clickable((By.ID, 'getData'))
    )
    button.click()

    # wait for the data to be loaded into the target div
    WebDriverWait(browser, delay).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, selector))
    )
except TimeoutException:
    print('Loading took too much time!')
else:
    # only reached when both waits succeeded
    html = browser.page_source
finally:
    browser.quit()

if html:
    # parse the rendered page and decode the JSON text inside the div
    soup = BeautifulSoup(html, 'lxml')
    raw_data = soup.select_one(selector).text
    data = json.loads(raw_data)
    pprint.pprint(data)
Output:
[[{'formattedValue': 'Atlantic', 'value': 'Atlantic'},
  {'formattedValue': '6/26/2010 3:00:00 AM', 'value': '2010-06-26 03:00:00'},
  {'formattedValue': 'ALEX', 'value': 'ALEX'},
  {'formattedValue': '16.70000', 'value': '16.7'},
  {'formattedValue': '-84.40000', 'value': '-84.4'},
  {'formattedValue': '30', 'value': '30'}],
...
]
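The code assumes that the button is initially disabled: <button id="getData" onclick="getUnderlyingData()" disabled>Get Data</button>
and that the data is not loaded automatically, but only as a result of clicking the button. You therefore need to remove this line from the page: setTimeout(function(){ getUnderlyingData(); }, 3000);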
You can find a working demo of the example here: http://demo-tableau.bitballoon.com/.
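Once the JSON is decoded, each row is a list of cells holding a `formattedValue` and a `value`. As a rough sketch of one way to reduce that to plain rows, the snippet below keeps only the `value` field. The JSON string is retyped from the first row of the output shown in the answer; the names `raw_data` and `rows` are chosen for illustration, and the source does not say what each column means, so no column names are assigned.

```python
import json

# First row of the answer's output, retyped as a JSON string for illustration.
raw_data = """
[[{"formattedValue": "Atlantic", "value": "Atlantic"},
  {"formattedValue": "6/26/2010 3:00:00 AM", "value": "2010-06-26 03:00:00"},
  {"formattedValue": "ALEX", "value": "ALEX"},
  {"formattedValue": "16.70000", "value": "16.7"},
  {"formattedValue": "-84.40000", "value": "-84.4"},
  {"formattedValue": "30", "value": "30"}]]
"""

data = json.loads(raw_data)

# Keep only the raw "value" of every cell; values stay strings, as in the source.
rows = [[cell["value"] for cell in row] for row in data]

for row in rows:
    print(row)
```

Numeric conversion is left out on purpose, since which columns are numeric is an assumption the reader would need to confirm against the real data.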