Scraping dynamic content created by JavaScript using Python

Time: 2018-04-20 10:04:45

Tags: python arrays python-3.x web-scraping beautifulsoup

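I want to use a Python script to scrape the contents of a DIV that is created by a JavaScript function. I tried BS4, but it does not give me the dynamic data. It only shows the page's source code.

Sample code: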

import requests
from bs4 import BeautifulSoup

URL = "https://rawgit.com/skysoft999/tableauJS/master/example.html"
r = requests.get(URL)

soup = BeautifulSoup(r.content, 'html5lib')


for row in soup.findAll('div', attrs = {'class':'quote'}):
    print(row)


print(soup.prettify())

The sample HTML source code is on Pastebin.

Sample data to extract: [image]

1 answer:

Answer 0 (score: 1)

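The initial HTML does not contain the data you want to scrape, which is why BeautifulSoup alone is not enough: requests only downloads the HTML the server sends and never runs the page's JavaScript. You can load the page with Selenium, let the script run, and then scrape the content.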
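
To illustrate why a static parse falls short, here is a minimal sketch that uses only the standard library's html.parser as a stand-in for BeautifulSoup. The STATIC_HTML page and the helper names are hypothetical, made up for this illustration. The div with class "quote" appears only inside the script text, so a parser that does not execute JavaScript never sees it as an element.

```python
from html.parser import HTMLParser

# Hypothetical page (an assumption for illustration): the "quote" div would
# only exist after a browser runs the script below.
STATIC_HTML = """
<html><body>
  <div id="dataTarget"></div>
  <script>
    var d = document.createElement('div');
    d.className = 'quote';
    document.getElementById('dataTarget').appendChild(d);
  </script>
</body></html>
"""


class DivCollector(HTMLParser):
    """Records the class attribute of every <div> start tag in the markup."""

    def __init__(self):
        super().__init__()
        self.div_classes = []

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            # attrs is a list of (name, value) pairs; a missing class gives None
            self.div_classes.append(dict(attrs).get('class'))


def static_div_classes(html):
    """Return the class of each <div> found without executing any script."""
    parser = DivCollector()
    parser.feed(html)
    parser.close()
    return parser.div_classes


classes = static_div_classes(STATIC_HTML)
print(classes)             # only the empty placeholder div is found
print('quote' in classes)  # the script-created div is not in the markup
```

The same thing happens with requests plus BeautifulSoup: the script is delivered as text, not executed, so a browser (driven by Selenium here) is needed to produce the final DOM.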

Code:

import json

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

html = None
url = 'http://demo-tableau.bitballoon.com/'
selector = '#dataTarget > div'
delay = 10  # seconds

browser = webdriver.Chrome()
browser.get(url)

try:
    # wait for button to be enabled
    WebDriverWait(browser, delay).until(
        EC.element_to_be_clickable((By.ID, 'getData'))
    )
    # find_element_by_id was removed in Selenium 4; use find_element with By.ID
    button = browser.find_element(By.ID, 'getData')
    button.click()

    # wait for data to be loaded
    WebDriverWait(browser, delay).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, selector))
    )
except TimeoutException:
    print('Loading took too much time!')
else:
    html = browser.page_source
finally:
    browser.quit()

if html:
    soup = BeautifulSoup(html, 'lxml')
    raw_data = soup.select_one(selector).text
    data = json.loads(raw_data)

    import pprint
    pprint.pprint(data)

Output:

[[{'formattedValue': 'Atlantic', 'value': 'Atlantic'},
  {'formattedValue': '6/26/2010 3:00:00 AM', 'value': '2010-06-26 03:00:00'},
  {'formattedValue': 'ALEX', 'value': 'ALEX'},
  {'formattedValue': '16.70000', 'value': '16.7'},
  {'formattedValue': '-84.40000', 'value': '-84.4'},
  {'formattedValue': '30', 'value': '30'}],
  ...
]
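
The output above is a list of rows, where each row is a list of cells carrying a 'value' and a 'formattedValue'. As a rough sketch of how one might flatten it for further processing, the hypothetical helper rows_to_values below (not part of the original answer) keeps one key per cell. The sample row is copied from the output above. Column names are not given in the output, so none are assumed here.

```python
def rows_to_values(data, key='value'):
    """Return one list per row, holding the chosen key from each cell."""
    return [[cell[key] for cell in row] for row in data]


# First row of the output shown above
sample = [[
    {'formattedValue': 'Atlantic', 'value': 'Atlantic'},
    {'formattedValue': '6/26/2010 3:00:00 AM', 'value': '2010-06-26 03:00:00'},
    {'formattedValue': 'ALEX', 'value': 'ALEX'},
    {'formattedValue': '16.70000', 'value': '16.7'},
    {'formattedValue': '-84.40000', 'value': '-84.4'},
    {'formattedValue': '30', 'value': '30'},
]]

for row in rows_to_values(sample):
    print(row)
```

Passing key='formattedValue' would return the display strings instead of the raw values.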

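The code assumes that the button is initially disabled: <button id="getData" onclick="getUnderlyingData()" disabled>Get Data</button> and that the data is not loaded automatically but only after the button is clicked. You therefore need to remove this line from the page: setTimeout(function(){ getUnderlyingData(); }, 3000);

You can find a working demo of the example here: http://demo-tableau.bitballoon.com/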