I am using lxml to get the values of the html table headers, but when I try to use xpath to parse the contents of the td elements inside the tr rows, it gives me empty values, because the data is generated dynamically. Below is my python code and its output. How can I get those values?
<table id="datatabl" class="display compact cell-border dataTable no-footer" role="grid" aria-describedby="datatabl_info">
<thead>
<tr role="row">
<th class="dweek sorting_desc" tabindex="0" aria-controls="datatabl" rowspan="1" colspan="1" style="width: 106px;" aria-label="Week: activate to sort column ascending" aria-sort="descending">Week</th>
<th class="dnone sorting" tabindex="0" aria-controls="datatabl" rowspan="1" colspan="1" style="width: 100px;" aria-label="None: activate to sort column ascending">None</th>
</tr>
</thead>
<tbody>
<tr class="odd" role="row">
<td class="sorting_1">2016-05-03</td>
<td>4.27</td>
<td>21.04</td>
</tr>
<tr class="even" role="row">
<td class="sorting_1">2016-04-26</td>
<td>4.24</td>
<td>95.76</td>
<td>21.04</td>
</tr>
</tbody>
</table>
My Python code:
from lxml import etree
import urllib
web = urllib.urlopen("http://droughtmonitor.unl.edu/MapsAndData/DataTables.aspx")
s = web.read()
html = etree.HTML(s)
## Get all 'tr'
tr_nodes = html.xpath('//table[@id="datatabl"]/thead')
print tr_nodes
## 'th' is inside first 'tr'
header = [i[0].text for i in tr_nodes[0].xpath("tr")]
print header
## tbody
tr_nodes_content = html.xpath('//table[@id="datatabl"]/tbody')
print tr_nodes_content
td_content = [[td[0].text for td in tr.xpath('td')] for tr in tr_nodes_content[0]]
print td_content
Terminal output:
[<Element thead at 0xb6b250ac>]
['Week']
[<Element tbody at 0xb6ad20cc>]
[]
Answer 0 (score: 2)
The data is loaded dynamically from the http://droughtmonitor.unl.edu/Ajax.aspx/ReturnTabularDM endpoint. One option is to mimic that request and get the data from the JSON response. Alternatively, you can stay high-level and solve it with selenium:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.maximize_window()
wait = WebDriverWait(driver, 10)
url = 'http://droughtmonitor.unl.edu/MapsAndData/DataTables.aspx'
driver.get(url)
# wait for the table to load
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table#datatabl tr[role=row]")))
rows = driver.find_elements_by_css_selector("table#datatabl tr[role=row]")[1:]
for row in rows:
    cells = row.find_elements_by_tag_name("td")
    print(cells[2].text)
driver.close()
This prints the contents of the D0 column:
33.89
39.64
39.28
39.20
...
36.74
38.45
43.61
Answer 1 (score: 2)
This will get the data from the ajax request in json format:
import json
import requests
from pprint import pprint as pp

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36',
    'Content-Type': 'application/json',
    'Referer': 'http://droughtmonitor.unl.edu/MapsAndData/DataTables.aspx',
    'X-Requested-With': 'XMLHttpRequest',
}

data = json.dumps({'area': 'conus', 'type': 'conus', 'statstype': '1'})
ajax = requests.post("http://droughtmonitor.unl.edu/Ajax.aspx/ReturnTabularDM",
                     data=data,
                     headers=headers)
pp(ajax.json())
Output snippet:
{u'd': [{u'D0': 33.89,
u'D1': 14.56,
u'D2': 5.46,
u'D3': 3.44,
u'D4': 1.11,
u'Date': u'2016-05-03',
u'FileDate': u'20160503',
u'None': 66.11,
u'ReleaseID': 890,
u'__type': u'DroughtMonitorData.DmData'},
{u'D0': 39.64,
u'D1': 15.38,
u'D2': 5.89,
u'D3': 3.44,
u'D4': 1.11,
u'Date': u'2016-04-26',
u'FileDate': u'20160426',
u'None': 60.36,
u'ReleaseID': 889,
u'__type': u'DroughtMonitorData.DmData'},
{u'D0': 39.28,
u'D1': 15.44,
u'D2': 5.94,
u'D3': 3.44,
u'D4': 1.11,
u'Date': u'2016-04-19',
u'FileDate': u'20160419',
u'None': 60.72,
u'ReleaseID': 888,
u'__type': u'DroughtMonitorData.DmData'},
{u'D0': 39.2,
u'D1': 17.75,
u'D2': 6.1,
u'D3': 3.76,
u'D4': 1.71,
u'Date': u'2016-04-12',
u'FileDate': u'20160412',
u'None': 60.8,
u'ReleaseID': 887,
u'__type': u'DroughtMonitorData.DmData'},
{u'D0': 37.86,
u'D1': 16.71,
u'D2': 5.95,
u'D3': 3.76,
u'D4': 1.71,
u'Date': u'2016-04-05',
u'FileDate': u'20160405',
u'None': 62.14,
u'ReleaseID': 886,
u'__type': u'DroughtMonitorData.DmData'},
You can get all the data you need from the returned json. If you print(len(ajax.json()["d"])) you will see that 853 rows are returned, so you can actually get all 35 pages of data in a single request. Even if you did parse the page, you would still have to do it another 34 more times; the json from the ajax request makes everything easy to parse.
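As a rough sketch (not part of the original answer), the records in the "d" list can be flattened into plain table rows like the ones shown on the page. The field names below are taken from the response snippet above; in real use you would pass ajax.json()["d"] instead of the hard-coded sample:

```python
# Flatten the JSON records from the "d" list into table rows.
# Column names come from the response snippet shown above.
COLUMNS = ["Date", "None", "D0", "D1", "D2", "D3", "D4"]

def to_rows(records):
    """Turn the list of dicts (ajax.json()["d"]) into lists of cell values."""
    return [[rec[col] for col in COLUMNS] for rec in records]

# Two sample records copied from the response snippet above
sample = [
    {"D0": 33.89, "D1": 14.56, "D2": 5.46, "D3": 3.44, "D4": 1.11,
     "Date": "2016-05-03", "FileDate": "20160503", "None": 66.11,
     "ReleaseID": 890},
    {"D0": 39.64, "D1": 15.38, "D2": 5.89, "D3": 3.44, "D4": 1.11,
     "Date": "2016-04-26", "FileDate": "20160426", "None": 60.36,
     "ReleaseID": 889},
]

rows = to_rows(sample)
print(rows[0])  # ['2016-05-03', 66.11, 33.89, 14.56, 5.46, 3.44, 1.11]
```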
To filter by state, we need to set type to state and area to CA:
data = json.dumps({'type': 'state', 'statstype': '1', 'area': 'CA'})
ajax = requests.post("http://droughtmonitor.unl.edu/Ajax.aspx/ReturnTabularDM",
                     data=data,
                     headers=headers)
pp(ajax.json())
And one more short snippet:
{u'd': [{u'D0': 95.73,
u'D1': 89.68,
u'D2': 74.37,
u'D3': 49.15,
u'D4': 21.04,
u'Date': u'2016-05-03',
u'FileDate': u'20160503',
u'None': 4.27,
u'ReleaseID': 890,
u'__type': u'DroughtMonitorData.DmData'},
{u'D0': 95.76,
u'D1': 90.09,
u'D2': 74.37,
u'D3': 49.15,
u'D4': 21.04,
u'Date': u'2016-04-26',
u'FileDate': u'20160426',
u'None': 4.24,
u'ReleaseID': 889,
u'__type': u'DroughtMonitorData.DmData'},
You will see that it matches what is displayed on the page.