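I am trying to get a list of teams and scores from this page, http://stats.rleague.com/rl/seas/2014.html, purely as a learning exercise.
My imports and the page request work, but I am not getting the expected results.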
In [1]: from lxml import html
In [2]: import requests
In [3]: page = requests.get('http://stats.rleague.com/rl/seas/2014.html')
In [4]: tree = html.fromstring(page.text)
This is the HTML for the title:
<html><title>Rugby League Tables / Season 2014</title>
And for the teams:
<tr><td width=20%><a href="../teams/souths/souths_idx.html">Souths</a></td><td width=12%>4t 6g </td><td width=5%> 28</td><td><b>Date:</b>Thu 06-Mar-2014 <b>Venue:</b><a href="../venues/stadium_australia.html">Stadium Australia</a> <b>Crowd:</b>27,282</td></tr>
<tr><td width=20%><a href="../teams/easts/easts_idx.html">Sydney Roosters</a></td><td width=12%>1t 2g </td><td width=5%> 8</td><td><b>Souths</b> won by <b> 20 pts</b>
But I get empty lists. What am I doing wrong?
In [6]: print(tree)
<Element html at 0x7f518067fc78>
In [7]: titles = tree.xpath('//html[@title]/text()')
In [8]: print(titles)
[]
In [11]: teams = tree.xpath('//tr/td[@href]/text()')
In [12]: print(teams)
[]
Answer 0 (score: 1)
Changing the XPath expressions will give you the results you want:
# `title` is an element (tag), not an attribute, so select it by name rather than with `[@title]`.
titles = tree.xpath('.//title/text()')
print(titles)
# `href` is an attribute of the `a` tag nested inside `td`, not of `td` itself.
teams = tree.xpath('//tr/td/a[@href]/text()')
print(teams)
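Since the question asks for teams together with their scores, here is a minimal sketch of one way to pair them up row by row. It runs against the two sample rows quoted in the question rather than the live page, so it needs no network access. The names `SAMPLE` and `extract_scores` are made up for illustration, and the column positions (team link in the first `td`, score in the third) are an assumption based only on that snippet; the real page may contain rows with a different layout.

```python
from lxml import html

# Sample markup copied from the question, wrapped in a table so it parses as rows.
# The closing tags on the second row are added here only to complete the sample.
SAMPLE = """
<html><title>Rugby League Tables / Season 2014</title>
<body><table>
<tr><td width=20%><a href="../teams/souths/souths_idx.html">Souths</a></td><td width=12%>4t 6g </td><td width=5%> 28</td><td><b>Date:</b>Thu 06-Mar-2014 <b>Venue:</b><a href="../venues/stadium_australia.html">Stadium Australia</a> <b>Crowd:</b>27,282</td></tr>
<tr><td width=20%><a href="../teams/easts/easts_idx.html">Sydney Roosters</a></td><td width=12%>1t 2g </td><td width=5%> 8</td><td><b>Souths</b> won by <b> 20 pts</b></td></tr>
</table></body></html>
"""


def extract_scores(markup):
    """Return a list of (team, score) tuples, one per table row that has both."""
    tree = html.fromstring(markup)
    results = []
    for row in tree.xpath('//tr'):
        # Assumed layout: team name is the link text in the first cell,
        # total score is the text of the third cell.
        names = row.xpath('./td[1]/a/text()')
        scores = row.xpath('./td[3]/text()')
        if names and scores:
            results.append((names[0].strip(), int(scores[0].strip())))
    return results


if __name__ == '__main__':
    print(extract_scores(SAMPLE))
```

Querying relative to each `tr` (note the leading `./`) keeps a team and its score together, whereas two separate document-wide queries would return two flat lists that you would have to zip back together yourself.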