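This may be a silly question, but I am trying to write a simple scraper to pull the listings from this site: https://online.ncat.nsw.gov.au/Hearing/HearingList.aspx?LocationCode=2000
Well, it will eventually run against every LocationCode, but this is a sample page.
I want to extract the <span> header and the table data for each date.
The data generally takes this form: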
<span id="lblSubHeader1242017" class="clsGridItem">1:15 PM Wednesday, 12 Apr 2017 at Room 15.6 Level 15, 66 Goulburn st </span>
<hr />
<table id="dg1242017">
<tr class="clsGridItem">
<td width="15%">RT 17/11111</td>
<td width="30%">Name of party</td>
<td width="55%">Name of party</td>
</tr>
...
</table>
It's crude, but I can easily get the table data with code like this:
import requests
from lxml import html

page = requests.get('https://online.ncat.nsw.gov.au/Hearing/HearingList.aspx?LocationCode=2000')
tree = html.fromstring(page.content)
events = tree.xpath('//table//td/text()')
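The flat list above loses the row boundaries. A minimal sketch that keeps the cells of each row together, assuming the live markup matches the sample shown earlier; the inline `sample` string and the `rows` name are only for illustration, and the real page may be structured differently:

```python
from lxml import html

# Sample markup copied from the snippet in the question; the live page may differ.
sample = """
<div>
<span id="lblSubHeader1242017" class="clsGridItem">1:15 PM Wednesday, 12 Apr 2017 at Room 15.6 Level 15, 66 Goulburn st </span>
<hr />
<table id="dg1242017">
<tr class="clsGridItem">
<td width="15%">RT 17/11111</td>
<td width="30%">Name of party</td>
<td width="55%">Name of party</td>
</tr>
</table>
</div>
"""

tree = html.fromstring(sample)

# Build one list of cell strings per table row instead of one flat list.
rows = []
for tr in tree.xpath('//table//tr'):
    cells = [td.text_content().strip() for td in tr.xpath('./td')]
    rows.append(cells)

print(rows)
```

Each entry in `rows` then corresponds to one hearing, which is easier to pair with a header later than a flat list of cell strings.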
But when I try to grab the spans outside the tables, which carry the location and date information, for example:
days = tree.xpath('//span[starts-with(@id,"lbl")]/text()')
or
days = tree.xpath('//span[@class="clsGridItem"]/text()')
I only get the following two results:
days: ['There are no matters listed in SYDNEY today', 'There are no matters listed in SYDNEY today']
These correspond to two spans about two-thirds of the way down the page:
<span id="lbl1442017" style="font-weight:bold;">SYDNEY: Friday, 14 Apr 2017</span><br /><br /><span id="lblError1442017" class="clsGridItem">There are no matters listed in SYDNEY today</span><br /><br /><br /><span id="lbl1742017" style="font-weight:bold;">SYDNEY: Monday, 17 Apr 2017</span><br /><br /><span id="lblError1742017" class="clsGridItem">There are no matters listed in SYDNEY today</span>
Can someone explain what I am doing wrong?
Why are the other spans skipped?
Answer 0 (score: 1)
You can get all of the text content of each <span class="clsGridItem"> with the following code:
days = tree.xpath('//span[@class="clsGridItem"]//text()')
But I don't know why //span[@class="clsGridItem"]/text() doesn't work; it should be applicable as well...
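One way the two expressions can differ, offered as an assumption and not verified against the live page: `/text()` selects only text nodes that are direct children of the span, while `//text()` also selects text inside nested elements. The sketch below uses hypothetical markup (the `snippet` string, the ids `lblA` and `lblB`, and the `<b>` wrapper are invented for illustration) to show the difference:

```python
from lxml import html

# Hypothetical markup: the first span wraps its text in a child element,
# the second span holds its text directly.
snippet = """
<div>
<span id="lblA" class="clsGridItem"><b>1:15 PM Wednesday, 12 Apr 2017</b></span>
<span id="lblB" class="clsGridItem">There are no matters listed in SYDNEY today</span>
</div>
"""

tree = html.fromstring(snippet)

# Direct child text nodes only: the text inside <b> is not selected.
direct = tree.xpath('//span[@class="clsGridItem"]/text()')

# All descendant text nodes: the text inside <b> is selected too.
nested = tree.xpath('//span[@class="clsGridItem"]//text()')

# One string per span, regardless of nesting.
whole = [s.text_content() for s in tree.xpath('//span[@class="clsGridItem"]')]

print(direct)
print(nested)
print(whole)
```

If the real page nests the header text like this, that would explain why only the "no matters listed" spans come back with `/text()`. Inspecting the raw `page.content` would confirm or rule it out.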