xpath: need robust code for scraping with Python lxml

Time: 2017-02-02 08:27:10

Tags: python xpath lxml

I want to scrape only the Code and Name columns of the table in the HTML below:

<div id="ctl00_cph1_divSymbols" class="cb"><table class="quotes">
<TR><TH>Code</TH><TH>Name</TH><TH style="text-align:right;">High</TH><TH style="text-align:right;">Low</TH><TH style="text-align:right;">Close</TH><TH style="text-align:right;">Volume</TH><TH style="text-align:center;" colspan=3>Change</TH><th width=40>&nbsp;</th></tr>
<tr class="ro" onclick="location.href='/stockquote/SGX/Z25.htm';" style="color:green;"><td><A href="/stockquote/SGX/Z25.htm" title="Display Quote &amp; Chart for SGX,Z25">Z25</A></td><td>Yanlord Land Group Limited</td><td align=right>1.400</td><td align=right>1.380</td><td align=right>1.385</td><td align=right>1,244,200</td><td align="right">0.005</td><td align="center"><IMG src="/images/up.gif"></td><td align="left">0.36</td><td align="right"><a href="/stockquote/SGX/Z25.htm" title="Download Data for SGX,Z25"><img src="/images/dl.gif" width=14 height=14></a>&nbsp;<a href="/stockquote/SGX/Z25.htm" title="View Quote and Chart for SGX,Z25"><img src="/images/chart.gif" width=14 height=14></a></td></tr>
<tr class="re" onclick="location.href='/stockquote/SGX/Z59.htm';" style="color:green;"><td><A href="/stockquote/SGX/Z59.htm" title="Display Quote &amp; Chart for SGX,Z59">Z59</A></td><td>Yoma Strategic Holdings Ltd</td><td align=right>0.5850</td><td align=right>0.5750</td><td align=right>0.5850</td><td align=right>2,312,600</td><td align="right">0.0100</td><td align="center"><IMG src="/images/up.gif"></td><td align="left">1.74</td><td align="right"><a href="/stockquote/SGX/Z59.htm" title="Download Data for SGX,Z59"><img src="/images/dl.gif" width=14 height=14></a>&nbsp;<a href="/stockquote/SGX/Z59.htm" title="View Quote and Chart for SGX,Z59"><img src="/images/chart.gif" width=14 height=14></a></td></tr>
<tr class="ro" onclick="location.href='/stockquote/SGX/Z74.htm';" style="color:green;"><td><A href="/stockquote/SGX/Z74.htm" title="Display Quote &amp; Chart for SGX,Z74">Z74</A></td><td>Singtel</td><td align=right>3.930</td><td align=right>3.860</td><td align=right>3.910</td><td align=right>21,674,300</td><td align="right">0.040</td><td align="center"><IMG src="/images/up.gif"></td><td align="left">1.03</td><td align="right"><a href="/stockquote/SGX/Z74.htm" title="Download Data for SGX,Z74"><img src="/images/dl.gif" width=14 height=14></a>&nbsp;<a href="/stockquote/SGX/Z74.htm" title="View Quote and Chart for SGX,Z74"><img src="/images/chart.gif" width=14 height=14></a></td></tr>
<tr class="re" onclick="location.href='/stockquote/SGX/Z77.htm';" style="color:green;"><td><A href="/stockquote/SGX/Z77.htm" title="Display Quote &amp; Chart for SGX,Z77">Z77</A></td><td>Singtel 10</td><td align=right>3.920</td><td align=right>3.860</td><td align=right>3.900</td><td align=right>69,460</td><td align="right">0.050</td><td align="center"><IMG src="/images/up.gif"></td><td align="left">1.30</td><td align="right"><a href="/stockquote/SGX/Z77.htm" title="Download Data for SGX,Z77"><img src="/images/dl.gif" width=14 height=14></a>&nbsp;<a href="/stockquote/SGX/Z77.htm" title="View Quote and Chart for SGX,Z77"><img src="/images/chart.gif" width=14 height=14></a></td></tr>
</table>
</div>

The desired output is:

Z25,Yanlord Land Group Limited
Z59,Yoma Strategic Holdings Ltd
Z74,Singtel
Z77,Singtel 10

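In my Python code, tree1 correctly gives me the codes, but tree2 returns the names mixed with a lot of unwanted data. How can I write robust code that produces the desired output?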

1 Answer:

Answer 0: (score: 0)

You can use td[2] to select the second td element of each row:

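from lxml import html
import requests

# Fetch the listing page and parse it into an element tree.
page = requests.get('http://eoddata.com/stocklist/SGX/Z.htm')
tree = html.fromstring(page.content)

# Codes: the link text of each stock-quote anchor.
tree1 = tree.xpath('//td/a[contains(@href,"/stockquote/SGX")]/text()')
# Commented-out variant: following-sibling::td selects every cell after the first, not just the name.
# tree2 = tree.xpath('//tr[@class]/td/following-sibling::td/text()')
# Names: the second cell of each row that has both class and onclick attributes.
tree2 = tree.xpath('//tr[@class and @onclick]/td[2]/text()')

print(tree1, tree2)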

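Note that the predicate [@class and @onclick] is used to target the rows of the table we need and to skip the table at the bottom right of the page.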
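To see what the [@class and @onclick] predicate does without fetching the live page, here is a minimal offline sketch. The SAMPLE markup is a trimmed-down, hypothetical stand-in (the second table is invented purely to represent an unrelated table whose rows have a class but no onclick), and extract_pairs is an illustrative helper name, not part of lxml. It also reads the code and the name from the same row, so the two values cannot drift out of step.

```python
from lxml import html

# Hypothetical sample: the first table mimics the quote rows from the
# question; the second stands in for an unrelated table whose rows carry
# a class attribute but no onclick handler.
SAMPLE = """
<div>
  <table class="quotes">
    <tr><th>Code</th><th>Name</th></tr>
    <tr class="ro" onclick="location.href='/stockquote/SGX/Z25.htm';">
      <td><a href="/stockquote/SGX/Z25.htm">Z25</a></td>
      <td>Yanlord Land Group Limited</td>
    </tr>
    <tr class="re" onclick="location.href='/stockquote/SGX/Z74.htm';">
      <td><a href="/stockquote/SGX/Z74.htm">Z74</a></td>
      <td>Singtel</td>
    </tr>
  </table>
  <table>
    <tr class="ro"><td>Other</td><td>Unrelated cell</td></tr>
  </table>
</div>
"""


def extract_pairs(markup):
    """Return one (code, name) tuple per row that has class and onclick."""
    tree = html.fromstring(markup)
    pairs = []
    for row in tree.xpath('//tr[@class and @onclick]'):
        # string(...) collects all text inside the cell, including link text.
        code = row.xpath('string(td[1])').strip()
        name = row.xpath('string(td[2])').strip()
        pairs.append((code, name))
    return pairs


pairs = extract_pairs(SAMPLE)
for code, name in pairs:
    print(code + ',' + name)
```

With the predicate relaxed to [@class], the row from the second table in this sample would be matched as well, which is the kind of unwanted data the predicate is meant to exclude.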

Result:

['Z25', 'Z59', 'Z74', 'Z77'] ['Yanlord Land Group Limited', 'Yoma Strategic Holdings Ltd', 'Singtel', 'Singtel 10']
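
The result above is two separate lists, while the question asks for one "code,name" line per stock. A minimal sketch of pairing them follows, assuming both lists have the same length and order. The to_lines helper is a hypothetical name introduced here, and the lists are copied from the result above rather than fetched.

```python
# Values copied from the printed result above.
codes = ['Z25', 'Z59', 'Z74', 'Z77']
names = ['Yanlord Land Group Limited', 'Yoma Strategic Holdings Ltd',
         'Singtel', 'Singtel 10']


def to_lines(codes, names):
    """Join the parallel lists into 'code,name' strings."""
    # Guard against the two XPath queries matching different numbers of nodes.
    if len(codes) != len(names):
        raise ValueError('codes and names differ in length')
    return [code + ',' + name for code, name in zip(codes, names)]


lines = to_lines(codes, names)
print('\n'.join(lines))
```

The length check matters because the two lists come from independent queries. If one query ever matched an extra node, zip alone would silently pair the wrong values.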