Parsing HTML with lxml and XPath

Asked: 2012-08-28 18:40:43

Tags: python xpath html-parsing lxml

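I am trying to use lxml with Python because, after reading around and googling, the recommendation is to use lxml rather than other parsing packages. I have the following DOM structure, and I managed to write the correct XPath; I double-checked it with XPath Checker to confirm that it is valid. The XPath works fine in XPath Checker, but when I use it with lxml in Python I do not get the result; in fact, what I get is an object rather than the actual text.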

Here is my DOM structure:

<div class="pdsc-l">
<table width="100%" cellspacing="0" cellpadding="0" border="0">
<tbody>
<tr>
<tr>
<tr>
<tr>
<tr>
<tr>
<td width="35%" valign="top">
<font size="2" face="Arial, Helvetica, sans-serif">Brand</font>
</td>
<td width="65%" valign="top">
<font size="2" face="Arial, Helvetica, sans-serif">HTC</font>
</td>
</tr>
<tr>
<td width="35%" valign="top">
<td width="65%" valign="top">

The XPath I wrote gives me what I want:

//td//font[text()='Brand']/following::td[1]

But using lxml I get the result shown further down.

This is my code:
    rawPage = urllib2.urlopen(request)
    read = rawPage.read()
    #print read
    tree = etree.HTML(read)    
    for tr in tree.xpath("//tr"):
        print tr.xpath("//td//font[text()='Brand']/following::td[1]")

Here is the output:

[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]
[<Element td at 0x10ad80b90>]

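I tried the following change but still do not get the result. The code below includes the URL, which I hope will help in getting a better answer: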

    import urllib2
    from lxml import etree
    from lxml.html import fromstring, tostring
    url = 'http://www.ebay.com/ctg/111176858'
    request = urllib2.Request(url)
    rawPage = urllib2.urlopen(request)
    read = rawPage.read()
    #print read
    tree = etree.HTML(read)    
    for tr in tree.xpath("//tr"):
        t = tr.xpath("//td//font[text()='Brand']/following::td[1]")[0]
        print tostring(t)

1 Answer:

Answer 0 (score: 8)

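Appending [0].text to the end of the expression in your print statement should give you what you want. Basically, what your question prints is a one-element list of lxml.etree._Element objects, which have attributes such as tag and text that you can use to read different properties. So, try: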

tr.xpath("//td//font[text()='Brand']/following::td[1]")[0].text
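
As a rough, self-contained sketch of the same idea, the snippet below runs the XPath from the question against an inline copy of the table instead of the live page. SAMPLE_HTML and the variable names are made up for illustration, and the markup is assumed to match the fragment in the question. One caveat under that assumption: the value sits inside a nested <font>, so .text on the <td> itself may contain only whitespace. Joining itertext() collects the text of the cell and its descendants instead.

```python
from lxml import etree

# Hypothetical stand-in for the downloaded page: a trimmed copy of the
# table fragment shown in the question.
SAMPLE_HTML = """
<div class="pdsc-l">
<table width="100%" cellspacing="0" cellpadding="0" border="0">
<tbody>
<tr>
<td width="35%" valign="top">
<font size="2" face="Arial, Helvetica, sans-serif">Brand</font>
</td>
<td width="65%" valign="top">
<font size="2" face="Arial, Helvetica, sans-serif">HTC</font>
</td>
</tr>
</tbody>
</table>
</div>
"""

tree = etree.HTML(SAMPLE_HTML)

# Query the whole tree once. An absolute "//..." path ignores the context
# node, so running it inside a loop over every <tr> repeats the same match.
cells = tree.xpath("//td//font[text()='Brand']/following::td[1]")

# xpath() returns a list of elements, so index into it before reading text.
cell = cells[0]

# itertext() yields the text of the cell and of its descendants (the <font>).
brand = "".join(cell.itertext()).strip()
print(brand)
```

If the real page wraps the value differently, the same pattern should still apply: pick the element out of the list first, then decide whether .text or the joined itertext() is the right way to read it.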