用python解析xpath

时间:2015-02-21 02:20:28

标签: python xpath lxml lxml.html

我正在尝试解析包含此内容的网页:

<table style="width: 100%; border-top: 1px solid black; border-bottom: 1px solid black;">
<tr>
 <td colspan="2"
     style="border-top: 1px solid black; border-bottom: 1px solid black; background-color: #f0ffd3;">February 20, 2015</td>
</tr>
<tr>
 <td style="border-top: 1px solid gray; font-weight: bold;">9:00 PM</td>
 <td style="border-top: 1px solid gray; font-weight: bold">14°F</td>
</tr>
<tr>
 <td style="border-bottom: 1px solid gray;">Clear<br />
  Precip:
  0 %<br />
                                Wind:
                    from the WSW at 6 mph
 </td>
 <td style="border-bottom: 1px solid gray;"><img class="wxicon" src="http://i.imwx.com/web/common/wxicons/31/31.gif"
       style="border: 0px; padding: 0px 3px" /></td>
</tr>
<tr>
 <td style="border-top: 1px solid gray; font-weight: bold;">10:00 PM</td>
 <td style="border-top: 1px solid gray; font-weight: bold">13°F</td>
</tr>
<tr>
 <td style="border-bottom: 1px solid gray;">Clear<br />
  Precip:
  0 %<br />
                                Wind:
                    from the WSW at 6 mph
 </td>
 <td style="border-bottom: 1px solid gray;"><img class="wxicon" src="http://i.imwx.com/web/common/wxicons/31/31.gif"
       style="border: 0px; padding: 0px 3px" /></td>
</tr>

(它继续更多行,以[/ table]

结束
tree = html.fromstring(page)
table = tree.xpath('//table/tr')
for item in table:
    for elem in item.xpath('*'):
        if 'colspan' in html.tostring(elem):
                print '*', elem.text
        elif elem.text is not None:
            print elem.text,
        else:
            print 
有点工作。它没有得到[br /]之后的文本,而且它远非优雅。如何获取遗失的文字?此外,任何改进代码的建议都将受到赞赏。

1 个答案:

答案 0 :(得分:2)

如何使用.text_content()

  

.text_content():返回元素的文本内容,包括文本内容   它的孩子,没有标记。

table = tree.xpath('//table/tr')
for item in table:
    print ' '.join(item.text_content().split())

join() + split()此处有助于用一个空格替换多个空格。

打印:

February 20, 2015
9:00 PM 14°F
Clear Precip: 0 % Wind: from the WSW at 6 mph
10:00 PM 13°F
Clear Precip: 0 % Wind: from the WSW at 6 mph

由于您希望将时间线与沉淀线合并,您可以迭代tr个标签,但跳过文本中包含Precip的标签。对于每个时间线,获得以下tr兄弟来获得沉淀线:

table = tree.xpath('//table/tr[not(contains(., "Precip"))]')
for item in table:
    text = ' '.join(item.text_content().split())
    if 'AM' in text or 'PM' in text:
        text += ' ' + ' '.join(item.xpath('following-sibling::tr')[0].text_content().split())

    print text

打印:

February 20, 2015
9:00 PM 14°F Clear Precip: 0 % Wind: from the WSW at 6 mph
10:00 PM 13°F Clear Precip: 0 % Wind: from the WSW at 6 mph