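Getting results from an HTML table using lxml

Time: 2013-05-06 16:20:01

Tags: python lxml

Using the following code, how can I parse the results in the HTML table? An example of the HTML is shown further down.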

import sys
from io import BytesIO

import requests
from lxml import etree


def http_request():
    url = "http://somehost/somehtml.html"
    try:
        r = requests.get(url, auth=("theUser", "thepass"))
        # requests only raises HTTPError when the status code is checked explicitly
        r.raise_for_status()
    except requests.HTTPError as e:
        sys.exit(1)
    # r.content is the undecoded response body (bytes)
    parse_result(r.content)


def parse_result(result):
    # The response is treated as ISO-8859-1, so tell the parser how to decode the bytes
    parser = etree.HTMLParser(encoding='ISO-8859-1')
    tree = etree.parse(BytesIO(result), parser)

    # Here should be the logic to parse the html result :)


if __name__ == '__main__':
    http_request()

Here is the HTML:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta name="generator" content=
  "HTML Tidy for Linux/x86 (vers 25 March 2009), see www.w3.org" />

  <title></title>
</head>

<body>
  <table border="1">
    <tr>
    <td valign="top"><B>name</B></td>
    <td>result name a</td>
    </tr>
    <tr>
    <td valign="top"><B>inUse</B></td>
    <td>false</td>
    </tr>
  </table>
  <table border="1">
    <tr>
    <td valign="top"><B>name</B></td>
    <td>result name b</td>
    </tr>
    <tr>
    <td valign="top"><B>inUse</B></td>
    <td>false</td>
    </tr>
  </table>
  <table border="1">
    <tr>
    <td valign="top"><B>name</B></td>
    <td>result name c</td>
    </tr>
    <tr>
    <td valign="top"><B>inUse</B></td>
    <td>true</td>
    </tr>
  </table>
</body>
</html>

The expected result is to retrieve the values of the name and inUse fields, for example "result name a" and "false".

Thanks in advance.

1 answer:

Answer 0: (score: 0)

Assuming your input HTML always has this format:

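# Select every table directly under <body>; `tree` is the parsed document from parse_result
nodes = etree.XPath("/html/body/table")
for node in nodes(tree):
    # node[0][1] is the value cell of the first row (name), node[1][1] that of the second row (inUse)
    print('%s %s' % (node[0][1].text, node[1][1].text))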

Output from your sample HTML:

result name a false
result name b false
result name c true
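
For reference, here is a minimal, self-contained sketch of how that loop could be wired into the question's parse_result so that it returns the pairs instead of printing them. SAMPLE_HTML is a shortened copy of the question's HTML, and the return format (a list of tuples) is an assumption made for illustration, not something from the original post.

```python
from io import BytesIO

from lxml import etree

# Shortened copy of the HTML from the question (assumption: same layout).
SAMPLE_HTML = b"""<html><body>
<table border="1">
  <tr><td valign="top"><B>name</B></td><td>result name a</td></tr>
  <tr><td valign="top"><B>inUse</B></td><td>false</td></tr>
</table>
<table border="1">
  <tr><td valign="top"><B>name</B></td><td>result name c</td></tr>
  <tr><td valign="top"><B>inUse</B></td><td>true</td></tr>
</table>
</body></html>"""


def parse_result(result):
    """Return a list of (name, inUse) tuples, one per table."""
    parser = etree.HTMLParser()
    tree = etree.parse(BytesIO(result), parser)
    find_tables = etree.XPath("/html/body/table")
    # node[0][1] is the value cell of the first row, node[1][1] of the second.
    return [(node[0][1].text, node[1][1].text) for node in find_tables(tree)]


if __name__ == '__main__':
    for name, in_use in parse_result(SAMPLE_HTML):
        print('%s %s' % (name, in_use))
```

Returning the values rather than printing them keeps the parsing step separate from whatever the caller wants to do with the results.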

If the format can vary beyond what the sample HTML shows, you will probably need to get more creative with the XPath and add more input checking.
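
One possible way to depend less on row order is to key each value by the label in the first cell instead of using fixed indices. The sketch below is only an illustration under stated assumptions: the helper name tables_to_dicts is hypothetical, and the rule "first cell is the label, second cell is the value" is chosen to match the sample HTML, so real input may need different checks.

```python
from lxml import etree


def tables_to_dicts(html_bytes):
    """Return one {label: value} dict per <table> (hypothetical helper)."""
    root = etree.HTML(html_bytes)
    results = []
    for table in root.xpath("//table"):
        record = {}
        for row in table.xpath("./tr"):
            cells = row.xpath("./td")
            # Input check: skip rows that lack a label/value pair.
            if len(cells) < 2:
                continue
            # itertext() also collects text nested in tags such as <B>.
            label = "".join(cells[0].itertext()).strip()
            value = "".join(cells[1].itertext()).strip()
            if label:
                record[label] = value
        if record:
            results.append(record)
    return results


if __name__ == '__main__':
    sample = (b'<table><tr><td><B>inUse</B></td><td>false</td></tr>'
              b'<tr><td><B>name</B></td><td>result name a</td></tr></table>')
    for record in tables_to_dicts(sample):
        # .get() returns None when a field is missing from a table.
        print('%s %s' % (record.get('name'), record.get('inUse')))
```

Because the values are looked up by label, the rows can appear in any order, and a missing field shows up as None rather than raising an IndexError.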