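How can I parse the HTML table result using the following code? An example of a better way to locate the elements in the HTML would be appreciated.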
import sys
from io import BytesIO

import requests
from lxml import etree


def http_request():
    try:
        url = "http://somehost/somehtml.html"
        r = requests.get(url, auth=("theUser", "thepass"))
        r.raise_for_status()  # requests.get() alone never raises HTTPError
        r.encoding = 'ISO-8859-1'  # only affects r.text, not r.content
        html = r.content  # raw bytes of the response body
        parse_result(html)
    except requests.HTTPError:
        sys.exit(1)


def parse_result(result):
    parser = etree.HTMLParser()
    tree = etree.parse(BytesIO(result), parser)
    # Here should be the logic to parse the html result :)


if __name__ == '__main__':
    http_request()
This is the HTML:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content=
"HTML Tidy for Linux/x86 (vers 25 March 2009), see www.w3.org" />
<title></title>
</head>
<body>
<table border="1">
<tr>
<td valign="top"><B>name</B></td>
<td>result name a</td>
</tr>
<tr>
<td valign="top"><B>inUse</B></td>
<td>false</td>
</tr>
</table>
<table border="1">
<tr>
<td valign="top"><B>name</B></td>
<td>result name b</td>
</tr>
<tr>
<td valign="top"><B>inUse</B></td>
<td>false</td>
</tr>
</table>
<table border="1">
<tr>
<td valign="top"><B>name</B></td>
<td>result name c</td>
</tr>
<tr>
<td valign="top"><B>inUse</B></td>
<td>true</td>
</tr>
</table>
</body>
</html>
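The expected result is to retrieve the values of the name and inUse fields, e.g. "result name a" and "false".
Thanks in advance.
Answer 0 (score: 0)
Assuming your input HTML has this format: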
nodes = etree.XPath("/html/body/table")
for node in nodes(tree):
    # node[0][1] is the value cell of the first row, node[1][1] of the second
    print('%s %s' % (node[0][1].text, node[1][1].text))
Output for your sample HTML:
result name a false
result name b false
result name c true
If the format can vary beyond the sample HTML, you will probably need to get more creative with the XPath and add more input checking.
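As a rough sketch of what "more creative" could look like, the snippet below reads each row's label from its first cell instead of relying on fixed positions, and skips rows or tables that do not match the expected shape. The names `extract_records` and `SAMPLE` are made up for illustration, and the sample markup is a shortened variant of the HTML in the question (row order swapped in one table, plus one malformed table) rather than real server output.

```python
from lxml import etree

# Hypothetical sample: a shortened variant of the HTML from the question.
SAMPLE = b"""
<html><body>
<table border="1">
<tr><td valign="top"><B>name</B></td><td>result name a</td></tr>
<tr><td valign="top"><B>inUse</B></td><td>false</td></tr>
</table>
<table border="1">
<tr><td valign="top"><B>inUse</B></td><td>true</td></tr>
<tr><td valign="top"><B>name</B></td><td>result name c</td></tr>
</table>
<table border="1">
<tr><td>malformed row</td></tr>
</table>
</body></html>
"""


def extract_records(html_bytes):
    """Return one {label: value} dict per table that has both fields."""
    tree = etree.fromstring(html_bytes, etree.HTMLParser())
    records = []
    for table in tree.xpath("//table"):
        record = {}
        for row in table.xpath(".//tr"):
            cells = row.xpath("./td")
            if len(cells) != 2:
                continue  # input check: ignore rows that are not label/value pairs
            # itertext() also collects text nested in tags such as <b>
            label = "".join(cells[0].itertext()).strip()
            value = "".join(cells[1].itertext()).strip()
            record[label] = value
        # input check: keep only tables that carry both expected fields
        if "name" in record and "inUse" in record:
            records.append(record)
    return records


records = extract_records(SAMPLE)
for record in records:
    print("%s %s" % (record["name"], record["inUse"]))
```

Looking fields up by label means a change in row order does not break the output, and the length and key checks keep an unexpected table from raising an IndexError.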