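I want to extract a dynamic value from a web page. It has 3 possible values: PLAYING, BUFFERING, or IDLE. Below is my code, which currently prints nothing.
The XPath I want to extract is
/html/body/table/tr[3]/th[2]/font
Code: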
import urllib2
from lxml import etree
from lxml.cssselect import CSSSelector
SERVER = 'http://203.xx.xx.xxx:8080/uistatus.html'
authinfo = urllib2.HTTPPasswordMgrWithDefaultRealm()
authinfo.add_password(None, SERVER, 'admin', 'xxxxxxx')
page = 'http://203.82.99.234:8080/uistatus.html'
handler = urllib2.HTTPBasicAuthHandler(authinfo)
myopener = urllib2.build_opener(handler)
opened = urllib2.install_opener(myopener)
output = urllib2.urlopen(page)
htmlparser = etree.HTMLParser()
tree = etree.parse(output, htmlparser)
print output.read()
#/html/body/table/tr[3]/th[2]/font
td_empformbody = CSSSelector('html.body.table.tr[3].th[2].font')
for elem in td_empformbody(tree):
    # Do something with these table cells.
    print elem
Note that the element to extract changes between 3 values, as shown below:
<font color=#68e52b>PLAYING</font>
<font color=#efb506>BUFFERING</font>
<font color=#666666>IDLE</font>
Page source:
<html><!---->
<head>
<meta http-equiv=refresh content="2; url=uistatus.html">
<style type="text/css">
<!--
body,td,th {
font-family: Arial, Helvetica, sans-serif;
font-size: 13px;
font-weight: bold;
color: #333333;
}
th,td {
background-color: #e6e8eb;
}
th.wh,td.wh {
background-color: #ffffff;
}
-->
</style>
</head>
<body>
<table cellspacing="0" cellpadding="3" width="510px" align="center">
<tr>
<td><form action=setup.cgi method=POST style="float: right">
<input type=hidden name=L value=uilogout.html><input type=hidden name=D><input type=submit value=" Logout ">
</form>
<tr>
<td class="wh">Player
<th colspan="2">Status
<th colspan="2"> Source
<th colspan="2"> Channel
<th>Shuffle
<th>Repeat
<tr>
<th class="wh">
<th colspan="2"><font color=#68e52b>PLAYING</font>
Update
I modified the code above based on the comments below, and now get the following error:
  File "scripts/xpath.py", line 21, in <module>
    td_empformbody = CSSSelector('html.body.table.tr[3].th[2].font')
  File "/usr/local/lib/python2.7/dist-packages/lxml/cssselect.py", line 94, in __init__
    path = translator.css_to_xpath(css)
  File "/usr/local/lib/python2.7/dist-packages/cssselect/xpath.py", line 192, in css_to_xpath
    for selector in parse(css))
  File "/usr/local/lib/python2.7/dist-packages/cssselect/parser.py", line 355, in parse
    return list(parse_selector_group(stream))
  File "/usr/local/lib/python2.7/dist-packages/cssselect/parser.py", line 368, in parse_selector_group
    yield Selector(*parse_selector(stream))
  File "/usr/local/lib/python2.7/dist-packages/cssselect/parser.py", line 376, in parse_selector
    result, pseudo_element = parse_simple_selector(stream)
  File "/usr/local/lib/python2.7/dist-packages/cssselect/parser.py", line 435, in parse_simple_selector
    result = parse_attrib(result, stream)
  File "/usr/local/lib/python2.7/dist-packages/cssselect/parser.py", line 499, in parse_attrib
    attrib = stream.next_ident_or_star()
  File "/usr/local/lib/python2.7/dist-packages/cssselect/parser.py", line 754, in next_ident_or_star
    "Expected ident or '*', got %s" % (next,))
cssselect.parser.SelectorSyntaxError: Expected ident or '*', got <NUMBER '3' at 19>
Answer 0 (score: 0)
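Switching to an XPath query like this should make it work: tree.xpath("/html/body/table/tr[3]/th[2]/font/text()")
However, the XPath I recommend is tree.xpath("//th/font[@color]/text()"), because it will still work even if tr[3] or th[2] shifts for some reason (for example, if another element is added to the table).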
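To make the difference between the two queries concrete, here is a minimal sketch that runs both against a trimmed-down copy of the status table. SAMPLE_HTML and extract_status are names made up for this illustration, and the markup is a simplified, well-formed stand-in for the real page, so treat it as a sketch rather than a drop-in replacement for the question's script.

```python
from lxml import etree

# Hypothetical, simplified stand-in for the status page in the question.
# The real page is fetched over HTTP and has more columns.
SAMPLE_HTML = """
<html><body>
<table>
  <tr><td>Logout form</td></tr>
  <tr><td class="wh">Player</td><th colspan="2">Status</th></tr>
  <tr><th class="wh"></th><th colspan="2"><font color="#68e52b">PLAYING</font></th></tr>
</table>
</body></html>
"""


def extract_status(html_text):
    """Return the text of the first <font color=...> inside a <th>, or None."""
    tree = etree.fromstring(html_text, etree.HTMLParser())
    # Position-independent query: any <th> with a <font> child that has a color attribute.
    matches = tree.xpath("//th/font[@color]/text()")
    return matches[0].strip() if matches else None


status = extract_status(SAMPLE_HTML)

# The absolute path from the question, evaluated on the same markup.
root = etree.fromstring(SAMPLE_HTML, etree.HTMLParser())
absolute = root.xpath("/html/body/table/tr[3]/th[2]/font/text()")

print(status)
print(absolute)
```

With the real page, the same function could be given the response body read from the opener in the question, as long as the body is read only once and the same string is used for both printing and parsing.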
Note: I did not use CSSSelector, which I am not familiar with, but since the question is tagged xpath, I think XPath can solve your problem.