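Extracting a value from a web page by XPath

Time: 2018-01-10 10:49:49

Tags: python xpath

I want to extract a dynamic value from a web page. It has three possible values: PLAYING, BUFFERING, or IDLE. My code is below; at the moment it prints nothing.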

The XPath I want to extract is:

/html/body/table/tr[3]/th[2]/font

Code:

import urllib2
from lxml import etree
from lxml.cssselect import CSSSelector


SERVER = 'http://203.xx.xx.xxx:8080/uistatus.html'
authinfo = urllib2.HTTPPasswordMgrWithDefaultRealm()
authinfo.add_password(None, SERVER, 'admin', 'xxxxxxx')
page = 'http://203.82.99.234:8080/uistatus.html'
handler = urllib2.HTTPBasicAuthHandler(authinfo)
myopener = urllib2.build_opener(handler)
opened = urllib2.install_opener(myopener)
output = urllib2.urlopen(page)
htmlparser = etree.HTMLParser()
tree = etree.parse(output, htmlparser)

print output.read()


#/html/body/table/tr[3]/th[2]/font
td_empformbody = CSSSelector('html.body.table.tr[3].th[2].font')
for elem in td_empformbody(tree):
    # Do something with these table cells.
    print elem

Note that the element I want to extract changes between three values, as shown below:

<font color=#68e52b>PLAYING</font>
<font color=#efb506>BUFFERING</font>
<font color=#666666>IDLE</font>

Page source:

<html><!---->
<head>
<meta http-equiv=refresh content="2; url=uistatus.html">
<style type="text/css">
<!--
body,td,th {
    font-family: Arial, Helvetica, sans-serif;
    font-size: 13px;
    font-weight: bold;
    color: #333333;
}

th,td {
    background-color: #e6e8eb;
}

th.wh,td.wh {
    background-color: #ffffff;
}
-->
</style>
</head>
<body>
<table cellspacing="0" cellpadding="3" width="510px" align="center">

    <tr>
        <td><form action=setup.cgi method=POST style="float: right">
        <input type=hidden name=L value=uilogout.html><input type=hidden name=D><input type=submit value=" Logout ">
        </form>

    <tr>
        <td class="wh">Player
        <th colspan="2">Status
        <th colspan="2">&nbsp;&nbsp;Source&nbsp;&nbsp;
        <th colspan="2">&nbsp;&nbsp;Channel&nbsp;&nbsp;
        <th>Shuffle
        <th>Repeat
    <tr>
        <th class="wh">
        <th colspan="2"><font color=#68e52b>PLAYING</font>

Update

I modified the code above based on the comments below and now get the following error:

File "scripts/xpath.py", line 21, in <module>
    td_empformbody = CSSSelector('html.body.table.tr[3].th[2].font')
  File "/usr/local/lib/python2.7/dist-packages/lxml/cssselect.py", line 94, in __init__
    path = translator.css_to_xpath(css)
  File "/usr/local/lib/python2.7/dist-packages/cssselect/xpath.py", line 192, in css_to_xpath
    for selector in parse(css))
  File "/usr/local/lib/python2.7/dist-packages/cssselect/parser.py", line 355, in parse
    return list(parse_selector_group(stream))
  File "/usr/local/lib/python2.7/dist-packages/cssselect/parser.py", line 368, in parse_selector_group
    yield Selector(*parse_selector(stream))
  File "/usr/local/lib/python2.7/dist-packages/cssselect/parser.py", line 376, in parse_selector
    result, pseudo_element = parse_simple_selector(stream)
  File "/usr/local/lib/python2.7/dist-packages/cssselect/parser.py", line 435, in parse_simple_selector
    result = parse_attrib(result, stream)
  File "/usr/local/lib/python2.7/dist-packages/cssselect/parser.py", line 499, in parse_attrib
    attrib = stream.next_ident_or_star()
  File "/usr/local/lib/python2.7/dist-packages/cssselect/parser.py", line 754, in next_ident_or_star
    "Expected ident or '*', got %s" % (next,))
cssselect.parser.SelectorSyntaxError: Expected ident or '*', got <NUMBER '3' at 19>

1 answer:

Answer 0 (score: 0)

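Querying the parsed tree with XPath directly should make it work: tree.xpath("/html/body/table/tr[3]/th[2]/font/text()")

However, the XPath I recommend is tree.xpath("//th/font[@color]/text()"), because it keeps working even if tr[3] or th[2] shifts for some reason (for example, if another element is added to the table).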
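To try both expressions without the live server, here is a minimal sketch in Python 3 syntax. It assumes a trimmed, hand-closed copy of the status page from the question; SAMPLE_HTML and the variable names are illustrative and not part of the original script.

```python
from lxml import etree

# Trimmed copy of the status page from the question, with cells closed
# explicitly. This sample is an assumption made for illustration.
SAMPLE_HTML = b"""
<html><body>
<table>
  <tr><td>Logout</td></tr>
  <tr><td class="wh">Player</td><th colspan="2">Status</th></tr>
  <tr><th class="wh"></th><th colspan="2"><font color=#68e52b>PLAYING</font></th></tr>
</table>
</body></html>
"""

# Parse with the lenient HTML parser; fromstring returns the <html> element.
root = etree.fromstring(SAMPLE_HTML, etree.HTMLParser())

# Positional path from the question. lxml does not insert a <tbody>,
# so the <tr> elements stay direct children of <table>.
by_position = root.xpath("/html/body/table/tr[3]/th[2]/font/text()")

# Looser path: any <font> carrying a color attribute directly inside a <th>.
by_attribute = root.xpath("//th/font[@color]/text()")

print(by_position, by_attribute)
```

Both queries return a list of strings, so on the real page the status would be the first item of the result, if the list is not empty.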

Note: I did not use CSSSelector because I am not familiar with it, but since the question is tagged xpath, I assume XPath can solve your problem.