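Using Python, how can I extract information from an HTML file based on the id tag?

Time: 2019-06-27 15:45:46

Tags: python html-parsing lxml

I am trying to create a Python script that extracts information from some HTML files. I have no trouble using os and glob to get all the necessary files. The hard part is parsing those files. This is my code so far: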

import os

from lxml import etree
...
parser = etree.HTMLParser(remove_comments=True, recover=True)
tree = etree.parse(os.path.join(path, filename), parser=parser)
...
# Walk every element in document order
for item in tree.iter():
    elem_id = item.attrib.get('id')

    if item.tag == 'title':
        device.name = item.text
    elif elem_id:
        # Store the element's own text under its id
        setattr(device, elem_id, item.text)

This code seems to work for some of the information in the file, for example:

<td id="type">Network Camera</td>

But the HTML file has a few lines like this:

<td colspan="2"><span id="name"></span>:&nbsp;XYZ</td>

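I do not get any useful information from these. I inserted print statements and can see the td element (which has no id and no text) and the span element (which has an id, but also no text).

Then there is this one:

<td><table><tr>
    <td><a href="..." id="ipLink">
            <span id="ipTxt"></span>
        </a>:&nbsp;
    </td><td>
        1.2.4.3&nbsp;(<span id="staTxt"></span>)
    </td>
</tr></table></td>

...where to my human eyes it seems obvious that I should get ip=1.2.4.3, but I do not know how to convince Python to extract it.


UPDATE:

Complete sample input file: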

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
    <meta http-equiv="Pragma" content="no-cache">
<title>AXIS M3037</title>
</head>
<body>


<table>
  <tr>
    <td id="type">Network Camera</td>
    <td>|</td>
    <td valign="middle" align="left" width=169 class="menuActive" id="mainMenu" nowrap>

    </td>
    <td><a href="/" id="tLViewTxt"><span id="ti2LViewTxt"></span></a></td>
    <td><a href="/?id=171" id="tSetTxt"><span id="ti2SetTxt"></span></a></td>
    <td colspan="2"><span id="version"></span>:&nbsp;1.23</td>

    <td>
        1.2.1.1&nbsp;(<span id="xyz"></span>)
    </td>
    <td colspan="2">
        <a href="/?id=171" id="dateTimeLink">
            <span id="datTimTxt"></span>
        </a>&nbsp;
        <input type="text" name="CurrentServerDate" value="2018-08-14" disabled>
        &nbsp;&nbsp;&nbsp;
        <input type="text" name="CurrentServerTime" value="11:03:49" disabled>
    </td>

    <td><table><tr>
        <td><a href="..." id="ipLink">
                <span id="ipTxt"></span>
            </a>:&nbsp;
        </td><td>
            1.2.4.3&nbsp;(<span id="staTxt"></span>)
        </td>
    </tr></table></td>
  </tr>
  <tr>
    <td nowrap colspan="2">:&nbsp;
        1
        &nbsp;<span id="videoTxt"></span>&nbsp;&nbsp;
        0
        &nbsp;<span id="audTxt"></span>
        &nbsp;&nbsp;</td>
    <td colspan="2" nowrap>
        <span id="upTimTxt"></span>&nbsp;
        <span id="theuptimevalue">130 days, 3:40</span></td>
  </tr>
</table>
</body>
</html>
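
A note on why the loop above sees no text for these cells: lxml stores the text that follows a child element in that child's `.tail`, not in the parent's `.text`, and an element's `.text` only covers what comes before its first child. The following is a minimal sketch using the one-line example from the question; the variable names and the wrapping `<table><tr>` are illustrative assumptions, not part of the original script.

```python
from lxml import html

# The one-line example from the question, wrapped in a table so it parses as a cell.
snippet = ('<table><tr><td colspan="2"><span id="name"></span>'
           ':&nbsp;XYZ</td></tr></table>')
table = html.fromstring(snippet)

td = table.xpath('.//td')[0]
span = td.xpath('.//span[@id="name"]')[0]

td_text = td.text        # None: nothing precedes the <span> inside the <td>
span_text = span.text    # None: the <span> itself is empty
span_tail = span.tail    # ':\xa0XYZ': the text that follows </span>

# text_content() concatenates all text inside the <td>, tails included.
full = td.text_content()
value = full.replace('\xa0', ' ').lstrip(': ').strip()

print(repr(td_text), repr(span_text), repr(span_tail), value)
```

So the value is reachable either through the `.tail` of the span or through `text_content()` of the enclosing cell.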

1 answer:

Answer 0: (score: 1)

Well, the following is rather convoluted and probably fragile, but it does the job on the HTML provided:

from lxml.html import fromstring

data = "..."  # paste the HTML from the question here

tree = fromstring(data)

# A cell whose own text holds the value
for typ in tree.xpath("*//td[@id='type']"):
    print('type', typ.text)
# Empty spans: go up to the enclosing <td> and read its text nodes
for spa in tree.xpath("*//span[@id='version']/../text()"):
    print('version', spa)
for spa in tree.xpath("*//span[@id='name']/../text()"):
    print(spa.replace(':', '').strip(),
          tree.xpath("*//span[@id='name']/../following-sibling::td/text()")[0].strip())
for spa in tree.xpath("(*//span[@id='staTxt']/..)[2]"):
    print('ipTxt', spa.text.strip())
for spa in tree.xpath("*//span[@id='videoTxt']/.."):
    print('videoTxt', spa.text.replace(':', '').strip())
for spa in tree.xpath("*//span[@id='audTxt']/.."):
    # With all whitespace removed the cell reads ':10'; index 2 is the second number
    num = "".join(spa.text_content().split())
    print('audTxt2', num[2])
for spa in tree.xpath("*//span[@id='theuptimevalue']"):
    # Note: replace(':', '') also drops the colon inside '3:40'
    print('theuptimevalue', spa.text.replace(':', '').strip())

Output:

type Network Camera
version : 1.23
XYZ 1.2.1.1
ipTxt 1.2.4.3
videoTxt 1
audTxt2 0
theuptimevalue 130 days, 340

If you use it, you can probably improve on it, but this should be a start...
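
One possible direction for such an improvement, offered as a rough sketch rather than a tested solution: loop over every element that carries an id, take its own text if it has any, and otherwise fall back to the full text of the nearest enclosing `<td>`. The helper names `clean` and `extract_by_id`, the characters being trimmed, and the shortened `sample` markup are all assumptions made for illustration.

```python
from lxml import html


def clean(text):
    """Collapse whitespace (including &nbsp;) and trim stray ':', '(' and ')'."""
    return " ".join((text or "").split()).strip(" :()")


def extract_by_id(source):
    """Return {id: text} for every element that has an id and some nearby text."""
    tree = html.fromstring(source)
    result = {}
    for el in tree.xpath("//*[@id]"):
        value = clean(el.text)
        if not value:
            # Fall back to all the text of the nearest enclosing <td>.
            cells = el.xpath("ancestor-or-self::td[1]")
            if cells:
                value = clean(cells[0].text_content())
        if value:
            result[el.get("id")] = value
    return result


# A shortened stand-in for the sample file in the question.
sample = """
<html><body><table><tr>
  <td id="type">Network Camera</td>
  <td colspan="2"><span id="version"></span>:&nbsp;1.23</td>
  <td>1.2.4.3&nbsp;(<span id="staTxt"></span>)</td>
  <td><span id="upTimTxt"></span>&nbsp;<span id="theuptimevalue">130 days, 3:40</span></td>
</tr></table></body></html>
"""

info = extract_by_id(sample)
for key, value in info.items():
    print(key, value)
```

This keeps the colon in the uptime value, but it has limits of its own: when one cell holds several values (as with videoTxt and audTxt in the full file), every id in that cell receives the same combined text, so those cells would still need their own handling.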