Modifying HTML files based on their tags?

Time: 2015-05-05 19:03:20

Tags: python html

I have several HTML files whose content looks like this:

<html>
    <header>
        <title>A test</title>
    </header>
    <body>
        <table>
            <tr>
                <td id="MenuTD" style="vertical-align: top;"> 
                    Stuff here <a>with a link</a>
                    <p>Or paragraph tags</p>
                    <div>Or a DIV</div>
                </td>
                <td>Another TD element, without the MenuTD id</td>
            </tr>
        </table>
        <div>
             <link rel="stylesheet" href="\d\d\d\d_files/zannotationtargettoggle.css" type="text/css">
        </div>
    </body>
</html>

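Here \d is a placeholder for a digit; the exact digits vary from file to file.

I want to write a Python program that converts each HTML file to the following format: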

<html>
    <header>
        <title>A test</title>
    </header>
    <body>
        <link rel="stylesheet" href="\d\d\d\d_files/zannotationtargettoggle.css" type="text/css">
        <td id="MenuTD" style="vertical-align: top;"> 
            Stuff here <a>with a link</a>
            <p>Or paragraph tags</p>
            <div>Or a DIV</div>
        </td>
    </body>
</html>

Specifically:

  1. How do we extract the header tag <header>...</header> and the <link rel="stylesheet" href="\d\d\d\d_files/zannotationtargettoggle.css" type="text/css"> tag, given that neither has an ID?

  2. If the body tag has attributes, for example <body style="margin-left: 6px; cursor: default;" onload="InitBody();">...</body>, how should we empty the content between its opening and closing tags and then add the <link rel="stylesheet" href="\d\d\d\d_files/zannotationtargettoggle.css" type="text/css"> tag and the content of menu_td inside it?

Thanks!

1 answer:

Answer 0: (score: 2)

You can modify the input document in place with BeautifulSoup:

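import bs4

# s holds the input HTML as a string; with no parser argument bs4 picks the best parser installed
doc = bs4.BeautifulSoup(s)
td = doc.find('td')  # the first <td>, which is the one with id="MenuTD"
# Put the <link> where the <table> was, then put the <td> where the now-empty <div> was
doc.find('table').replace_with(doc.find('link'))
doc.find('div').replace_with(td)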

Checking the resulting document:

>>> print(str(doc))
<html>
<body><header>
<title>A test</title>
</header>
<link href="\d\d\d\d_files/zannotationtargettoggle.css" rel="stylesheet" type="text/css"/>
<td id="MenuTD" style="vertical-align: top;"> 
                    Stuff here <a>with a link</a>
<p>Or paragraph tags</p>
<div>Or a DIV</div>
</td>
</body></html>
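
The snippet above takes the first td, table and div it finds, which works for the sample but depends on element order. A more targeted variant might look like the sketch below. It assumes the standard-library-backed "html.parser" backend, matches the stylesheet by a regular expression on its href, and uses Tag.clear() so that the body keeps its attributes while losing its children. The names HREF_RE and rewrite_body, and the digits 1234 in the sample, are made up for illustration.

```python
import re

import bs4

# Shortened sample in the shape of the question; "1234" stands in for the \d\d\d\d digits.
SAMPLE = """<html>
<header><title>A test</title></header>
<body style="margin-left: 6px; cursor: default;" onload="InitBody();">
<table><tr>
<td id="MenuTD" style="vertical-align: top;">Stuff here <a>with a link</a></td>
<td>Another TD element, without the MenuTD id</td>
</tr></table>
<div><link rel="stylesheet" href="1234_files/zannotationtargettoggle.css" type="text/css"></div>
</body>
</html>"""

# Assumed pattern: four digits followed by the fixed stylesheet path.
HREF_RE = re.compile(r"^\d{4}_files/zannotationtargettoggle\.css$")


def rewrite_body(html):
    """Return a parsed document whose body holds only the link and the MenuTD cell."""
    doc = bs4.BeautifulSoup(html, "html.parser")
    # extract() detaches a node from the tree but keeps it usable afterwards.
    link = doc.find("link", href=HREF_RE).extract()
    menu_td = doc.find("td", id="MenuTD").extract()
    body = doc.body
    body.clear()  # drops the children; the tag's attributes are left alone
    body.append(link)
    body.append(menu_td)
    return doc


doc = rewrite_body(SAMPLE)
print(doc)
```

If a file lacks the link or the MenuTD cell, find() returns None and the extract() call raises AttributeError, so real input may need a check before each step.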

Alternatively, you can build a new document:

doc = bs4.BeautifulSoup(s)
doc2 = bs4.BeautifulSoup('<html />')  # empty target document
doc2.html.append(doc.header)  # append() moves the tag out of doc and into doc2
doc2.html.append(doc2.new_tag('body'))
doc2.body.append(doc.link)
doc2.body.append(doc.find('td'))
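
Since the question mentions several files, the new-document approach could be wrapped in a function and applied to each file in turn. The following is only a sketch under a few assumptions: it uses the "html.parser" backend, the files are UTF-8 and end in .html, and overwriting them in place is acceptable. The names build_new_document and convert_directory, and the digits 1234 in the sample, are hypothetical.

```python
from pathlib import Path

import bs4

# Shortened sample in the shape of the question; "1234" stands in for the \d\d\d\d digits.
SAMPLE = (
    '<html><header><title>A test</title></header><body><table><tr>'
    '<td id="MenuTD">Stuff here <a>with a link</a></td>'
    '<td>Another TD element, without the MenuTD id</td></tr></table>'
    '<div><link rel="stylesheet" href="1234_files/zannotationtargettoggle.css" '
    'type="text/css"></div></body></html>'
)


def build_new_document(html):
    """Build a fresh document holding the header, the link and the MenuTD cell."""
    src = bs4.BeautifulSoup(html, "html.parser")
    out = bs4.BeautifulSoup("<html></html>", "html.parser")
    out.html.append(src.header)  # the question's markup uses <header>, not <head>
    body = out.new_tag("body")
    out.html.append(body)
    body.append(src.find("link"))
    body.append(src.find("td", id="MenuTD"))
    return str(out)


def convert_directory(directory):
    """Rewrite every *.html file in the directory in place (assumption: UTF-8 files)."""
    for path in Path(directory).glob("*.html"):
        converted = build_new_document(path.read_text(encoding="utf-8"))
        path.write_text(converted, encoding="utf-8")


result = build_new_document(SAMPLE)
print(result)
```

Writing to a separate output directory instead of overwriting is the safer choice while the selectors are still being checked against the real files.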