Remove XML comments in Python without regular expressions or lxml

Time: 2015-07-24 01:28:01

Tags: python

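I need help removing XML comments using Python. I tried many regular expressions, but they also removed part of the text, which is not what I want. I do not want to use lxml. Please suggest a solution that uses built-in modules or methods, such as minidom.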

XML data:



<?xml version='1.0' encoding='utf-8'?>
<!--
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements.  See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License.  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->
<!-- The contents of this file will be loaded for each web application -->
<Context>

    <!-- Default set of monitored resources -->
    <WatchedResource>WEB-INF/web.xml</WatchedResource>

    <!-- Uncomment this to disable session persistence across Tomcat restarts -->
    <!--
    <Manager pathname="" />
    -->

    <!-- Uncomment this to enable Comet connection tacking (provides events
         on session expiration as well as webapp lifecycle) -->
    <!--
    <Valve className="org.apache.catalina.valves.CometConnectionManagerValve" />
    -->

</Context>

Expected output:

<?xml version='1.0' encoding='utf-8'?>
<Context>
    <WatchedResource>WEB-INF/web.xml</WatchedResource>
</Context>

Note: I pasted the data as an HTML snippet, but it is actually XML.

2 answers:

Answer 0: (score: 0)

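I don't think xml.dom has an existing API that removes comments automatically. You can use a simple recursive function like the one below to remove them: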

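import xml.dom.minidom as md
from xml.dom import Node

def removeComments(root):
    # Iterate over a copy, because removeChild() mutates root.childNodes.
    for c in root.childNodes[:]:
        if c.nodeType == Node.COMMENT_NODE:
            root.removeChild(c)
        elif c.nodeType in [Node.ELEMENT_NODE, Node.DOCUMENT_NODE]:
            # Recurse so that nested comments are removed as well.
            removeComments(c)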

Example of using this function:

>>> s= """<?xml version='1.0' encoding='utf-8'?>
... <!--
...   Licensed to the Apache Software Foundation (ASF) under one or more
...   contributor license agreements.  See the NOTICE file distributed with
...   this work for additional information regarding copyright ownership.
...   The ASF licenses this file to You under the Apache License, Version 2.0
...   (the "License"); you may not use this file except in compliance with
...   the License.  You may obtain a copy of the License at
...
...       http://www.apache.org/licenses/LICENSE-2.0
...
...   Unless required by applicable law or agreed to in writing, software
...   distributed under the License is distributed on an "AS IS" BASIS,
...   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
...   See the License for the specific language governing permissions and
...   limitations under the License.
... -->
... <!-- The contents of this file will be loaded for each web application -->
... <Context>
...
...     <!-- Default set of monitored resources -->
...     <WatchedResource>WEB-INF/web.xml</WatchedResource>
...
...     <!-- Uncomment this to disable session persistence across Tomcat restarts -->
...     <!--
...     <Manager pathname="" />
...     -->
...
...     <!-- Uncomment this to enable Comet connection tacking (provides events
...          on session expiration as well as webapp lifecycle) -->
...     <!--
...     <Valve className="org.apache.catalina.valves.CometConnectionManagerValve" />
...     -->
...
... </Context>"""
>>>
>>> root = md.parseString(s)
>>> removeComments(root)
>>> print(root.toprettyxml())
<?xml version="1.0" ?>
<Context>





        <WatchedResource>WEB-INF/web.xml</WatchedResource>













</Context>

If you want to get rid of the extra blank lines, you can also remove the newlines and tabs held in the TEXT_NODE children.
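The whitespace cleanup mentioned above can be sketched as follows. This is only a minimal illustration, not part of the original answer: the function name `strip_comments_and_blank_text` and the shortened `sample` document are made up for this example, and it assumes that whitespace-only text nodes carry no meaning in your document (which may not hold for mixed-content XML).

```python
import xml.dom.minidom as md
from xml.dom import Node


def strip_comments_and_blank_text(node):
    """Recursively remove comment nodes and whitespace-only text nodes.

    Hypothetical helper for illustration; assumes blank text is insignificant.
    """
    # Iterate over a copy, because removeChild() mutates node.childNodes.
    for child in list(node.childNodes):
        if child.nodeType == Node.COMMENT_NODE:
            node.removeChild(child)
        elif child.nodeType == Node.TEXT_NODE and not child.data.strip():
            # Text made up only of newlines, spaces, or tabs.
            node.removeChild(child)
        elif child.nodeType == Node.ELEMENT_NODE:
            strip_comments_and_blank_text(child)


# Shortened version of the document from the question (an assumption,
# used only to keep the example small).
sample = """<?xml version='1.0' encoding='utf-8'?>
<!-- file-level comment -->
<Context>

    <!-- Default set of monitored resources -->
    <WatchedResource>WEB-INF/web.xml</WatchedResource>

    <!--
    <Manager pathname="" />
    -->

</Context>"""

doc = md.parseString(sample)
strip_comments_and_blank_text(doc)
cleaned = doc.toprettyxml(indent="    ")
print(cleaned)
```

Because the blank text nodes are gone before `toprettyxml()` runs, the pretty-printer no longer has stray newlines to re-indent, so the uncommented `<WatchedResource>` element should stay in the output without the surrounding empty lines.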

Answer 1: (score: 0)

I did not get the desired output: the uncommented tags at the start are missing.

   

I got this same output earlier with regular expressions too, but it is not what I want, because the uncommented content is missing.

Thanks.