Question

I have an XML file that contains some HTML tags. I want to keep the XML tags but remove the HTML tags. For example, given the following structure:

<xml_tag_parent>
     <xml_tag_child>
       Some text here <p> some parag here </p>
     </xml_tag_child>
</xml_tag_parent>

I would like to get:

<xml_tag_parent>
     <xml_tag_child>
       Some text here some parag here 
     </xml_tag_child>
</xml_tag_parent>

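I don't know in advance which XML tags will appear. Also note that the HTML tags may be nested, so I can't just take the node's value. For example, given the following XML document: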

<description id="description">
  <heading id="h-0001" level="1">CROSS REFERENCE</heading>
  <p id="p-0002" num="0001">The Paragraph </p>
  <claim attr="someAttr"> abcs </claim>
  <claim attr="2">
    <p> this is another paragraph <b>with some bold things</b> </p>
  </claim>
</description>

I would like to get:

<description id="description">
  CROSS REFERENCE The Paragraph
  <claim attr="someAttr"> abcs </claim>
  <claim attr="2">
    this is another paragraph with some bold things
  </claim>
</description>

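I could hard-code all the HTML tags, then find and remove them. For example, I could look for the <p> tag and replace it with an empty string, but that doesn't sound right, and I would also have to cover a lot of tags. Is there a library or a better way to do this in Java?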

Answer 1

You can use the Jericho jar to achieve this.

It can extract the HTML tags and ignore all other tags, which matches your requirement.

http://jericho.htmlparser.net/docs/index.html

Answer 2

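You can use the Jsoup library, which can help you remove HTML tags. A complete tutorial is available from jsoup. The code for stripping HTML tags is:

import org.jsoup.Jsoup;

// Parses the input as HTML and returns only its text content.
public static String htmltagremove(String html) {
     return Jsoup.parse(html).text();
}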
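
One caveat: text() returns plain text only, so the XML tags would be dropped along with the HTML ones. As an alternative, here is a minimal sketch that uses only the JDK's built-in DOM API rather than Jsoup or Jericho. It assumes the input is well-formed XML and that you can supply the set of tag names to strip. The class name HtmlTagStripper, the method strip and the HTML_TAGS list are hypothetical names chosen for illustration; "heading" is included only because the question's example treats it as markup to remove. Matching is case-sensitive.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class HtmlTagStripper {
    // Assumption: an illustrative list only; extend it with the tags you expect.
    private static final Set<String> HTML_TAGS =
            Set.of("p", "b", "i", "em", "strong", "span", "heading");

    /** Parses the XML, unwraps every element named in HTML_TAGS, and serializes the result. */
    public static String strip(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        unwrap(doc.getDocumentElement());

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    // Replaces each HTML child element of 'node' with that element's own children.
    private static void unwrap(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember before the tree changes
            if (child instanceof Element) {
                unwrap(child); // handle nested tags first
                if (HTML_TAGS.contains(child.getNodeName())) {
                    // Move the element's children up in front of it, then drop it.
                    while (child.getFirstChild() != null) {
                        node.insertBefore(child.getFirstChild(), child);
                    }
                    node.removeChild(child);
                }
            }
            child = next;
        }
    }
}
```

For example, calling HtmlTagStripper.strip on the description document from the question (with a valid closing tag) should keep the description and claim elements and their attributes, while the heading, p and b tags are removed and their text stays in place. The root element is never unwrapped, since a document needs one.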
