JSoup: strip HTML tags from XML

Date: 2014-02-17 16:01:47

Tags: java html jsoup

I have been searching Stack Overflow but could not find anyone with this kind of problem.

I want to do something like this:

Input string:

<?xml version="1.0" encoding="UTF-8" ?>
<List>
  <Object>
    <Section>Fruit</Section>
    <Category>Bananas</Category>
    <Brand>Chiquita</Brand>
    <Obs><p>
Vende-se a pe&ccedil;as ou o conjunto.</p><br>
    </Obs>
  </Object>
</List>

What I want is to remove the HTML tags, such as <p>, <br>, etc., so that it ends up like this:

<?xml version="1.0" encoding="UTF-8" ?>
<List>
  <Object>
    <Section>Fruit</Section>
    <Category>Bananas</Category>
    <Brand>Chiquita</Brand>
    <Obs>
Vende-se a pe&ccedil;as ou o conjunto.
    </Obs>
  </Object>
</List>

I have been playing with JSoup, but I can't seem to get it to work properly.

Here is my code:

Whitelist whitelist = Whitelist.none();
String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\" ?><List><Object><Section>Fruit</Section><Category>Bananas</Category><Brand>Chiquita</Brand><Obs><p>Vende-se a pe&ccedil;as ou o conjunto.</p><br></Obs></Object></List>";

whitelist.addTags(new String[]{"?xml", "List", "Object", "Section", "Category", "Brand", "Obs"});
String safe = Jsoup.clean(xml, whitelist);

This is the result I get:

FruitBananasChiquitaVende-se a pe&ccedil;as ou o conjunto.

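Thanks in advance.

2 answers:

Answer 0 (score: 3)

jsoup's HTML parser normalizes tag names to lowercase, so the whitelist entries must be lowercase as well. Use: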

whitelist.addTags(new String[] { "?xml", "list", "object", "section",
    "category", "brand", "obs" });

Output:

<list>
 <object>
  <section>
   Fruit
  </section>
  <category>
   Bananas
  </category>
  <brand>
   Chiquita
  </brand>
  <obs>
   Vende-se a pe&ccedil;as ou o conjunto.
  </obs></object>
</list>

Answer 1 (score: 2)

You can do this with unwrap():

Example:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.parser.Parser;

    final String input = "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n"
            + "<List>\n"
            + "  <Object>\n"
            + "    <Section>Fruit</Section>\n"
            + "    <Category>Bananas</Category>\n"
            + "    <Brand>Chiquita</Brand>\n"
            + "    <Obs><p>\n"
            + "Vende-se a pe&ccedil;as ou o conjunto.</p><br>\n"
            + "    </Obs>\n"
            + "  </Object>\n"
            + "</List>";

    // Parse with the XML parser instead of the default HTML parser
    Document doc = Jsoup.parse(input, "", Parser.xmlParser());

    doc.select("p").unwrap();  // removes every p tag but keeps its children
    doc.select("br").unwrap(); // br has no children, so this simply removes it

    System.out.println(doc);

It is better to use the XML parser here rather than the HTML parser, since the input is an XML document and not an HTML page.

Output:

<?xml version="1.0" encoding="UTF-8" ?> 
<list> 
 <object> 
  <section>
   Fruit
  </section> 
  <category>
   Bananas
  </category> 
  <brand>
   Chiquita
  </brand> 
  <obs>
    Vende-se a pe&ccedil;as ou o conjunto. 
  </obs> </object> 
</list>
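
For comparison, the same whitelist idea can be sketched in plain Java without jsoup. This is not the approach either answer uses: it is a regex-based illustration, and the names TagStripper and stripExcept are hypothetical. It assumes tags contain no '>' inside attribute values and that there are no CDATA sections with markup in them; for anything less regular, a real parser such as jsoup is the safer choice. Tag names are compared case-sensitively, so the original casing and the XML declaration are left untouched.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup: a regex-based sketch only.
class TagStripper {
    // Matches an opening or closing tag and captures its name.
    // "<?xml ... ?>" and "<!-- ... -->" do not match, because the name must start with a letter or '_'.
    private static final Pattern TAG =
            Pattern.compile("</?([A-Za-z_][A-Za-z0-9_.:-]*)[^>]*>");

    /** Drops every tag whose name is not in {@code allowed}; the text between tags is kept. */
    static String stripExcept(String xml, Set<String> allowed) {
        Matcher m = TAG.matcher(xml);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(xml, last, m.start());   // copy the text before this tag
            if (allowed.contains(m.group(1))) {
                out.append(m.group());          // keep whitelisted tags unchanged
            }
            last = m.end();
        }
        out.append(xml, last, xml.length());    // copy whatever follows the last tag
        return out.toString();
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n"
                + "<List><Object><Obs><p>\n"
                + "Vende-se a pe&ccedil;as ou o conjunto.</p><br>\n"
                + "</Obs></Object></List>";
        Set<String> allowed = new HashSet<>(Arrays.asList(
                "List", "Object", "Section", "Category", "Brand", "Obs"));
        System.out.println(stripExcept(xml, allowed));
    }
}
```

Because the text is copied through verbatim, entities such as &ccedil; are neither decoded nor re-encoded.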