Question

I've been searching Stack Overflow, but I couldn't find anyone who has run into this kind of problem.

I want to do something like this:

Input string:

<?xml version="1.0" encoding="UTF-8" ?>
<List>
  <Object>
    <Section>Fruit</Section>
    <Category>Bananas</Category>
    <Brand>Chiquita</Brand>
    <Obs><p>
Vende-se a pe&ccedil;as ou o conjunto.</p><br>
    </Obs>
  </Object>
</List>

What I want is to remove the HTML tags, such as <p>, <br>, etc., so that it ends up like this:

<?xml version="1.0" encoding="UTF-8" ?>
<List>
  <Object>
    <Section>Fruit</Section>
    <Category>Bananas</Category>
    <Brand>Chiquita</Brand>
    <Obs>
Vende-se a pe&ccedil;as ou o conjunto.
    </Obs>
  </Object>
</List>

I've been playing with JSoup, but I can't seem to get it to work properly.

Here is my code:

Whitelist whitelist = Whitelist.none();
String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\" ?><List><Object><Section>Fruit</Section><Category>Bananas</Category><Brand>Chiquita</Brand><Obs><p>Vende-se a pe&ccedil;as ou o conjunto.</p><br></Obs></Object></List>";

whitelist.addTags(new String[]{"?xml", "List", "Object", "Section", "Category", "Brand", "Obs"});
String safe = Jsoup.clean(xml, whitelist);

This is the result I get:

FruitBananasChiquitaVende-se a pe&ccedil;as ou o conjunto.

Thanks in advance.

Answer 1

The HTML parser normalizes tag names to lowercase, so the whitelist entries must be lowercase too. Use:

whitelist.addTags(new String[] { "?xml", "list", "object", "section",
    "category", "brand", "obs" });

Output:

<list>
 <object>
  <section>
   Fruit
  </section>
  <category>
   Bananas
  </category>
  <brand>
   Chiquita
  </brand>
  <obs>
   Vende-se a pe&ccedil;as ou o conjunto.
  </obs></object>
</list>

Answer 2

You can do this with unwrap():

Example:

final String input = "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n"
        + "<List>\n"
        + " <Object>\n"
        + " <Section>Fruit</Section>\n"
        + " <Category>Bananas</Category>\n"
        + " <Brand>Chiquita</Brand>\n"
        + " <Obs><p>\n"
        + "Vende-se a peças ou o conjunto.</p><br>\n"
        + " </Obs>\n"
        + " </Object>\n"
        + "</List>";

Document doc = Jsoup.parse(input, "", Parser.xmlParser()); // XML parser!

doc.select("p").unwrap();  // unwraps all p tags
doc.select("br").unwrap(); // unwraps all br tags

It's better to use the XML parser here rather than the HTML parser.

Output:

<?xml version="1.0" encoding="UTF-8" ?>
<list>
 <object>
  <section>
   Fruit
  </section>
  <category>
   Bananas
  </category>
  <brand>
   Chiquita
  </brand>
  <obs>
   Vende-se a peças ou o conjunto.
  </obs>
 </object>
</list>
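
If the goal is only to delete a short, known list of tags and leave everything else untouched (original tag case, entities such as &ccedil;), a plain regex pass may be enough. The following is a minimal sketch using only the JDK, not jsoup, and it is not the approach from either answer above. The names StripHtmlTags and stripHtmlTags are hypothetical, and the sketch assumes the tags to remove are known up front (here p and br) and never contain a ">" inside an attribute value.

```java
import java.util.regex.Pattern;

class StripHtmlTags {
    // Hypothetical helper, not part of jsoup. Matches the opening, closing and
    // self-closing forms of <p> and <br>, ignoring case. The \b keeps the
    // pattern from matching longer names such as <Brand>.
    private static final Pattern HTML_TAGS =
            Pattern.compile("</?(?:p|br)\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Deletes the matched tags and keeps their content and all other markup.
    public static String stripHtmlTags(String xml) {
        return HTML_TAGS.matcher(xml).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n"
                + "<List>\n"
                + "  <Object>\n"
                + "    <Brand>Chiquita</Brand>\n"
                + "    <Obs><p>\n"
                + "Vende-se a pe&ccedil;as ou o conjunto.</p><br>\n"
                + "    </Obs>\n"
                + "  </Object>\n"
                + "</List>";
        System.out.println(stripHtmlTags(input));
    }
}
```

Because this never parses the document, it cannot tell real markup from tag-like text inside comments or CDATA sections. For anything beyond a fixed list of simple tags, the jsoup unwrap() approach above is the safer choice.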

JSoup: strip HTML tags from XML
