Question

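I am using Jsoup.parse() to remove HTML tags from a string. But my string also contains words like <name>.

The problem is that Jsoup.parse() removes those too, because the text contains < and >. I can't simply strip < and > from the text either. How can I do this?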

String s1 = Jsoup.parse("<p>Hello World</p>").text();
//s1 is "Hello World". Correct

String s2 = Jsoup.parse("<name>").text();
//s2 is "". But it should be "<name>", because <name> is not an HTML tag

Answer 1

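"I am using Jsoup.parse() to remove html tags from a String."

You want the Jsoup#clean method. You will still need some manual work afterwards, because Jsoup will still treat <name> as an HTML tag.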

import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist; // newer jsoup releases call this class Safelist

// Define the list of words to preserve...
String[] myExceptions = { "name" };

// Build a whitelist for Jsoup that keeps those words as tags...
Whitelist myWhiteList = Whitelist.simpleText().addTags(myExceptions);

// Let Jsoup remove every other HTML tag...
String s2 = Jsoup.clean("<name>", myWhiteList);

// Jsoup closes the preserved tags, so collapse each "<name>...</name>" pair back to "<name>".
// ".*?" (rather than ".+?") also matches an empty pair such as "<name></name>".
for (String tag : myExceptions) {
    s2 = s2.replaceAll("(?s)<" + tag + ">.*?</" + tag + ">", "<" + tag + ">");
}

System.out.println(">>" + s2);

Output

>><name>
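
Instead of patching the output afterwards, another option is to escape the placeholder-like words before the string ever reaches Jsoup. The following is only a sketch using the plain JDK, not a Jsoup feature: the class name PlaceholderEscaper, the method escapeUnknownTags and the short HTML_TAGS list are all made up for illustration, and the list would need to be extended to whatever tags your input can actually contain.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlaceholderEscaper {
    // Assumption: a deliberately incomplete list of names to treat as real HTML tags.
    private static final Set<String> HTML_TAGS = new HashSet<>(Arrays.asList(
            "a", "b", "br", "div", "em", "i", "li", "ol", "p", "span", "strong", "ul"));

    // Matches an opening or closing tag and captures its name, e.g. <p>, </p>, <a href="x">.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)([^<>]*)>");

    // Rewrites every tag whose name is not in HTML_TAGS as &lt;...&gt; and leaves the rest alone.
    public static String escapeUnknownTags(String input) {
        Matcher m = TAG.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(2).toLowerCase(Locale.ROOT);
            String replacement = HTML_TAGS.contains(name)
                    ? m.group()
                    : "&lt;" + m.group(1) + m.group(2) + m.group(3) + "&gt;";
            // quoteReplacement keeps '$' and '\' in the input from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: <p>Hello &lt;name&gt;</p>
        System.out.println(escapeUnknownTags("<p>Hello <name></p>"));
    }
}
```

Passing the escaped string to Jsoup.parse(...).text() should then drop the real <p> tags and decode the entities back, leaving the <name> word in the text.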

