Question

I am given the following value (it uses Windows-1252 escapes):

ABC&#145; &#146; &#147; &#148; &#226;,&#234;,&#238;,&#244;,&#251; (I had to add spaces to show the exact value; in the real value there are no spaces between the numeric references.)

The actual value, which is what I want to get, is the following:

ABC‘’“”â,ê,î,ô,û

I tried HtmlUtils.htmlUnescape(decodingString); but it did not work. The output I get looks like ABCâ,ê,î,ô,û

The ‘’“” characters are gone.

Could you show me how to do this in Java?

Answer 1

You can use a regular expression.

    import java.nio.charset.Charset;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class Cp1252Unescape {
        public static void main(String[] args) {
            Pattern p = Pattern.compile("&#(\\d+);");
            // a Charset object avoids the checked UnsupportedEncodingException
            Charset cp1252 = Charset.forName("windows-1252");
            StringBuilder out = new StringBuilder();

            String s = "ABC&#145;&#146;&#226;D";
            Matcher m = p.matcher(s);
            int startIdx = 0;
            byte[] bytes = new byte[1];
            while (m.find()) {
                // copy the literal text between the previous match and this one
                out.append(s, startIdx, m.start());
                // fetch the numeric value and put it into a byte array (assumes a value of 0-255)
                bytes[0] = (byte) Short.parseShort(m.group(1));
                // convert the Windows-1252 encoded byte into a Java string
                out.append(new String(bytes, cp1252));
                startIdx = m.end();
            }
            // copy whatever follows the last match
            out.append(s.substring(startIdx));

            System.out.println(out);
        }
    }

The output will be something like

ABC‘’âD
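The loop above can also be packaged as a reusable method. The following is only a sketch under a few assumptions of mine: the class and method names (`Cp1252Entities`, `decode`, `decodeOne`) are hypothetical, and I chose to treat only the values 128 to 159 as Windows-1252 bytes, which is, as far as I know, also what HTML5 parsers do with such references. Every other value is taken as an ordinary Unicode code point, so a reference above 255 such as `&#8364;` is not truncated to a single byte.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Cp1252Entities {
    private static final Pattern NUMERIC_ENTITY = Pattern.compile("&#(\\d+);");
    private static final Charset CP1252 = Charset.forName("windows-1252");

    /** Decodes decimal references, reading 128-159 as Windows-1252 bytes. */
    static String decode(String input) {
        Matcher m = NUMERIC_ENTITY.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // literal text between the previous reference and this one
            out.append(input, last, m.start());
            out.append(decodeOne(m.group(1), m.group()));
            last = m.end();
        }
        // literal text after the last reference
        out.append(input, last, input.length());
        return out.toString();
    }

    private static String decodeOne(String digits, String original) {
        int value;
        try {
            value = Integer.parseInt(digits);
        } catch (NumberFormatException e) {
            return original; // too many digits for an int: leave the text untouched
        }
        if (value >= 128 && value <= 159) {
            // the range where Windows-1252 and Unicode disagree
            return new String(new byte[] {(byte) value}, CP1252);
        }
        if (!Character.isValidCodePoint(value)) {
            return original; // beyond U+10FFFF: leave the text untouched
        }
        return new String(Character.toChars(value));
    }

    public static void main(String[] args) {
        System.out.println(decode("ABC&#145;&#146;&#147;&#148;&#226;,&#234;,&#238;,&#244;,&#251;"));
    }
}
```

Named entities such as `&amp;` are deliberately left alone here; if the input contains them too, a general HTML unescaper would still be needed for those.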

Answer 2

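The quote characters are probably still in the string; they are just invisible when displayed. That is because in Unicode and ISO 8859-1, code point 145 is not assigned to a visible character (it falls in the C1 control range).

The best solution (if possible) is to pass the encoding to the unescapeHtml method.

Another approach is to call htmlUnescape first and then map the cp1252 code points to the corresponding Unicode code points with the following code: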

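    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;
    import org.springframework.web.util.HtmlUtils; // assuming the question's HtmlUtils is Spring's

    String unescapeHtmlCp1252(String input) {
        // numeric references such as &#145; come out as the chars with values 145, 146, ...
        String nohtml = HtmlUtils.htmlUnescape(input);
        // chars 0-255 map one-to-one to ISO 8859-1 bytes; anything above 255 becomes '?'
        byte[] bytes = nohtml.getBytes(StandardCharsets.ISO_8859_1);
        // reading those bytes as cp1252 turns 145-148 into the curly quotes
        return new String(bytes, Charset.forName("cp1252"));
    }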

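When you step through this code with a debugger and inspect the nohtml string, you will probably see characters with the values 145, 146, and so on. This means the characters are still present at that point.

Later, when a font is used to turn the characters into pixels, there is no glyph defined for these characters, so they are simply skipped. But up to that step, they are still there.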
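The debugger observation can be reproduced without any HTML library. This is a minimal sketch, not part of the original answer: the class and method names are made up for illustration, and the `nohtml` literal is a stand-in for what an unescape step that maps `&#145;` straight to the char with value 145 would return. It prints the code points before and after the ISO 8859-1 to cp1252 round trip.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class InvisibleQuotesDemo {
    /** Re-reads chars 0-255 as Windows-1252 bytes (chars above 255 become '?'). */
    static String remapCp1252(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, Charset.forName("windows-1252"));
    }

    /** Lists the code points of a string, e.g. "U+0041 U+0091". */
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // stand-in for the unescaped string: 'A', 'B', 'C', chars 145 and 146, then U+00E2
        String nohtml = "ABC\u0091\u0092\u00e2";
        System.out.println(codePoints(nohtml));
        // prints: U+0041 U+0042 U+0043 U+0091 U+0092 U+00E2
        System.out.println(codePoints(remapCp1252(nohtml)));
        // prints: U+0041 U+0042 U+0043 U+2018 U+2019 U+00E2
    }
}
```

The first line shows that the control characters U+0091 and U+0092 are present even though nothing is drawn for them; the second shows them replaced by the curly quotes U+2018 and U+2019.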
