
Question

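Is there a recommended way to escape the <, >, " and & characters when outputting HTML in plain Java code? (Other than manually doing the following, that is.)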

String source = "The less than sign (<) and ampersand (&) must be escaped before using them in HTML";
String escaped = source.replace("&", "&amp;").replace("<", "&lt;"); // escape & first, then the rest ...
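
When chaining replacements by hand, the order matters: the ampersand has to be handled first, otherwise the & introduced by an earlier replacement gets escaped a second time. A minimal sketch of the difference (the class and method names are hypothetical, chosen only for illustration):

```java
public class ManualEscape {
    // Escape '&' first so that ampersands introduced by later replacements are left alone.
    public static String escapeCorrect(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // Escaping '<' before '&' double-escapes: "<" becomes "&lt;" and then "&amp;lt;".
    public static String escapeWrongOrder(String s) {
        return s.replace("<", "&lt;").replace("&", "&amp;");
    }

    public static void main(String[] args) {
        System.out.println(escapeCorrect("a < b & c"));    // a &lt; b &amp; c
        System.out.println(escapeWrongOrder("a < b & c")); // a &amp;lt; b &amp; c
    }
}
```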

Answer 1

StringEscapeUtils from Apache Commons Lang:

import static org.apache.commons.lang.StringEscapeUtils.escapeHtml;
// ...
String source = "The less than sign (<) and ampersand (&) must be escaped before using them in HTML";
String escaped = escapeHtml(source);

For version 3:

import static org.apache.commons.lang3.StringEscapeUtils.escapeHtml4;
// ...
String escaped = escapeHtml4(source);

Answer 2

An alternative to Apache Commons: use Spring's HtmlUtils.htmlEscape(String input) method.

Answer 3

A short method:

public static String escapeHTML(String s) {
    StringBuilder out = new StringBuilder(Math.max(16, s.length()));
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (c > 127 || c == '"' || c == '<' || c == '>' || c == '&') {
            out.append("&#");
            out.append((int) c);
            out.append(';');
        } else {
            out.append(c);
        }
    }
    return out.toString();
}

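Based on https://stackoverflow.com/a/8838023/1199155 (the amp is missing there). According to {{3}}, the four characters checked in the if clause are the only ones below 128 that need escaping.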
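
To make the behaviour concrete, here is a small usage sketch. It repeats the escapeHTML method so that it compiles on its own; the wrapper class name EscapeHtmlDemo and the sample input are made up for illustration.

```java
public class EscapeHtmlDemo {
    // Same logic as the escapeHTML method above, repeated so the sketch is self-contained.
    public static String escapeHTML(String s) {
        StringBuilder out = new StringBuilder(Math.max(16, s.length()));
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 127 || c == '"' || c == '<' || c == '>' || c == '&') {
                // Write the character as a decimal numeric reference.
                out.append("&#").append((int) c).append(';');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \u00e9 is "e with acute accent" (code 233), which is above 127.
        System.out.println(escapeHTML("caf\u00e9 <b> & \"tea\""));
        // prints: caf&#233; &#60;b&#62; &#38; &#34;tea&#34;
    }
}
```

Note that the single quote (') passes through unchanged, so the result is only safe inside element content or double-quoted attribute values.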

Answer 4

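There is a newer version of the Apache Commons Lang library, and it uses a different package name (org.apache.commons.lang3). StringEscapeUtils now has different static methods for escaping different types of documents (http://commons.apache.org/proper/commons-lang/javadocs/api-3.0/index.html). So, to escape an HTML 4.0 string: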

import static org.apache.commons.lang3.StringEscapeUtils.escapeHtml4;

String output = escapeHtml4("The less than sign (<) and ampersand (&) must be escaped before using them in HTML");

Answer 5

On Android (API 16 or higher) you can use:

Html.escapeHtml(textToScape);

or, for lower API levels:

TextUtils.htmlEncode(textToScape);

Answer 6

For those who use Google Guava:

import com.google.common.html.HtmlEscapers;
[...]
String source = "The less than sign (<) and ampersand (&) must be escaped before using them in HTML";
String escaped = HtmlEscapers.htmlEscaper().escape(source);

Answer 7

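Be careful with this. There are a number of different "contexts" within an HTML document: inside an element, a quoted attribute value, an unquoted attribute value, a URL attribute, JavaScript, CSS, and so on. You need a different encoding method for each of these to prevent cross-site scripting (XSS). Check the OWASP XSS Prevention Cheat Sheet for details on each context: https://www.owasp.org/index.php/XSS_(Cross_Site_Scripting)_Prevention_Cheat_Sheet. You can find escaping methods for each context in the OWASP ESAPI library: https://github.com/ESAPI/esapi-java-legacy.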
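
As a rough illustration of why the context matters, the sketch below uses a separate encoder for element content, for quoted attribute values, and for URL query parameters. It is a simplified sketch built only on the JDK, not the ESAPI API; the class name, the method names, and the /search?q= path are hypothetical, and the JavaScript and CSS contexts are not covered at all.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class ContextEscapeSketch {
    // Element content: only &, < and > are significant here.
    public static String forText(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Quoted attribute value: additionally neutralise both quote characters.
    public static String forAttribute(String s) {
        return forText(s).replace("\"", "&quot;").replace("'", "&#39;");
    }

    // URL query parameter: percent-encode (form encoding, so a space becomes '+').
    public static String forUrlParam(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        String query = "a&b=c d";
        String label = "Tom & \"Jerry\"";
        // The URL parameter is percent-encoded first, then attribute-escaped as a whole.
        String link = "<a href=\"/search?q=" + forAttribute(forUrlParam(query))
                + "\" title=\"" + forAttribute(label) + "\">" + forText(label) + "</a>";
        System.out.println(link);
    }
}
```

Using forText for the title attribute would leave the double quotes in place and let the value break out of the attribute, which is exactly the kind of mistake the cheat sheet warns about.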

Answer 8

For some purposes, HtmlUtils:

import org.springframework.web.util.HtmlUtils;
[...]
HtmlUtils.htmlEscapeDecimal("&"); // gives &#38;
HtmlUtils.htmlEscape("&");        // gives &amp;

Answer 9

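While @dfa's answer recommending org.apache.commons.lang.StringEscapeUtils.escapeHtml is good, and I have used it in the past, it should not be used for escaping HTML (or XML) attributes; otherwise the whitespace will be normalized (meaning all adjacent whitespace characters become a single space).

I know this because I have had bugs filed against my library (JATL) for attributes whose whitespace was not preserved. As a result I have a drop-in (copy n' paste) class (of which I stole some from JDOM) that differentiates the escaping of attributes and element content.

While proper attribute escaping may not have mattered as much in the past, it is getting more attention now that HTML5's data- attributes are in use.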
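
One way to keep attribute whitespace intact is to write tabs and line breaks as numeric character references, which a parser does not fold into spaces during attribute-value normalization. The sketch below illustrates that idea under that assumption; it is not the JATL or JDOM class mentioned above, and the class and method names are hypothetical.

```java
public class AttributeEscapeSketch {
    // Escapes a value for use inside a double-quoted attribute. Tab, line feed and
    // carriage return are emitted as numeric references so that attribute-value
    // normalization does not turn them into plain spaces.
    public static String escapeAttribute(String value) {
        StringBuilder out = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\t': out.append("&#9;");   break;
                case '\n': out.append("&#10;");  break;
                case '\r': out.append("&#13;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeAttribute("line1\nline2\t\"x\""));
        // prints: line1&#10;line2&#9;&quot;x&quot;
    }
}
```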

Answer 10

org.apache.commons.lang3.StringEscapeUtils is now deprecated. You must now use org.apache.commons.text.StringEscapeUtils, available through this dependency:

    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-text</artifactId>
        <version>${commons.text.version}</version>
    </dependency>

Answer 11

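Most libraries escape everything they possibly can, including hundreds of symbols and thousands of non-ASCII characters, which is not what you want in a UTF-8 world.

Also, as Jeff Williams pointed out, there is no single "escape HTML" option; there are several contexts.

Assuming you never use unquoted attributes, and keeping in mind that different contexts exist, I have written my own version: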

private static final long BODY_ESCAPE =
        1L << '&' | 1L << '<' | 1L << '>';
private static final long DOUBLE_QUOTED_ATTR_ESCAPE =
        1L << '"' | 1L << '&' | 1L << '<' | 1L << '>';
private static final long SINGLE_QUOTED_ATTR_ESCAPE =
        1L << '"' | 1L << '&' | 1L << '\'' | 1L << '<' | 1L << '>';

// 'quot' and 'apos' are 1 char longer than '#34' and '#39' which I've decided to use
private static final String REPLACEMENTS = "&#34;&amp;&#39;&lt;&gt;";
private static final int REPL_SLICES = /*  |0,   5,   10,  15, 19, 23*/
        5<<5 | 10<<10 | 15<<15 | 19<<20 | 23<<25;
// These 5-bit numbers packed into a single int
// are indices within REPLACEMENTS which is a 'flat' String[]

private static void appendEscaped(
        StringBuilder builder,
        CharSequence content,
        long escapes // pass BODY_ESCAPE or *_QUOTED_ATTR_ESCAPE here
) {
    int startIdx = 0, len = content.length();
    for (int i = 0; i < len; i++) {
        char c = content.charAt(i);
        long one;
        if (((c & 63) == c) && ((one = 1L << c) & escapes) != 0) {
        // -^^^^^^^^^^^^^^^   -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        // |                  | take only dangerous characters
        // | Java uses only the 6 least significant bits of a long shift distance,
        // | e.g. << 0b110111111 is the same as << 0b111111.
        // | Filter out bigger characters

            int index = Long.bitCount(SINGLE_QUOTED_ATTR_ESCAPE & (one - 1));
            builder.append(content, startIdx, i /* exclusive */)
                    .append(REPLACEMENTS,
                            REPL_SLICES >>> 5*index & 31,
                            REPL_SLICES >>> 5*(index+1) & 31);
            startIdx = i + 1;
        }
    }
    builder.append(content, startIdx, len);
}

Consider copy-pasting from the Gist, which has no line length limit.
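
The (c & 63) == c guard in appendEscaped is easy to overlook, so here is a small sketch of what goes wrong without it. Java takes only the low six bits of the shift distance when shifting a long, so a character such as 'f' (102 = 64 + 38) would hit the same bit as '&' (38). The class and method names are hypothetical; only BODY_ESCAPE is taken from the code above.

```java
public class ShiftMaskDemo {
    // Mask with bits set for '&', '<' and '>', as in BODY_ESCAPE above.
    static final long BODY_ESCAPE = 1L << '&' | 1L << '<' | 1L << '>';

    // Naive check without the range guard.
    public static boolean naiveNeedsEscape(char c) {
        return ((1L << c) & BODY_ESCAPE) != 0;
    }

    // Check with the guard that rejects characters with a code of 64 or more.
    public static boolean guardedNeedsEscape(char c) {
        return (c & 63) == c && ((1L << c) & BODY_ESCAPE) != 0;
    }

    public static void main(String[] args) {
        // 'f' is 102 = 64 + 38, so 1L << 'f' equals 1L << '&'.
        System.out.println(naiveNeedsEscape('f'));   // true (a false positive)
        System.out.println(guardedNeedsEscape('f')); // false
        System.out.println(guardedNeedsEscape('&')); // true
    }
}
```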

Answer 12

Java 8+ solution:

public static String escapeHTML(String str) {
    return str.chars().mapToObj(c -> c > 127 || "\"'<>&".indexOf(c) != -1 ?
       "&#" + c + ";" : String.valueOf((char) c)).collect(Collectors.joining());
}

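String#chars returns an IntStream of the char values of the String. We can then use mapToObj to escape the characters with a character code greater than 127 (non-ASCII characters), as well as the double quote ("), single quote ('), left angle bracket (<), right angle bracket (>), and ampersand (&). Collectors.joining concatenates the Strings back together.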

To handle Unicode characters better, String#codePoints can be used instead.

public static String escapeHTML(String str) {
    return str.codePoints().mapToObj(c -> c > 127 || "\"'<>&".indexOf(c) != -1 ?
            "&#" + c + ";" : new String(Character.toChars(c)))
       .collect(Collectors.joining());
}
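
The difference between the two variants only shows up for characters outside the Basic Multilingual Plane, which Java stores as a surrogate pair of two chars. The sketch below puts both variants side by side under hypothetical names (escapeByChars, escapeByCodePoints) and runs them on U+1F600:

```java
import java.util.stream.Collectors;

public class CodePointEscapeDemo {
    // chars() variant: each UTF-16 code unit is escaped on its own.
    public static String escapeByChars(String str) {
        return str.chars().mapToObj(c -> c > 127 || "\"'<>&".indexOf(c) != -1
                ? "&#" + c + ";" : String.valueOf((char) c))
                .collect(Collectors.joining());
    }

    // codePoints() variant: a surrogate pair is escaped as a single code point.
    public static String escapeByCodePoints(String str) {
        return str.codePoints().mapToObj(c -> c > 127 || "\"'<>&".indexOf(c) != -1
                ? "&#" + c + ";" : new String(Character.toChars(c)))
                .collect(Collectors.joining());
    }

    public static void main(String[] args) {
        // U+1F600 is one code point but two chars (0xD83D, 0xDE00).
        String grin = new String(Character.toChars(0x1F600));
        System.out.println(escapeByChars(grin));      // &#55357;&#56832;
        System.out.println(escapeByCodePoints(grin)); // &#128512;
    }
}
```

The chars() output consists of two surrogate references, which are not valid character references in HTML, whereas the codePoints() output is a single reference to the intended character.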

Recommended method for escaping HTML in Java

12 answers:
