基本上,这篇文章是一个挑战。我一直在努力优化HTML转义功能,取得了一定的成功。但是我知道那里有一些严肃的Java黑客,他们可能比我做得更好,我很乐意学习。
我一直在分析我的Java Web应用程序,并发现一个主要的热点是我们的String转义函数。我们目前使用Apache Commons Lang来完成此任务,调用StringEscapeUtils.escapeHtml()。我认为它被广泛使用它会相当快,但即便是我最天真的实现也要快得多。
这是我和Naive实现一起使用的基准代码。它测试不同长度的字符串,有些只包含纯文本,有些包含需要转义的HTML。
public class HTMLEscapeBenchmark {
public static String escapeHtml(String text) {
if (text == null) return null;
StringBuilder sb = new StringBuilder();
for (int i = 0; i < text.length(); i++) {
char c = text.charAt(i);
if (c == '&') {
sb.append("&");
} else if (c == '\'') {
sb.append("'");
} else if (c == '"') {
sb.append(""");
} else if (c == '<') {
sb.append("<");
} else if (c == '>') {
sb.append(">");
} else {
sb.append(c);
}
}
return sb.toString();
}
/*
public static String escapeHtml(String text) {
if (text == null) return null;
return StringEscapeUtils.escapeHtml(text);
}
*/
public static void main(String[] args) {
final int RUNS = 5;
final int ITERATIONS = 1000000;
// Standard lorem ipsum text.
String loremIpsum = "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut " +
"labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut " +
"aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum " +
"dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia " +
"deserunt mollit anim id est laborum. ";
while (loremIpsum.length() < 1000) loremIpsum += loremIpsum;
// Add some characters that need HTML escaping. Bold every 2 and 3 letter word, quote every 5 letter word.
String loremIpsumHtml = loremIpsum.replaceAll("[A-Za-z]{2}]", "<b>$0</b>").replaceAll("[A-Za-z]{5}", "\"$0\"");
System.out.print("\nNormal-10");
String text = loremIpsum.substring(0, 10);
for (int run = 1; run <= RUNS; run++) {
long start = System.nanoTime();
for (int i = 0; i < ITERATIONS; i++) {
escapeHtml(text);
}
System.out.printf("\t%.3f", (System.nanoTime() - start) / 1e9);
}
System.out.print("\nNormal-100");
text = loremIpsum.substring(0, 100);
for (int run = 1; run <= RUNS; run++) {
long start = System.nanoTime();
for (int i = 0; i < ITERATIONS; i++) {
escapeHtml(text);
}
System.out.printf("\t%.3f", (System.nanoTime() - start) / 1e9);
}
System.out.print("\nNormal-1000");
text = loremIpsum.substring(0, 1000);
for (int run = 1; run <= RUNS; run++) {
long start = System.nanoTime();
for (int i = 0; i < ITERATIONS; i++) {
escapeHtml(text);
}
System.out.printf("\t%.3f", (System.nanoTime() - start) / 1e9);
}
System.out.print("\nHtml-10");
text = loremIpsumHtml.substring(0, 10);
for (int run = 1; run <= RUNS; run++) {
long start = System.nanoTime();
for (int i = 0; i < ITERATIONS; i++) {
escapeHtml(text);
}
System.out.printf("\t%.3f", (System.nanoTime() - start) / 1e9);
}
System.out.print("\nHtml-100");
text = loremIpsumHtml.substring(0, 100);
for (int run = 1; run <= RUNS; run++) {
long start = System.nanoTime();
for (int i = 0; i < ITERATIONS; i++) {
escapeHtml(text);
}
System.out.printf("\t%.3f", (System.nanoTime() - start) / 1e9);
}
System.out.print("\nHtml-1000");
text = loremIpsumHtml.substring(0, 1000);
for (int run = 1; run <= RUNS; run++) {
long start = System.nanoTime();
for (int i = 0; i < ITERATIONS; i++) {
escapeHtml(text);
}
System.out.printf("\t%.3f", (System.nanoTime() - start) / 1e9);
}
}
}
在我两岁的MacBook pro上,我得到以下结果。
Commons Lang StringEscapeUtils.escapeHtml
Normal-10 0.439 0.357 0.351 0.343 0.342
Normal-100 2.244 0.934 0.930 0.932 0.931
Normal-1000 8.993 9.020 9.007 9.043 9.052
Html-10 0.270 0.259 0.258 0.258 0.257
Html-100 1.769 1.753 1.765 1.754 1.759
Html-1000 17.313 17.479 17.347 17.266 17.246
天真实施
Normal-10 0.111 0.091 0.086 0.084 0.088
Normal-100 0.636 0.627 0.626 0.626 0.627
Normal-1000 5.740 5.755 5.721 5.728 5.720
Html-10 0.145 0.138 0.138 0.138 0.138
Html-100 0.899 0.901 0.896 0.901 0.900
Html-1000 8.249 8.288 8.272 8.262 8.284
我会发布自己最佳的优化尝试作为答案。所以,我的问题是,你能做得更好吗?什么是逃避HTML的最快方法?
答案 0 :(得分:3)
这是我优化它的最佳尝试。我针对纯文本字符串的常见情况进行了优化,但对于使用HTML实体的字符串,我无法做得更好。
public static String escapeHtml(String value) {
if (value == null) return null;
int length = value.length();
String encoded;
for (int i = 0; i < length; i++) {
char c = value.charAt(i);
if (c <= 62 && (encoded = getHtmlEntity(c)) != null) {
// We found a character to encode, so we need to start from here and buffer the encoded string.
StringBuilder sb = new StringBuilder((int) (length * 1.25));
sb.append(value.substring(0, i));
sb.append(encoded);
i++;
for (; i < length; i++) {
c = value.charAt(i);
if (c <= 62 && (encoded = getHtmlEntity(c)) != null) {
sb.append(encoded);
} else {
sb.append(c);
}
}
value = sb.toString();
break;
}
}
return value;
}
private static String getHtmlEntity(char c) {
switch (c) {
case '&': return "&";
case '\'': return "'";
case '"': return """;
case '<': return "<";
case '>': return ">";
default: return null;
}
}
Normal-10 0.021 0.023 0.011 0.012 0.011
Normal-100 0.074 0.074 0.073 0.074 0.074
Normal-1000 0.671 0.678 0.675 0.675 0.680
Html-10 0.222 0.152 0.153 0.153 0.154
Html-100 0.739 0.715 0.718 0.724 0.706
Html-1000 6.812 6.828 6.802 6.802 6.806
答案 1 :(得分:2)
我认为因为它被如此广泛地使用它会相当快,但即使是我最天真的实现也要快得多。
如果你查看Apache版本的源代码(for example),你会发现它正在处理你的版本忽略的一些案例:
简而言之,它更慢,因为它更通用。