How do I truncate a Java String so that I know it will fit in a given number of bytes of storage once it is UTF-8 encoded?
Answer 0 (score: 24)
Here is a simple loop that counts how big the UTF-8 representation is going to be, and truncates when the limit is exceeded:
public static String truncateWhenUTF8(String s, int maxBytes) {
    int b = 0;
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);

        // ranges from http://en.wikipedia.org/wiki/UTF-8
        int skip = 0;
        int more;
        if (c <= 0x007f) {
            more = 1;
        } else if (c <= 0x07FF) {
            more = 2;
        } else if (c <= 0xd7ff) {
            more = 3;
        } else if (c <= 0xDFFF) {
            // surrogate area, consume next char as well
            more = 4;
            skip = 1;
        } else {
            more = 3;
        }

        if (b + more > maxBytes) {
            return s.substring(0, i);
        }
        b += more;
        i += skip;
    }
    return s;
}
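This handles surrogate pairs that appear in the input string. Java's UTF-8 encoder (correctly) outputs a surrogate pair as a single 4-byte sequence instead of two 3-byte sequences, so truncateWhenUTF8() will return the longest truncated string it can. If you ignore surrogate pairs in the implementation, the truncated strings may be shorter than they need to be.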
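To see the 4-byte behavior for a surrogate pair concretely, here is a small sketch. The class and method names (SurrogateBytes, utf8Length) are made up for illustration; only standard JDK calls are used.

```java
import java.nio.charset.StandardCharsets;

final class SurrogateBytes {
    // Number of bytes the JDK's UTF-8 encoder produces for s.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // U+1D11E (musical G clef) is stored as two Java chars.
        String clef = "\uD834\uDD1E";
        System.out.println(clef.length());    // 2 chars
        System.out.println(utf8Length(clef)); // 4 bytes, not 6
    }
}
```

An implementation that counted each half of the pair as 3 bytes would budget 6 bytes for this string and truncate earlier than necessary.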
I haven't tested this code extensively, but here are some preliminary tests:
private static void test(String s, int maxBytes, int expectedBytes) {
    String result = truncateWhenUTF8(s, maxBytes);
    byte[] utf8 = result.getBytes(Charset.forName("UTF-8"));
    if (utf8.length > maxBytes) {
        System.out.println("BAD: our truncation of " + s + " was too big");
    }
    if (utf8.length != expectedBytes) {
        System.out.println("BAD: expected " + expectedBytes + " got " + utf8.length);
    }
    System.out.println(s + " truncated to " + result);
}

public static void main(String[] args) {
    test("abcd", 0, 0);
    test("abcd", 1, 1);
    test("abcd", 2, 2);
    test("abcd", 3, 3);
    test("abcd", 4, 4);
    test("abcd", 5, 4);

    test("a\u0080b", 0, 0);
    test("a\u0080b", 1, 1);
    test("a\u0080b", 2, 1);
    test("a\u0080b", 3, 3);
    test("a\u0080b", 4, 4);
    test("a\u0080b", 5, 4);

    test("a\u0800b", 0, 0);
    test("a\u0800b", 1, 1);
    test("a\u0800b", 2, 1);
    test("a\u0800b", 3, 1);
    test("a\u0800b", 4, 4);
    test("a\u0800b", 5, 5);
    test("a\u0800b", 6, 5);

    // surrogate pairs
    test("\uD834\uDD1E", 0, 0);
    test("\uD834\uDD1E", 1, 0);
    test("\uD834\uDD1E", 2, 0);
    test("\uD834\uDD1E", 3, 0);
    test("\uD834\uDD1E", 4, 4);
    test("\uD834\uDD1E", 5, 4);
}
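Updated: modified the code example; it now handles surrogate pairs.
Answer 1 (score: 22)
You should use CharsetEncoder. The simple approach of calling getBytes() and copying as many bytes as will fit can cut a UTF-8 character in half.
Something like this: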
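import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

// Encodes as much of input as fits into output and returns the number of bytes written.
public static int truncateUtf8(String input, byte[] output) {
    ByteBuffer outBuf = ByteBuffer.wrap(output);
    CharBuffer inBuf = CharBuffer.wrap(input.toCharArray());

    Charset utf8 = Charset.forName("UTF-8");
    // The encoder stops before any character that does not fit completely.
    utf8.newEncoder().encode(inBuf, outBuf, true);
    System.out.println("encoded " + inBuf.position() + " chars of " + input.length() + ", result: " + outBuf.position() + " bytes");
    return outBuf.position();
}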
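The method above returns only the byte count. A minimal sketch of the same idea that returns the truncated String instead is shown here. The names EncoderTruncate and truncate are hypothetical, and the sketch assumes the input contains no unpaired surrogates: with the default error action the encoder stops at one and returns an error result, so the output would end there.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

final class EncoderTruncate {
    // Returns the longest prefix of input whose UTF-8 form fits in maxBytes.
    static String truncate(String input, int maxBytes) {
        ByteBuffer out = ByteBuffer.allocate(maxBytes);
        CharBuffer in = CharBuffer.wrap(input);
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();

        // Encoding stops when the next character does not fit in the output
        // buffer; in.position() is then the number of chars consumed.
        encoder.encode(in, out, true);
        return input.substring(0, in.position());
    }

    public static void main(String[] args) {
        System.out.println(truncate("a\u00e9b", 2)); // 'e-acute' needs 2 bytes, only "a" fits
    }
}
```

Because the encoder consumes a surrogate pair as one unit, the returned prefix never ends in the middle of a pair.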
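Answer 2 (score: 11)
Here is what I came up with. It uses standard Java APIs, so it should be safe and compatible with all the Unicode weirdness, surrogate pairs, and so on. The solution is taken from http://www.jroller.com/holy/entry/truncating_utf_string_to_the, with a null check added and with decoding skipped when the string takes no more than maxBytes bytes.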
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

/**
 * Truncates a string to the number of characters that fit in X bytes, avoiding multi-byte characters being cut in
 * half at the cut-off point. Also handles surrogate pairs, where 2 characters in the string are actually one literal
 * character.
 *
 * Based on: http://www.jroller.com/holy/entry/truncating_utf_string_to_the
 */
public static String truncateToFitUtf8ByteLength(String s, int maxBytes) {
    if (s == null) {
        return null;
    }
    Charset charset = Charset.forName("UTF-8");
    CharsetDecoder decoder = charset.newDecoder();
    byte[] sba = s.getBytes(charset);
    if (sba.length <= maxBytes) {
        return s;
    }
    // Ensure truncation by having byte buffer = maxBytes
    ByteBuffer bb = ByteBuffer.wrap(sba, 0, maxBytes);
    CharBuffer cb = CharBuffer.allocate(maxBytes);
    // Ignore an incomplete character
    decoder.onMalformedInput(CodingErrorAction.IGNORE);
    decoder.decode(bb, cb, true);
    decoder.flush(cb);
    return new String(cb.array(), 0, cb.position());
}
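Answer 3 (score: 9)
UTF-8 encoding has a neat trait that lets you see where in a byte sequence you are.
Check the stream at the byte limit you want.
Example: if your stream is 31 33 31 C1 A3 32 33 00, you can make your string 1, 2, 3, 5, 6, or 7 bytes long, but not 4, because that would put the terminating 0 right after C1, which is the start of a multi-byte character.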
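The check described above can be sketched in Java as follows. In UTF-8 every continuation byte has the bit pattern 10xxxxxx, so a cut point is safe only if the first excluded byte is not a continuation byte; otherwise you step back until it is. The names Utf8Boundary, safeCut, and truncate are made up for illustration, and the sketch assumes the byte array holds valid UTF-8.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Boundary {
    // Largest cut point <= maxBytes that does not split a character.
    static int safeCut(byte[] utf8, int maxBytes) {
        if (maxBytes >= utf8.length) {
            return utf8.length;
        }
        int cut = Math.max(maxBytes, 0);
        // utf8[cut] is the first byte that would be dropped. If it is a
        // continuation byte (10xxxxxx), the cut is inside a character.
        while (cut > 0 && (utf8[cut] & 0xC0) == 0x80) {
            cut--;
        }
        return cut;
    }

    static String truncate(String s, int maxBytes) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, 0, safeCut(utf8, maxBytes), StandardCharsets.UTF_8);
    }
}
```

Since a UTF-8 character is at most 4 bytes long, the loop steps back at most 3 times for valid input.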
Answer 4 (score: 3)
You can calculate the number of bytes without doing any conversion.
foreach character in the Java string
    if 0 <= character <= 0x7f
        count += 1
    else if 0x80 <= character <= 0x7ff
        count += 2
    else if 0x800 <= character <= 0xd7ff // excluding the surrogate area
        count += 3
    else if 0xdc00 <= character <= 0xffff
        count += 3
    else { // surrogate, a bit more complicated
        count += 4
        skip one extra character in the input stream
    }
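You would have to detect surrogate pairs (U+D800-U+DBFF followed by U+DC00-U+DFFF) and count 4 bytes for each valid surrogate pair. If you get the first value in the first range and the second value in the second range, everything is fine: skip them both and add 4. If not, it is an invalid surrogate pair. I am not sure how Java deals with that, but your algorithm will have to count correctly in that (unlikely) case.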
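A possible Java version of this counting loop is sketched below. The names Utf8Count and utf8Length are hypothetical. For the invalid case it assumes the behavior documented for String.getBytes(Charset), which replaces malformed input with the charset's replacement bytes; for UTF-8 that is a single '?' byte, so an unpaired surrogate is counted as 1 here. If you encode some other way, that branch may need a different value.

```java
final class Utf8Count {
    // Counts the bytes String.getBytes(UTF_8) would produce, without encoding.
    static int utf8Length(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c <= 0x7F) {
                count += 1;
            } else if (c <= 0x7FF) {
                count += 2;
            } else if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                count += 4; // valid pair: one 4-byte sequence
                i++;        // skip the low surrogate
            } else if (Character.isSurrogate(c)) {
                count += 1; // unpaired surrogate: assumed to be encoded as '?'
            } else {
                count += 3;
            }
        }
        return count;
    }
}
```

Checking that the second char really is a low surrogate before skipping it keeps the count right when a high surrogate is followed by an ordinary character.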
Answer 5 (score: 3)
You can use: new String(data.getBytes("UTF-8"), 0, maxLen, "UTF-8");
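One thing worth checking before relying on this one-liner: if maxLen falls inside a multi-byte character, the String constructor does not drop the partial bytes but decodes them to the replacement character U+FFFD, which itself takes 3 bytes in UTF-8. The sketch below (hypothetical names NaiveSlice and naive) wraps the one-liner with a bounds guard, since the constructor throws when maxLen exceeds the array length, and shows the effect.

```java
import java.nio.charset.StandardCharsets;

final class NaiveSlice {
    // The one-liner, guarded so maxLen may exceed the encoded length.
    static String naive(String data, int maxLen) {
        byte[] bytes = data.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, 0, Math.min(maxLen, bytes.length), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "a" plus e-acute is 3 bytes; cutting at 2 splits the second character.
        String cut = naive("a\u00e9", 2);
        System.out.println(cut.length());                                // 2 chars: 'a' and U+FFFD
        System.out.println(cut.getBytes(StandardCharsets.UTF_8).length); // 4 bytes, over the limit of 2
    }
}
```

So the result can exceed the byte budget when re-encoded, which is the problem the other answers avoid by cutting on character boundaries.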