I wrote the following to truncate a string in Java to a new string that fits within a given number of bytes.
String truncatedValue = "";
String currentValue = string;
int pivotIndex = (int) Math.round(((double) string.length())/2);
while(!truncatedValue.equals(currentValue)){
currentValue = string.substring(0,pivotIndex);
byte[] bytes = null;
bytes = currentValue.getBytes(encoding);
if(bytes==null){
return string;
}
int byteLength = bytes.length;
int newIndex = (int) Math.round(((double) pivotIndex)/2);
if(byteLength > maxBytesLength){
pivotIndex = newIndex;
} else if(byteLength < maxBytesLength){
pivotIndex = pivotIndex + 1;
} else {
truncatedValue = currentValue;
}
}
return truncatedValue;
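This is the first thing that came to mind, and I know it can be improved. I saw another post asking a similar question, but the answers there truncated the string by working on the bytes rather than with String.substring. In my case I would rather use String.substring.
EDIT: I just removed the UTF8 reference, because I would like to be able to do this for different storage types as well.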
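For comparison, here is a minimal sketch of the same idea (binary search over String.substring, any Charset) with explicit loop bounds so that it always terminates. The class and method names are made up for illustration. It assumes the encoded length never shrinks as the prefix grows, which holds for UTF-8 and UTF-16 but is not checked here.

```java
import java.nio.charset.Charset;

public class SubstringTruncator {
    /** Longest prefix of s whose encoding in cs takes at most maxBytes bytes. */
    public static String truncate(String s, Charset cs, int maxBytes) {
        if (maxBytes <= 0) {
            return "";
        }
        if (s.getBytes(cs).length <= maxBytes) {
            return s;
        }
        int lo = 0;          // a prefix of this length is known to fit
        int hi = s.length(); // a prefix of this length is known not to fit
        while (hi - lo > 1) {
            int mid = lo + (hi - lo) / 2;
            if (s.substring(0, mid).getBytes(cs).length <= maxBytes) {
                lo = mid;
            } else {
                hi = mid;
            }
        }
        // Do not end the result on the first half of a surrogate pair.
        if (lo > 0 && Character.isHighSurrogate(s.charAt(lo - 1))) {
            lo--;
        }
        return s.substring(0, lo);
    }
}
```

Each probe re-encodes a prefix, so this costs roughly O(n log n) encoding work; it is meant to show a terminating version of the substring approach, not the fastest one.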
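Answer 0 (score: 13)
Why not convert to bytes and walk forward, obeying UTF8 character boundaries as you go, until you reach the maximum number, then convert those bytes back into a string?
Or you could just cut the original string if you keep track of where the cut should occur: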
// Assuming that Java will always produce valid UTF8 from a string, so no error checking!
// (Is this always true, I wonder?)
public class UTF8Cutter {
public static String cut(String s, int n) {
byte[] utf8 = s.getBytes(java.nio.charset.StandardCharsets.UTF_8); // explicit charset; the platform default may not be UTF-8
if (utf8.length < n) n = utf8.length;
int n16 = 0;
int advance = 1;
int i = 0;
while (i < n) {
advance = 1;
if ((utf8[i] & 0x80) == 0) i += 1;
else if ((utf8[i] & 0xE0) == 0xC0) i += 2;
else if ((utf8[i] & 0xF0) == 0xE0) i += 3;
else { i += 4; advance = 2; }
if (i <= n) n16 += advance;
}
return s.substring(0,n16);
}
}
Note: edited to fix bugs on 2014-08-25
Answer 1 (score: 5)
I think Rex Kerr's solution has 2 bugs.
Please find my corrected version below:
public String cut(String s, int charLimit) throws UnsupportedEncodingException {
byte[] utf8 = s.getBytes("UTF-8");
if (utf8.length <= charLimit) {
return s;
}
int n16 = 0;
boolean extraLong = false;
int i = 0;
while (i < charLimit) {
// Unicode characters above U+FFFF need 2 words in utf16
extraLong = ((utf8[i] & 0xF0) == 0xF0);
if ((utf8[i] & 0x80) == 0) {
i += 1;
} else {
int b = utf8[i];
while ((b & 0x80) > 0) {
++i;
b = b << 1;
}
}
if (i <= charLimit) {
n16 += (extraLong) ? 2 : 1;
}
}
return s.substring(0, n16);
}
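I still think this is far from efficient. So if you do not really need the String representation of the result and a byte array will do, you can use this: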
private byte[] cutToBytes(String s, int charLimit) throws UnsupportedEncodingException {
byte[] utf8 = s.getBytes("UTF-8");
if (utf8.length <= charLimit) {
return utf8;
}
if ((utf8[charLimit] & 0x80) == 0) {
// the limit doesn't cut an UTF-8 sequence
return Arrays.copyOf(utf8, charLimit);
}
int i = 0;
while ((utf8[charLimit-i-1] & 0x80) > 0 && (utf8[charLimit-i-1] & 0x40) == 0) {
++i;
}
if ((utf8[charLimit-i-1] & 0x80) > 0) {
// we have to skip the starter UTF-8 byte
return Arrays.copyOf(utf8, charLimit-i-1);
} else {
// we passed all UTF-8 bytes
return Arrays.copyOf(utf8, charLimit-i);
}
}
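Funnily enough, with a realistic 20-500 byte limit they perform almost the same IF you create a String from the byte array again.
Please note that both methods assume valid UTF-8 input, which is a valid assumption after using Java's getBytes() function.
Answer 2 (score: 5)
A saner solution is to use a decoder: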
final Charset CHARSET = Charset.forName("UTF-8"); // or any other charset
final byte[] bytes = inputString.getBytes(CHARSET);
final CharsetDecoder decoder = CHARSET.newDecoder();
decoder.onMalformedInput(CodingErrorAction.IGNORE);
decoder.reset();
final CharBuffer decoded = decoder.decode(ByteBuffer.wrap(bytes, 0, limit));
final String outputString = decoded.toString();
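Wrapped into a self-contained method, the snippet above might look like the sketch below. The class and method names are hypothetical, and the guards for limit are my addition (ByteBuffer.wrap throws if limit exceeds the array length). With CodingErrorAction.IGNORE, an incomplete multi-byte sequence left at the cut point is dropped instead of being decoded to a replacement character.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class DecoderTruncator {
    /** Truncates input so that its encoding in charset takes at most limit bytes. */
    public static String truncate(String input, Charset charset, int limit) {
        byte[] bytes = input.getBytes(charset);
        if (bytes.length <= limit) {
            return input; // already short enough
        }
        if (limit <= 0) {
            return "";
        }
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.IGNORE)
                .onUnmappableCharacter(CodingErrorAction.IGNORE);
        try {
            // Decode only the first `limit` bytes; a sequence cut in half
            // at the end counts as malformed input and is ignored.
            return decoder.decode(ByteBuffer.wrap(bytes, 0, limit)).toString();
        } catch (CharacterCodingException e) {
            // Not expected with IGNORE, but decode() declares it.
            throw new IllegalStateException(e);
        }
    }
}
```

For example, truncating "héllo" to 2 bytes in UTF-8 should give "h", because the two-byte "é" no longer fits.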
Answer 3 (score: 3)
Use the UTF-8 CharsetEncoder, and encode until the output ByteBuffer contains as many bytes as you are willing to take, by looking for CoderResult.OVERFLOW.
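A minimal sketch of that idea, with hypothetical class and method names: give the encoder an output buffer of exactly maxBytes and let it stop on its own. The encoder only advances the input position past characters it has fully written, so the input position after encoding is the cut point. Setting the error actions to REPLACE is my assumption, chosen to mirror what String.getBytes does with unpaired surrogates.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncoderTruncator {
    /** Longest prefix of s whose UTF-8 encoding takes at most maxBytes bytes. */
    public static String truncate(String s, int maxBytes) {
        if (maxBytes <= 0) {
            return "";
        }
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer out = ByteBuffer.allocate(maxBytes);
        CharBuffer in = CharBuffer.wrap(s);
        // Returns CoderResult.OVERFLOW when the next character does not fit,
        // or UNDERFLOW when the whole input was consumed. Either way,
        // in.position() is the number of chars that were fully encoded.
        encoder.encode(in, out, true);
        return s.substring(0, in.position());
    }
}
```

This makes a single pass over the input and stops as soon as the output buffer is full, so the cost depends on the byte limit rather than on the full string length.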
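Answer 4 (score: 3)
Answer 5 (score: 2)
As noted above, Peter Lawrey's solution has a major performance disadvantage (~3,500 ms for 10,000 runs). Rex Kerr's is much better (~500 ms for 10,000 runs), but its result is not accurate: it cuts much more than needed (instead of leaving 4000 bytes, it left 3500 for some examples). Attached is my solution (~250 ms for 10,000 runs), assuming the maximum length of a UTF-8 character is 4 bytes (thanks, Wikipedia):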
public static String cutWord (String word, int dbLimit) throws UnsupportedEncodingException{
double MAX_UTF8_CHAR_LENGTH = 4.0;
if(word.length()>dbLimit){
word = word.substring(0, dbLimit);
}
if(word.length() > dbLimit/MAX_UTF8_CHAR_LENGTH){
int residual=word.getBytes("UTF-8").length-dbLimit;
if(residual>0){
int tempResidual = residual,start, end = word.length();
while(tempResidual > 0){
start = end-((int) Math.ceil((double)tempResidual/MAX_UTF8_CHAR_LENGTH));
tempResidual = tempResidual - word.substring(start,end).getBytes("UTF-8").length;
end=start;
}
word = word.substring(0, end);
}
}
return word;
}
Answer 6 (score: 2)
s = new String(s.getBytes("UTF-8"), 0, MAX_LENGTH - 2, "UTF-8");
Answer 7 (score: 1)
You could convert the string to bytes and convert just those bytes back to a string.
public static String substring(String text, int maxBytes) {
StringBuilder ret = new StringBuilder();
for(int i = 0;i < text.length(); i++) {
// works out how many bytes a character takes,
// and removes these from the total allowed.
if((maxBytes -= text.substring(i, i+1).getBytes().length) < 0) break;
ret.append(text.charAt(i));
}
return ret.toString();
}
Answer 8 (score: 0)
Here is mine:
private static final int FIELD_MAX = 2000;
private static final Charset CHARSET = Charset.forName("UTF-8");
public String trancStatus(String status) {
if (status != null && (status.getBytes(CHARSET).length > FIELD_MAX)) {
int maxLength = FIELD_MAX;
int left = 0, right = status.length();
int index = 0, bytes = 0, sizeNextChar = 0;
while (bytes != maxLength && (bytes > maxLength || (bytes + sizeNextChar < maxLength))) {
index = left + (right - left) / 2;
bytes = status.substring(0, index).getBytes(CHARSET).length;
sizeNextChar = String.valueOf(status.charAt(index + 1)).getBytes(CHARSET).length;
if (bytes < maxLength) {
left = index - 1;
} else {
right = index + 1;
}
}
return status.substring(0, index);
} else {
return status;
}
}
Answer 9 (score: 0)
Using the regular expression below, you can also remove the leading and trailing whitespace of double-byte characters.
stringtoConvert = stringtoConvert.replaceAll("^[\\s ]*", "").replaceAll("[\\s ]*$", "");
Answer 10 (score: 0)
This may not be the most efficient solution, but it works:
public static String substring(String s, int byteLimit) {
if (s.getBytes().length <= byteLimit) {
return s;
}
int n = Math.min(byteLimit-1, s.length()-1);
do {
s = s.substring(0, n--);
} while (s.getBytes().length > byteLimit);
return s;
}
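Answer 11 (score: 0)
I have improved Peter Lawrey's solution so that it handles surrogate pairs accurately. In addition, I optimized it based on the fact that the maximum number of bytes per Java char (UTF-16 code unit) in UTF-8 encoding is 3.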
public static String substring(String text, int maxBytes) {
for (int i = 0, len = text.length(); (len - i) * 3 > maxBytes;) {
int j = text.offsetByCodePoints(i, 1);
if ((maxBytes -= text.substring(i, j).getBytes(StandardCharsets.UTF_8).length) < 0)
return text.substring(0, i);
i = j;
}
return text;
}
Answer 12 (score: 0)
A binary search approach in Scala:
private def bytes(s: String) = s.getBytes("UTF-8")
def truncateToByteLength(string: String, length: Int): String =
if (length <= 0 || string.isEmpty) ""
else {
@tailrec
def loop(badLen: Int, goodLen: Int, good: String): String = {
assert(badLen > goodLen, s"""badLen is $badLen but goodLen is $goodLen ("$good")""")
if (badLen == goodLen + 1) good
else {
val mid = goodLen + (badLen - goodLen) / 2
val midStr = string.take(mid)
if (bytes(midStr).length > length)
loop(mid, goodLen, good)
else
loop(badLen, mid, midStr)
}
}
loop(string.length * 2, 0, "")
}