What is a better way to cut a string that includes 2-byte characters in Java?

Time: 2017-08-08 14:13:01

Tags: java split substring messages cjk

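I am writing a method that creates fixed-length messages for an interface between two other systems.

Each item must be transmitted at its agreed length (in bytes). If a message is longer than the agreed length, it should be truncated to the item's length.

The message contains 2-byte characters, so if it is cut in the middle of a character, the result is corrupted.

To find the correct byte position, the method scans from the beginning of the string up to the length to cut. If the message is long, performance will probably be poor.

I could not find a better way, so I am asking for help here. Sorry that the code is complicated and redundant. The whole project is available here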

package thecodinglog.string;

public class StringHelper {

public static String substrb2(String str, Number beginByte) {
    return substrb2(str, beginByte, null, null, null);
}

public static String substrb2(String str, Number beginByte, Number byteLength) {
    return substrb2(str, beginByte, byteLength, null, null);
}

/**
 * Returns the substring of the String.
 * It returns a string as specified length and byte position.
 * You can pad characters left or right when there is a specified length.
 * It distinguishes between 1 byte character and 2 byte character and returns it exactly as specified byte length.
 * If the start position or the specified length causes a 2-byte character to be truncated in the middle,
 * it will be converted to Space.
 * You can specify either left or right padding.
 *
 * If beginByte is 0, it is changed to 1 and processed.
 * If beginByte is less than 0, the string is searched for from right to left.
 * If beginByte or byteLength is a real number, the decimal point is discarded.
 * If you do not specify a length, returns everything from the starting position to the right-end string.
 *
 * Examples:
 * <blockquote><pre>
 *     StringHelper.substrb2("a好호b", 1, 10, null, "|") returns "a好호b||||"
 *     StringHelper.substrb2("ab한글", 4, 2) returns "  "
 *     StringHelper.substrb2("한a글", -3, 2) returns "a "
 *     StringHelper.substrb2("abcde한글이han gul다ykd", 7) returns " 글이han gul다ykd"
 * </pre></blockquote>
 *
 * @param str a string to substring
 * @param beginByte the beginning byte
 * @param byteLength length of bytes
 * @param leftPadding a character for padding. It must be 1 byte character.
 * @param rightPadding a character for padding. It must be 1 byte character.
 * @return a substring
 */
public static String substrb2(String str, Number beginByte, Number byteLength, String leftPadding, String rightPadding) {
    if (str == null || str.equals("")) {
        throw new IllegalArgumentException("The source string can not be an empty string or null.");
    }

    if (leftPadding != null && rightPadding != null) {
        throw new IllegalArgumentException("Left padding, right padding Either of two must be null.");
    }

    if (leftPadding != null) {
        if (leftPadding.length() != 1) {
            throw new IllegalArgumentException("The length of the padding string must be one.");
        }
        if (getByteLengthOfChar(leftPadding.charAt(0)) != 1) {
            throw new IllegalArgumentException("The padding string must be 1 Byte character.");
        }
    }

    if (rightPadding != null) {
        if (rightPadding.length() != 1) {
            throw new IllegalArgumentException("The length of the padding string must be one.");
        }
        if (getByteLengthOfChar(rightPadding.charAt(0)) != 1) {
            throw new IllegalArgumentException("The padding string must be 1 Byte character.");
        }
    }

    int beginPosition = beginByte.intValue();
    if (beginPosition == 0) beginPosition = 1;

    int length;
    if (byteLength != null) {
        length = byteLength.intValue();
        if (length < 0) {
            return null;
        }
    } else {
        length = -1;
    }

    if (length == 0)
        return null;

    boolean beginHalf = false;
    int accByte = 0;
    int startIndex = -1;

    if (beginPosition >= 0) {
        for (int i = 0; i < str.length(); i++) {
            if (beginPosition - 1 == accByte) {
                startIndex = i;
                accByte = accByte + getByteLengthOfChar(str.charAt(i));
                break;
            } else if (beginPosition == accByte) {
                beginHalf = true;
                startIndex = i;
                accByte = accByte + getByteLengthOfChar(str.charAt(i));
                break;
            } else if (accByte + 2 == beginPosition && i == str.length() - 1) {
                beginHalf = true;
                accByte = accByte + getByteLengthOfChar(str.charAt(i));
                break;
            }
            accByte = accByte + getByteLengthOfChar(str.charAt(i));
        }
    } else {
        beginPosition = beginPosition * -1;
        if(length > beginPosition){
            length = beginPosition;
        }

        for (int i = str.length() - 1; i >= 0; i--) {

            accByte = accByte + getByteLengthOfChar(str.charAt(i));

            if (i == str.length() - 1) {
                if (getByteLengthOfChar(str.charAt(i)) == 1) {
                    if (beginPosition == accByte) {
                        startIndex = i;
                        break;
                    }
                } else {
                    if (beginPosition == accByte) {
                        if (length > 1) {
                            startIndex = i;
                            break;
                        } else {
                            beginHalf = true;
                            break;
                        }
                    }else if(beginPosition == accByte - 1){
                        if(length == 1){
                            beginHalf = true;
                            break;
                        }
                    }
                }
            } else {
                if (getByteLengthOfChar(str.charAt(i)) == 1) {
                    if (beginPosition == accByte) {
                        startIndex = i;
                        break;
                    }
                } else {
                    if (beginPosition == accByte) {
                        if (length > 1) {
                            startIndex = i;
                            break;
                        } else {
                            beginHalf = true;
                            break;
                        }

                    } else if(beginPosition == accByte - 1) {
                        if(length > 1){
                            startIndex = i + 1;
                        }
                        beginHalf = true;
                        break;

                    }
                }

            }
        }
    }


    if (accByte < beginPosition) {
        throw new IndexOutOfBoundsException("The start position is larger than the length of the original string.");
    }


    StringBuilder stringBuilder = new StringBuilder();
    int accSubstrLength = 0;

    if (beginHalf) {
        stringBuilder.append(" ");
        accSubstrLength++;
    }


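    if (byteLength == null) {
        // startIndex stays -1 when the start falls on the second byte of the last character
        if (startIndex >= 0) {
            stringBuilder.append(str.substring(startIndex));
        }
        return new String(stringBuilder);
    }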


    for (int i = startIndex; i < str.length() && startIndex >= 0; i++) {
        accSubstrLength = accSubstrLength + getByteLengthOfChar(str.charAt(i));
        if (accSubstrLength == length) {
            stringBuilder.append(str.charAt(i));
            break;
        } else if (accSubstrLength - 1 == length) {
                stringBuilder.append(" ");
            break;
        } else if (accSubstrLength - 1 > length) {

            break;
        }
        stringBuilder.append(str.charAt(i));
    }

    if (leftPadding != null) {
        int diffLength = byteLength.intValue() - accSubstrLength;
        StringBuilder padding = new StringBuilder();
        for (int i = 0; i < diffLength; i++) {
            padding.append(leftPadding);
        }
        stringBuilder.insert(0, padding);
    }

    if (rightPadding != null) {
        int diffLength = byteLength.intValue() - accSubstrLength;
        StringBuilder padding = new StringBuilder();
        for (int i = 0; i < diffLength; i++) {
            padding.append(rightPadding);
        }
        stringBuilder.append(padding);
    }


    return new String(stringBuilder);
}

private static int getByteLengthOfChar(char c) {
    if ((int) c < 128) {
        return 1;
    } else {
        return 2;
    }
}
}

The code for my new attempt is:

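import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.util.Arrays;

String testData = "한글이가득";

Charset charset = Charset.forName("EUC-KR");
ByteBuffer byteBuffer = charset.encode(testData);

// Take 4 bytes starting at byte index 1, i.e. from the middle of the first character
byte[] newone = Arrays.copyOfRange(byteBuffer.array(), 1, 5);

// Replace anything the decoder cannot decode with a space
CharsetDecoder charsetDecoder = charset.newDecoder()
        .replaceWith(" ")
        .onMalformedInput(CodingErrorAction.REPLACE)
        .onUnmappableCharacter(CodingErrorAction.REPLACE);

CharBuffer charBuffer = charsetDecoder.decode(ByteBuffer.wrap(newone));

System.out.println(charBuffer.toString());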

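I expected “글” rather than “畸邦”. I think the start index has to be a correct decoding position, but I don't think there is a way to let the method know what I want.

Adding an example of the failure: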

index| 0 1    2 3    4 5    6 7    8 9 
Char |   한  |   글  |   이 |   가  |  득   
---- | ---- | ---- | ---- | ---- | ----  
hex  | c7d1 | b1db | c0cc | b0a1 | b5e6   
---- | ---- | ---- | ---- | ---- | ----

Assuming the start index is 1 and the length is 4 bytes, the hex codes of the sub-array are:

index| 0 1    2 3    4 5    6 7    8 9 
Char |   한  |   글  |   이 |   가  |  득   
---- | ---- | ---- | ---- | ---- | ----  
hex  | c7d1 | b1db | c0cc | b0a1 | b5e6   
---- | ---- | ---- | ---- | ---- | ----
sub  |   d1 | b1db | c0

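When the decoder decodes d1b1dbc0, it treats d1b1 as one character and dbc0 as another. This may vary by charset, but in this case the characters change. Unless the decoder knows the byte boundaries of the original characters, it decodes to the wrong characters, because the bytes themselves do not say where a character starts.

I think the key to this method is how to let the decoder know the start position (in bytes) of the original characters.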

2 Answers:

Answer 0: (score: 1)

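It is easier to convert the whole String to a byte[] and cut the array. Then try to convert the array fragment back to a String. If the conversion fails, skip the last byte of the fragment.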
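A minimal sketch of this idea, under some assumptions: the class name `ByteCut` and the helper `cutToBytes` are hypothetical, and "conversion fails" is detected by configuring a `CharsetDecoder` with `CodingErrorAction.REPORT` so that it throws instead of substituting a replacement character.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class ByteCut {

    // Hypothetical helper: keeps at most maxBytes bytes from the start of text,
    // dropping the trailing byte when the cut lands inside a character.
    static String cutToBytes(String text, Charset charset, int maxBytes) {
        byte[] all = text.getBytes(charset);
        int end = Math.min(maxBytes, all.length);
        while (end > 0) {
            // REPORT makes the decoder throw instead of silently substituting.
            CharsetDecoder strict = charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            try {
                return strict.decode(ByteBuffer.wrap(all, 0, end)).toString();
            } catch (CharacterCodingException e) {
                end--; // the fragment ended mid-character: skip its last byte
            }
        }
        return "";
    }

    public static void main(String[] args) {
        Charset eucKr = Charset.forName("EUC-KR");
        System.out.println(cutToBytes("한글이가득", eucKr, 5)); // prints 한글
        System.out.println(cutToBytes("ab한글", eucKr, 3));     // prints ab
    }
}
```

Note that this only repairs the end of the fragment. It assumes the cut starts on a character boundary, so it does not address a start position that falls in the middle of a character.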

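Answer 1: (score: 1)

There is an NIO way.

With CharsetEncoder#encode, you can encode a string (or rather a CharBuffer, but the conversion is trivial) into a byte array (actually a ByteBuffer). As many characters of the input as possible are converted, until the input is fully processed, but the output is never overflowed.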

  

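CoderResult.OVERFLOW indicates that there is insufficient space in the output buffer to encode any more characters. This method should be invoked again with an output buffer that has more remaining bytes. This is typically done by draining any encoded bytes from the output buffer.

Following your edit, here is an example (although I am still not sure what you are trying to accomplish, this is my best guess), with your string 한글이가득 in the EUC-KR encoding.

First, let's see what the byte array representation of each character is: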

Char |   한 |   글  |   이 |  가   |  득 
---- | ---- | ---- | ---- | ---- | ----
hex  | c7d1 | b1db | c0cc | b0a1 | b5e6 

So the whole string needs 10 bytes to be written.
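The table and the byte count above can be reproduced with a small sketch like the following. The class name `ByteLengths` and the helper `hex` are hypothetical, and the JVM is assumed to provide the EUC-KR charset.

```java
import java.nio.charset.Charset;

public class ByteLengths {

    // Formats the encoded bytes of text as lowercase hex, e.g. "c7d1".
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Charset eucKr = Charset.forName("EUC-KR");
        String testData = "한글이가득";
        // One line per character: the character and its encoded bytes
        for (int i = 0; i < testData.length(); i++) {
            String ch = String.valueOf(testData.charAt(i));
            System.out.println(ch + " -> " + hex(ch, eucKr));
        }
        System.out.println(testData.getBytes(eucKr).length); // prints 10
    }
}
```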

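Now, suppose our message length is 9 bytes. That allows us to send 한글이가 (8 bytes), which is 0xc7d1b1dbc0ccb0a1, but there is not enough space to send 득 (0xb5e6 needs 2 bytes and we only have one left), so the rest of the buffer should be blank.

Indeed: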

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;

String testData = "한글이가득";
Charset charset = Charset.forName("EUC-KR");
CharsetEncoder encoder = charset.newEncoder();
// We create a 9-byte buffer
ByteBuffer limitedSizeOutput = ByteBuffer.allocate(9);
// We encode
CoderResult coderResult = encoder.encode(CharBuffer.wrap(testData.toCharArray()), limitedSizeOutput, true);
// The encoder tells us that it could not fit all the chars in 9 bytes
System.out.println(coderResult); // prints OVERFLOW
// We can check that it encoded 8 bytes out of the 10 that compose the original string data
limitedSizeOutput.flip();
System.out.println(limitedSizeOutput.limit()); // prints 8
// We can see that these are in effect 한글이가 by reading the buffer
System.out.println(charset.newDecoder().decode(limitedSizeOutput).toString());
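The same behaviour can be wrapped into a helper that always returns a field of the agreed byte length. This is only a sketch under assumptions: the class `FixedLengthField` and the method `toFixedLength` are hypothetical names, and filling the unused tail with ASCII spaces is one possible reading of "the rest of the buffer should be blank".

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.util.Arrays;

public class FixedLengthField {

    // Hypothetical helper: encodes as many whole characters as fit into
    // fieldLength bytes and fills the unused tail with ASCII spaces.
    static byte[] toFixedLength(String text, Charset charset, int fieldLength) {
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer out = ByteBuffer.allocate(fieldLength);
        // Stops with OVERFLOW before writing a character that does not fit entirely.
        CoderResult result = encoder.encode(CharBuffer.wrap(text), out, true);
        if (result.isUnderflow()) {
            encoder.flush(out); // all input consumed: let stateful encoders finish
        }
        byte[] field = out.array();
        Arrays.fill(field, out.position(), fieldLength, (byte) ' ');
        return field;
    }

    public static void main(String[] args) {
        Charset eucKr = Charset.forName("EUC-KR");
        byte[] field = toFixedLength("한글이가득", eucKr, 9);
        System.out.println(field.length);                     // prints 9
        System.out.println("[" + new String(field, eucKr) + "]"); // prints [한글이가 ]
    }
}
```

Because the encoder only ever writes whole characters, the cut never lands inside a 2-byte sequence, and there is no need to scan the string manually to find the boundary.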