Question

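I am trying to write a simple program for this interview question:

Write a function that checks for a valid unicode byte sequence. A unicode sequence is encoded as:
- the first byte indicates the number of subsequent bytes; '11110000' means 4 subsequent data bytes
- data bytes start with '10xxxxxx'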

public static void main(String[] args)
{
    System.out.println(checkUnicode(new byte[] {(byte)'c'}));
}

    /**
     * Write a function that checks for valid unicode byte sequence. A unicode
     * sequence is encoded as: - first byte indicates number of subsequent bytes
     * '11110000' means 4 subsequent data bytes - data bytes start with a
     * '10xxxxxx'
     * 
     * @param unicodeChar
     * @return
     */
 public static boolean checkUnicode(byte[] unicodeChar)
{
    byte b = unicodeChar[0];
    int len = 0;

    int temp = (int)b<<1;
    while((int)temp<<1 == 0)
    {
        len++;
    }
    System.out.println(len);

    if (unicodeChar.length == len) 
    {
        for(int i = 1 ; i < len; i++)
        {
            // Check if Most significant 2 bits in the byte are '10'
            // c0, in base 16, is 11000000 in binary
            // 10000000, in base 2, is 128 in decimal
            if( ( (int)unicodeChar[i]&0Xc0 )==128 )
            {
                continue;
            }
            else
            {
                return false;
            }
        }
        return true;
    }
    else
    {
        return false;
    }
}

The output I get is   
99
false

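Changed the conversion from char to byte array based on Chris Jester-Young's comment.

Can someone point me in the right direction?

Thanks

Made some modifications based on Ted Hopp's input.

P.S.: I got this question from a forum, and I don't think it was posted correctly there, but I decided to work on it as written rather than muddle it further, since I don't fully understand it!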

Answer 1

Here is an enterprise-grade solution for an enterprise-grade job:

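import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.util.BitSet;

public static void main(String[] args) {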
    if (args.length == 0 || args[0] == null || (args[0] = args[0].trim()).isEmpty()) {
        System.out.println("No argument passed or argument empty!");
        return;
    }

    String arg = args[0];
    System.out.println("arg: " + arg + ", arg len: " + arg.length());

    BitSet bs = new BitSet(arg.length());
    for (int i = 0; i < arg.length(); i++) {
        if (arg.charAt(i) == '1') {
            bs.set(i, true); 
        }
    }
    ByteBuffer bb = ByteBuffer.wrap(bs.toByteArray());
    Charset cs = Charset.forName("UTF-8");
    CharsetDecoder csd =
            cs.newDecoder().onMalformedInput(CodingErrorAction.REPORT).
            onUnmappableCharacter(CodingErrorAction.REPORT)
            ;

    try {
        CharBuffer cb = csd.decode(bb);
        String uns = cb.toString();
        System.out.println("Got unicode string of len " + uns.length() + ": " + uns + " from " + arg + " -- no errors!");
    } catch (CharacterCodingException cce) {
        System.out.println("Invalid UTF-8 unicode string! " + cce.getMessage());
    }
}

Test:

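public static void test() {
    StringBuilder sb = new StringBuilder();
    // name the charset explicitly instead of relying on the platform default
    byte[] byt = "stupid interview".getBytes(Charset.forName("UTF-8"));
    BitSet byt1 = fromByteArray(byt);
    // size() is the number of bits of storage in use, so the trailing zero bits are kept
    for (int i = 0; i < byt1.size(); i++) {
        sb.append(byt1.get(i) ? "1" : "0");
    }
    String[] st = new String[1];
    st[0] = sb.toString();
    main(st);
}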

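public static BitSet fromByteArray(byte[] bytes) {
    BitSet bits = new BitSet();
    for (int i = 0; i < bytes.length * 8; i++) {
        // bit i lives in byte i/8, least significant bit first; this matches
        // the layout BitSet.toByteArray() produces in main()
        if ((bytes[i / 8] & (1 << (i % 8))) != 0) {
            bits.set(i);
        }
    }
    return bits;
}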

Output:

11001110001011101010111000001110100101100010011000000100100101100111011000101110101001100100111001101110100101101010011011101110
arg: 11001110001011101010111000001110100101100010011000000100100101100111011000101110101001100100111001101110100101101010011011101110, arg len: 128
{0, 1, 4, 5, 6, 10, 12, 13, 14, 16, 18, 20, 21, 22, 28, 29, 30, 32, 35, 37, 38, 42, 45, 46, 53, 56, 59, 61, 62, 65, 66, 67, 69, 70, 74, 76, 77, 78, 80, 82, 85, 86, 89, 92, 93, 94, 97, 98, 100, 101, 102, 104, 107, 109, 110, 112, 114, 117, 118, 120, 121, 122, 124, 125, 126}
Got unicode string of len 16: stupid interview from 11001110001011101010111000001110100101100010011000000100100101100111011000101110101001100100111001101110100101101010011011101110 -- no errors!

Answer 2

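First, the description of UTF-8 given in the question is wrong. For one thing, there is no such thing as a "valid Unicode byte sequence" without specifying an encoding; a safe assumption is that they mean UTF-8. For another (and more importantly), 11110000 does not indicate 4 further data bytes. The four '1' bits before the first '0' bit indicate 4 bytes in total (that is, 3 subsequent bytes, not 4, each starting with '10'). These rules are described in detail in the Wikipedia article on UTF-8.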
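To make the lead-byte rule concrete, here is a minimal sketch that maps a lead byte to the total sequence length it announces. The class and method names are made up for illustration, and the method only looks at the bit pattern; it does not reject lead bytes such as 0xC0 or 0xF5 that real UTF-8 also forbids.

```java
public class Utf8LeadByte {
    /**
     * Returns the total length in bytes (lead byte included) announced by a
     * UTF-8 lead byte, or -1 if the byte cannot start a sequence.
     * Hypothetical helper, bit-pattern check only.
     */
    static int sequenceLength(byte lead) {
        int b = lead & 0xFF;              // treat the byte as unsigned
        if ((b & 0x80) == 0x00) return 1; // 0xxxxxxx: ASCII, no continuation bytes
        if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx: 1 continuation byte
        if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx: 2 continuation bytes
        if ((b & 0xF8) == 0xF0) return 4; // 11110xxx: 3 continuation bytes
        return -1;                        // 10xxxxxx (continuation) or 11111xxx
    }

    public static void main(String[] args) {
        System.out.println(sequenceLength((byte) 0b11110000)); // 4
        System.out.println(sequenceLength((byte) 'c'));        // 1
        System.out.println(sequenceLength((byte) 0b10000000)); // -1
    }
}
```

So the question's example byte 11110000 announces 4 bytes in total, which leaves 3 continuation bytes.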

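Second, converting the character to a string and calling getBytes is a fine approach, but you need to pass the encoding as an argument to getBytes. (For the character 'c', however, it makes no difference.)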
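As a small illustration of passing the encoding explicitly, the sketch below uses `StandardCharsets.UTF_8` (available since Java 7); the class and helper names are made up for this example. It shows that 'c' encodes to a single byte while the euro sign needs three.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class GetBytesDemo {
    /** Encodes one char as UTF-8 with the charset named explicitly (hypothetical helper). */
    static byte[] utf8Bytes(char c) {
        return String.valueOf(c).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // ASCII characters encode to a single byte
        System.out.println(Arrays.toString(utf8Bytes('c')));      // [99]
        // U+20AC (euro sign) encodes to 0xE2 0x82 0xAC, printed as signed bytes
        System.out.println(Arrays.toString(utf8Bytes('\u20AC'))); // [-30, -126, -84]
    }
}
```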

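I don't know what you are trying to do in your code, but you need to count how many '1' bits come before the first '0' bit. Your code does nothing of the sort.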
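One possible way to do that counting, as a rough sketch rather than a full validator: scan the lead byte from the most significant bit, derive the total length, then check that every remaining byte starts with '10'. The class and method names are invented for illustration, and the check is purely structural, so overlong encodings, surrogates, and code points above U+10FFFF are assumed to be out of scope.

```java
public class Utf8StructureCheck {
    /** Counts the '1' bits that precede the first '0' bit, starting at the most significant bit. */
    static int leadingOnes(byte value) {
        int count = 0;
        for (int mask = 0x80; mask != 0 && (value & mask) != 0; mask >>= 1) {
            count++;
        }
        return count;
    }

    /** Structural check that the array holds exactly one UTF-8 encoded character (hypothetical name). */
    static boolean looksLikeOneUtf8Char(byte[] bytes) {
        if (bytes == null || bytes.length == 0) {
            return false;
        }
        int ones = leadingOnes(bytes[0]);
        int total;
        if (ones == 0) {
            total = 1;              // 0xxxxxxx: single-byte character
        } else if (ones >= 2 && ones <= 4) {
            total = ones;           // the count of leading ones is the TOTAL length
        } else {
            return false;           // 10xxxxxx or more than four leading ones
        }
        if (bytes.length != total) {
            return false;
        }
        for (int i = 1; i < total; i++) {
            // every continuation byte must match 10xxxxxx
            if ((bytes[i] & 0xC0) != 0x80) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeOneUtf8Char(new byte[] {(byte) 'c'}));
        System.out.println(looksLikeOneUtf8Char(new byte[] {(byte) 0xe2, (byte) 0x82, (byte) 0xac}));
        System.out.println(looksLikeOneUtf8Char(new byte[] {(byte) 0xe2, (byte) 0x82}));
    }
}
```

The three calls in `main` should report true, true, and false: a lone ASCII byte, a complete three-byte sequence, and a truncated one.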

Update: I wouldn't actually try to analyze the bit structure myself. I would just feed the bytes to a CharsetDecoder and see whether it chokes:

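import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public static boolean checkUnicode(byte[] unicodeChar)
{
    try {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
        // test only for malformed input, ignore unknown Unicode characters
        decoder.onUnmappableCharacter(CodingErrorAction.IGNORE);
        decoder.onMalformedInput(CodingErrorAction.REPORT);
        decoder.decode(ByteBuffer.wrap(unicodeChar));
        return true;
    }
    catch (CharacterCodingException ex)
    {
        // decode() declares the checked CharacterCodingException;
        // MalformedInputException is a subclass of it
        return false;
    }
}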

Answer 3

As for how to convert your characters to bytes, you can simply cast:

byte[] b = new byte[] {(byte) 0xe2, (byte) 0x82, (byte) 0xac};

Or, as shorthand:

byte[] b = {(byte) 0xe2, (byte) 0x82, (byte) 0xac};
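The casts are needed because literals such as 0xe2 are ints outside the range of a signed byte. As a quick sanity check, the sketch below (class and helper names invented for this example) decodes those three bytes as UTF-8 and prints the resulting code point, which is U+20AC, the euro sign.

```java
import java.nio.charset.StandardCharsets;

public class EuroBytesDemo {
    /** Decodes a byte array as UTF-8 (hypothetical helper). */
    static String decode(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] b = {(byte) 0xe2, (byte) 0x82, (byte) 0xac};
        String s = decode(b);
        // one char whose code point is 0x20AC
        System.out.println(s.length() + " char, U+" + Integer.toHexString(s.codePointAt(0)));
    }
}
```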

Answer 4

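You can use Character.toCodePoint() (which combines a high and a low surrogate into one code point) to get an int; going from that int to bytes should then be easy.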
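A sketch of that route, assuming UTF-8 is the target encoding: `Character.toCodePoint(char high, char low)` yields the int, and one simple way to get the bytes is to go back through a String with `Character.toChars`. The class and the `utf8Of` helper are made up for illustration. Note that a plain `(byte)` cast of the int would keep only its low 8 bits, which is not the UTF-8 encoding.

```java
import java.nio.charset.StandardCharsets;

public class CodePointBytesDemo {
    /** Encodes a single code point as UTF-8 (hypothetical helper). */
    static byte[] utf8Of(int codePoint) {
        return new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // surrogate pair D83D DE00 combines to code point U+1F600
        int cp = Character.toCodePoint('\uD83D', '\uDE00');
        StringBuilder hex = new StringBuilder();
        for (byte b : utf8Of(cp)) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(Integer.toHexString(cp) + " -> " + hex.toString().trim());
        // prints: 1f600 -> F0 9F 98 80
    }
}
```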

Unicode byte sequence / converting char to bytes array
