Question

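I have a function 'remove4BytesUTF8Char()' that removes some unusual characters that show up in social media text, but it does not do the job here. I can remove many other characters, but not this one. How do I get rid of this one specifically in my String?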

        String str = "very good\uE056 flavor";
        System.out.println("str before remove: " + str);
        str = UTF8Utils.remove4BytesUTF8Char(str);
        System.out.println("str after remove " + str);

The output is as follows:

str before remove: very good flavor
str after remove very good flavor

Edit:

public static String remove4BytesUTF8Char(String s) {
        byte[] bytes = s.getBytes();
        byte[] removedBytes = new byte[bytes.length];
        int index = 0;

        String hex;
        String firstChar;
        for (int i = 0; i < bytes.length; ) {
            hex = UTF8Utils.byteToHex(bytes[i]);

            if (hex.length() < 2) {
                System.out.println("fail to check whether contains 4 bytes char(1 byte hex char too short), default return false.");
                // todo, throw exception for this case
                return null;
            }

            firstChar = hex.substring(0, 1);

            if (byteMap.get(firstChar) == null) {
                System.out.println("fail to check whether contains 4 bytes char(no firstchar mapping), default return false.");
                // todo, throw exception for this case
                return null;
            }

            if (firstChar.equals("f")) {
                for (int j = 0; j < byteMap.get(firstChar); j++) {
                    i++;
                }
                continue;
            }

            for (int j = 0; j < byteMap.get(firstChar); j++) {
                removedBytes[index++] = bytes[i++];
            }
        }

        return new String(Arrays.copyOfRange(removedBytes, 0, index));
    }

Answer 1

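You can treat the String as an array of char and check whether each char is greater than 127. That is the largest ASCII value, so anything higher is a non-ASCII character. Note that this strips every non-ASCII character, not only the one in the question.

    public static void main(String... args) {
        String str = "very good\uE056 flavor";
        System.out.println(str);
        System.out.println(removeUTF8(str));
    }

    public static String removeUTF8(String s) {
        // Copy only ASCII chars (0-127) into a new buffer. Building a new
        // string avoids skipping chars, which happens when the string is
        // shortened while it is being iterated.
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char targetChar = s.charAt(i);
            if (targetChar <= 127) {
                sb.append(targetChar);
            }
        }
        return sb.toString();
    }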

Answer 2

char, Character, and String all use the UTF-16 encoding of Unicode. Each code point is encoded in one or two code units (char); two are used for code points >= U+10000. See Clause D91.

UTF-8 is another encoding of Unicode. Each code point is encoded in one, two, three, or four code units (bytes, once serialized); four are used for code points >= U+10000. See Table 3-7.

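So, if you want to filter out the code points that UTF-8 encodes in 4 bytes, that is the same as filtering out the code points that UTF-16 encodes in 2 chars.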
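To make that equivalence concrete, here is a small sketch that counts the UTF-16 code units and UTF-8 bytes for a few sample code points. The class and method names (EncodingLengths, utf16Units, utf8Bytes) are made up for illustration; only standard JDK calls are used.

```java
import java.nio.charset.StandardCharsets;

class EncodingLengths {
    // Number of UTF-16 code units (Java chars) needed for one code point.
    static int utf16Units(int codePoint) {
        return Character.charCount(codePoint);
    }

    // Number of bytes UTF-8 needs for one code point.
    static int utf8Bytes(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Sample code points: ASCII, Latin-1, a BMP symbol, the character
        // from the question, and a supplementary code point.
        int[] samples = {0x41, 0xE9, 0x20AC, 0xE056, 0x1F600};
        for (int cp : samples) {
            System.out.printf("U+%04X: %d char(s), %d byte(s)%n",
                    cp, utf16Units(cp), utf8Bytes(cp));
        }
    }
}
```

Only the last sample (U+1F600) needs 2 chars and 4 bytes. U+E056 needs 1 char and 3 bytes, which is why a filter aimed at 4-byte sequences leaves it alone.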

Now, for any code point that UTF-16 encodes in 2 chars, both chars are always in the range '\uD800' to '\uDFFF'. (They correspond to the surrogate code points, which are reserved to prevent confusion.)
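A quick way to see this for yourself, as a sketch: encode a code point with Character.toChars and check each resulting char against the range. SurrogateDemo and encodedWithSurrogates are hypothetical names used only here.

```java
class SurrogateDemo {
    // True if every char in the UTF-16 encoding of codePoint lies in
    // the surrogate range '\uD800'..'\uDFFF'.
    static boolean encodedWithSurrogates(int codePoint) {
        for (char c : Character.toChars(codePoint)) {
            if (c < '\uD800' || c > '\uDFFF') {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // U+1F600 is a supplementary code point, U+E056 and U+0041 are not.
        for (int cp : new int[] {0x1F600, 0xE056, 0x41}) {
            StringBuilder units = new StringBuilder();
            for (char c : Character.toChars(cp)) {
                units.append(String.format("%04X ", (int) c));
            }
            System.out.printf("U+%04X -> %s surrogates only: %b%n",
                    cp, units, encodedWithSurrogates(cp));
        }
    }
}
```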

OK, so that is what your remove4BytesUTF8Char function has to deal with. However, '\uE056' is not a UTF-16 code unit of a code point that UTF-8 encodes in 4 bytes. It is in the Unicode Private Use Area block: U+E000 to U+F8FF ('\uE000' to '\uF8FF'). So you have to filter those out separately.
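If you want to confirm how Java classifies a given char before writing the filter, something like the following should do. PrivateUseCheck and isPrivateUse are names I made up; Character.getType, Character.PRIVATE_USE and Character.isSurrogate are the real JDK members.

```java
class PrivateUseCheck {
    // True if the char has the Unicode general category "Co" (private use).
    static boolean isPrivateUse(char c) {
        return Character.getType(c) == Character.PRIVATE_USE;
    }

    public static void main(String[] args) {
        char c = '\uE056';
        System.out.println("surrogate:   " + Character.isSurrogate(c));
        System.out.println("private use: " + isPrivateUse(c));
    }
}
```

For '\uE056' the surrogate check is false and the private-use check is true, so a filter that only looks for surrogates (or 4-byte UTF-8 sequences) will never catch it.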

String input = "very good\uE056 flavor ";
System.out.println(input);
String output = input.chars() // IntStream of UTF-16 code units
    .filter(c -> !Character.isSurrogate((char)c) 
                 && Character.getType((char)c) != Character.PRIVATE_USE)
    .collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append)
    .toString(); 
System.out.println(output);
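As an alternative sketch, the same filtering can be written as a regular expression, since java.util.regex matches by code point. This is not part of the code above, and RegexFilter and strip are hypothetical names. The pattern drops supplementary code points (the ones UTF-8 encodes in 4 bytes) and anything in general category Co (private use).

```java
import java.util.regex.Pattern;

class RegexFilter {
    // \x{10000}-\x{10FFFF}: supplementary code points (4 bytes in UTF-8).
    // \p{Co}: Unicode general category "private use".
    private static final Pattern UNWANTED =
            Pattern.compile("[\\x{10000}-\\x{10FFFF}\\p{Co}]");

    static String strip(String s) {
        return UNWANTED.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "very good\uE056 flavor \uD83D\uDE00";
        System.out.println(strip(input));
    }
}
```

Unlike the "greater than 127" approach, this keeps ordinary non-ASCII text such as accented letters or CJK characters.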

How do I remove this non-standard Unicode character?
