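How do I remove this non-standard Unicode character?

Time: 2017-11-07 00:46:09

Tags: java unicode

I have a function 'remove4BytesUTF8Char()' that removes some unusual characters that show up in social media text, but it does not do the job here. I can remove many other characters, but not this one. How can I get rid of this specific character in my String?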

        String str = "very good\uE056 flavor";
        System.out.println("str before remove: " + str);
        str = UTF8Utils.remove4BytesUTF8Char(str);
        System.out.println("str after remove " + str);

The output is as follows:

str before remove: very good flavor
str after remove very good flavor

Edit:

public static String remove4BytesUTF8Char(String s) {
        byte[] bytes = s.getBytes();
        byte[] removedBytes = new byte[bytes.length];
        int index = 0;

        String hex;
        String firstChar;
        for (int i = 0; i < bytes.length; ) {
            hex = UTF8Utils.byteToHex(bytes[i]);

            if (hex.length() < 2) {
                System.out.println("fail to check whether contains 4 bytes char(1 byte hex char too short), default return false.");
                // todo, throw exception for this case
                return null;
            }

            firstChar = hex.substring(0, 1);

            if (byteMap.get(firstChar) == null) {
                System.out.println("fail to check whether contains 4 bytes char(no firstchar mapping), default return false.");
                // todo, throw exception for this case
                return null;
            }

            if (firstChar.equals("f")) {
                for (int j = 0; j < byteMap.get(firstChar); j++) {
                    i++;
                }
                continue;
            }

            for (int j = 0; j < byteMap.get(firstChar); j++) {
                removedBytes[index++] = bytes[i++];
            }
        }

        return new String(Arrays.copyOfRange(removedBytes, 0, index));
    }

2 answers:

Answer 0 (score: 0)

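You can treat the String as an array of char and check whether each char is greater than 127. That is the largest value for ASCII, so anything higher is a non-ASCII character. Note that this removes every non-ASCII character, not only this one.

    public static void main(String... args) {
        String str = "very good\uE056 flavor";
        System.out.println(str);
        System.out.println(removeUTF8(str));
    }

    public static String removeUTF8(String s) {
        // Keep only ASCII chars (0..127). Building a new string avoids
        // skipping characters, which happens if s is modified mid-loop.
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c <= 127) {
                sb.append(c);
            }
        }
        return sb.toString();
    }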

Answer 1 (score: 0)

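All char, Character and String values use the UTF-16 encoding of Unicode. Each code point is encoded in one or two code units (char): two for code points >= U+10000. (Unicode Standard, Clause D91)

UTF-8 is another encoding of Unicode. Each code point is encoded in one, two, three or four code units (bytes, when serialized): four for code points >= U+10000. (Unicode Standard, Table 3-7)

So, if you want to filter out the code points that UTF-8 encodes in 4 bytes, that is the same as filtering out the code points that UTF-16 encodes in 2 chars.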

Now, for any code point that UTF-16 encodes in 2 chars, both chars are always in the range '\uD800' to '\uDFFF'. (They correspond to the surrogate code points, which are reserved to prevent confusion.)
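
To make the correspondence concrete, here is a minimal sketch that prints how many UTF-8 bytes and how many UTF-16 chars a few sample code points take. The class name `EncodingLengthDemo`, its helper methods, and the sample code points are my own choices for illustration; they are not from the question.

```java
import java.nio.charset.StandardCharsets;

class EncodingLengthDemo {
    // Number of UTF-8 bytes needed to encode one code point.
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of UTF-16 code units (Java chars) needed for one code point.
    static int utf16Length(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        // Illustrative samples: 'A', 'é', '€', and an emoji outside the BMP.
        int[] samples = {0x41, 0xE9, 0x20AC, 0x1F600};
        for (int cp : samples) {
            System.out.printf("U+%04X: %d UTF-8 byte(s), %d UTF-16 char(s)%n",
                    cp, utf8Length(cp), utf16Length(cp));
        }
    }
}
```

Only the last sample (U+1F600) needs 4 bytes in UTF-8, and it is also the only one that needs 2 chars in UTF-16.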

OK, that is what your remove4BytesUTF8Char function is meant to handle. But '\uE056' is not, in fact, a UTF-16 code unit of a Unicode code point that UTF-8 encodes in 4 bytes. It lies in the Unicode Private Use Area block: U+E000 to U+F8FF ('\uE000' to '\uF8FF'). So you have to filter those out separately.
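
As a quick check of this claim, the sketch below dumps the UTF-8 bytes of '\uE056' and asks `Character` how it classifies the char. The class name `PrivateUseCheck` and the helper `utf8Hex` are hypothetical names introduced here for illustration.

```java
import java.nio.charset.StandardCharsets;

class PrivateUseCheck {
    // Space-separated hex dump of the UTF-8 bytes of a string.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        char c = '\uE056';
        System.out.println("UTF-8 bytes: " + utf8Hex(String.valueOf(c)));  // ee 81 96
        System.out.println("surrogate:   " + Character.isSurrogate(c));    // false
        System.out.println("private use: "
                + (Character.getType(c) == Character.PRIVATE_USE));        // true
    }
}
```

The character encodes to three bytes, and the lead byte 0xEE starts with the hex digit "e", so the `firstChar.equals("f")` branch in the question's function is never taken for it.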

String input = "very good\uE056 flavor ";
System.out.println(input);
String output = input.chars() // IntStream of UTF-16 code units
    .filter(c -> !Character.isSurrogate((char)c) 
                 && Character.getType((char)c) != Character.PRIVATE_USE)
    .collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append)
    .toString(); 
System.out.println(output);
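
For comparison, the same filter can be written over code points rather than UTF-16 code units. This is only a sketch of one possible variant, not part of the original answer; the class name `CodePointFilter` and the method `strip` are assumptions made for illustration.

```java
class CodePointFilter {
    // Drops code points that UTF-8 would encode in 4 bytes (supplementary
    // code points), private-use code points, and unpaired surrogates.
    static String strip(String input) {
        return input.codePoints() // IntStream of code points, not chars
            .filter(cp -> !Character.isSupplementaryCodePoint(cp)
                       && Character.getType(cp) != Character.PRIVATE_USE
                       && Character.getType(cp) != Character.SURROGATE)
            .collect(StringBuilder::new,
                     StringBuilder::appendCodePoint,
                     StringBuilder::append)
            .toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("very good\uE056 flavor"));
    }
}
```

Working on code points avoids the `(char)` casts, and the explicit `SURROGATE` check keeps the behaviour for unpaired surrogates in line with the `isSurrogate` test used on code units.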