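I have a function, remove4BytesUTF8Char(), that removes some unusual characters that appear in social media text, but it does not work for this one. I can remove many other characters, but not this one. How can I specifically get rid of it in my String?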
String str = "very good\uE056 flavor";
System.out.println("str before remove: " + str);
str = UTF8Utils.remove4BytesUTF8Char(str);
System.out.println("str after remove " + str);
The output is as follows:
str before remove: very good flavor
str after remove very good flavor
Edit:
public static String remove4BytesUTF8Char(String s) {
    byte[] bytes = s.getBytes();
    byte[] removedBytes = new byte[bytes.length];
    int index = 0;
    String hex;
    String firstChar;
    for (int i = 0; i < bytes.length; ) {
        hex = UTF8Utils.byteToHex(bytes[i]);
        if (hex.length() < 2) {
            System.out.println("fail to check whether contains 4 bytes char(1 byte hex char too short), default return false.");
            // todo, throw exception for this case
            return null;
        }
        firstChar = hex.substring(0, 1);
        if (byteMap.get(firstChar) == null) {
            System.out.println("fail to check whether contains 4 bytes char(no firstchar mapping), default return false.");
            // todo, throw exception for this case
            return null;
        }
        if (firstChar.equals("f")) {
            for (int j = 0; j < byteMap.get(firstChar); j++) {
                i++;
            }
            continue;
        }
        for (int j = 0; j < byteMap.get(firstChar); j++) {
            removedBytes[index++] = bytes[i++];
        }
    }
    return new String(Arrays.copyOfRange(removedBytes, 0, index));
}
Answer 0 (score: 0)
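You can treat the String as an array of char and check whether each char is greater than 127. That is the largest value for ASCII, so anything higher is a non-ASCII character:
public static void main(String... args) {
    String str = "very good\uE056 flavor";
    System.out.println(str);
    System.out.println(removeUTF8(str));
}
public static String removeUTF8(String s) {
    // Build a new string instead of modifying s while iterating over it,
    // which would skip the char that follows each removed one.
    StringBuilder sb = new StringBuilder(s.length());
    for (int i = 0; i < s.length(); i++) {
        char targetChar = s.charAt(i);
        if (targetChar <= 127) { // keep ASCII chars only
            sb.append(targetChar);
        }
    }
    return sb.toString();
}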
Answer 1 (score: 0)
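All char, Character, and String values use the UTF-16 encoding of Unicode. Each code point is encoded in one or two code units (char); two are used for code points >= U+10000 (see Clause D91 of the Unicode Standard).
UTF-8 is another encoding of Unicode. Each code point is encoded in one, two, three, or four code units (bytes, when serialized); four are used for code points >= U+10000 (see Table 3-7 of the Unicode Standard).
So, if you want to filter out the code points that UTF-8 encodes in 4 bytes, that is the same as filtering out the code points that UTF-16 encodes in 2 chars.
Now, both chars of any code point that UTF-16 encodes in 2 chars are always in the range '\uD800' to '\uDFFF'. (They correspond to the surrogate code points, which are reserved to prevent confusion.)
OK, that is what your remove4BytesUTF8Char function is meant to handle. However, '\uE056' is not a UTF-16 code unit of a code point that UTF-8 encodes in 4 bytes. It is in the Unicode Private Use Area block: U+E000 to U+F8FF ('\uE000' to '\uF8FF'). So you have to filter those out separately.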
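The correspondence between 4-byte UTF-8 sequences and 2-char UTF-16 sequences can be checked directly. This is only a minimal sketch: the class name EncodingLengths, its two helper methods, and the sample code points are chosen for illustration and are not part of the question's code.

```java
import java.nio.charset.StandardCharsets;

public class EncodingLengths {
    // Number of bytes UTF-8 uses to encode a single code point.
    public static int utf8Bytes(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of UTF-16 code units (Java chars) used for a single code point.
    public static int utf16Units(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        // Illustrative samples: 'A', 'é', '€', the first supplementary
        // code point, and an emoji.
        int[] samples = {0x41, 0xE9, 0x20AC, 0x10000, 0x1F600};
        for (int cp : samples) {
            System.out.printf("U+%04X: %d UTF-8 byte(s), %d UTF-16 char(s)%n",
                    cp, utf8Bytes(cp), utf16Units(cp));
        }
    }
}
```

Only the last two samples, both >= U+10000, report 4 UTF-8 bytes, and they are also the only ones that need 2 chars.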
String input = "very good\uE056 flavor ";
System.out.println(input);
String output = input.chars() // IntStream of UTF-16 code units
        .filter(c -> !Character.isSurrogate((char) c)
                && Character.getType((char) c) != Character.PRIVATE_USE)
        .collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append)
        .toString();
System.out.println(output);
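If it is more convenient to have this as a reusable method, the same idea can be written over code points rather than UTF-16 code units. The following is a sketch under assumptions, not a drop-in replacement for remove4BytesUTF8Char: the class name CodePointFilter and the method name are hypothetical, and dropping unpaired surrogates is an extra choice made here.

```java
public class CodePointFilter {
    // Removes every code point that UTF-8 would encode in 4 bytes
    // (the supplementary code points, >= U+10000), any unpaired
    // surrogate, and any private-use code point such as U+E056.
    public static String stripSupplementaryAndPrivateUse(String input) {
        return input.codePoints() // IntStream of Unicode code points
                .filter(cp -> !Character.isSupplementaryCodePoint(cp))
                .filter(cp -> Character.getType(cp) != Character.SURROGATE)
                .filter(cp -> Character.getType(cp) != Character.PRIVATE_USE)
                .collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        // The emoji (a surrogate pair) is an illustrative addition to the question's input.
        String input = "very good\uE056 flavor \uD83D\uDE00";
        System.out.println(stripSupplementaryAndPrivateUse(input));
    }
}
```

Working on code points means a surrogate pair is seen as one supplementary code point instead of two separate chars, so the intent of each filter is easier to read.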