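I can't solve this problem and it is driving me crazy.
json_encode() fails on a few records (2 or 3) out of a set of 10k with the error: Malformed UTF-8 characters, possibly incorrectly encoded.
This has turned out to be hard to track down.
I can print the same records in a simple HTML debug page from PHP without any problem. But as soon as I try to encode them as JSON, I get the error.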
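The symptom can be reproduced in isolation. The following is a minimal sketch, assuming the broken records contain a stray non-UTF-8 byte; the sample string is made up and is not real data. json_encode() returns false and json_last_error_msg() reports the reason, whereas echoing the same string into an HTML page typically just shows a replacement glyph, which is probably why the debug page looks fine.

```php
<?php
// Hypothetical sample: "\xE9" is "é" in ISO-8859-1, but a lone 0xE9 is not valid UTF-8.
$record = ['name' => "Caf\xE9"];

$json = json_encode($record);

$errorCode = json_last_error();
$errorText = json_last_error_msg();

if ($json === false) {
    // json_last_error() identifies the failure; json_last_error_msg() gives the text.
    echo "json_encode failed: {$errorText}\n";
}
```

Checking the return value against false (rather than relying on the output being empty) is what makes the failing records easy to log and isolate.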
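I found out that these records were imported from a CSV file and may have bypassed the sanitizer. The strange thing is that the whole CSV file is parsed with: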
$this->encoding = mb_detect_encoding($source, mb_detect_order(), true);
if ($this->encoding != "" && $this->encoding != "UTF8") {
    $source = iconv($this->encoding, "UTF-8", $source);
}
For privacy (and GDPR) reasons I can't post any complete broken record. However, I managed to extract a fragment that appears to be broken:
RESIDENCE �PRINCIPE
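Update
I tried to get the byte values of these broken characters. Here is what I found.
Reading the string byte by byte with the native functions str_split and ord, the character is: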
'�' 160
I wanted the UTF-8 code point as well, so I used this helper function from PHP.net, http://php.net/manual/en/function.ord.php#109812, which tries to compute the code point of a multibyte character. It gave me:
-2096
Which is... negative?
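Both numbers make sense once the fragment is treated as raw bytes. 160 is 0xA0, which is a non-breaking space in ISO-8859-1 and Windows-1252, but in UTF-8 the bytes 0x80 to 0xBF are only valid as continuation bytes, so a lone 0xA0 is malformed. The negative value is consistent with a decoder that mistakes 0xA0 for the lead byte of a two-byte sequence: (160 - 192) * 64 + (80 - 128) = -2096, where 80 is ord('P'). The sketch below assumes the fragment is exactly the bytes shown, and the decoding formula is an assumption about how the linked helper works, not a quote of it.

```php
<?php
// Assumed reconstruction of the fragment: the broken character is the single byte 0xA0.
$fragment = "RESIDENCE \xA0PRINCIPE";

// Dump every byte as hex; the stray byte shows up as "a0".
$hex = bin2hex($fragment);
echo implode(' ', str_split($hex, 2)), "\n";

// A lone 0xA0 is a UTF-8 continuation byte with no lead byte, so validation fails.
$isValidUtf8 = mb_check_encoding($fragment, 'UTF-8');
var_dump($isValidUtf8);

// Arithmetic of a naive two-byte decode that treats 0xA0 as a lead byte.
$lead = ord("\xA0") - 192;  // 160 - 192
$next = ord('P') - 128;     // 80 - 128
$naiveCodePoint = $lead * 64 + $next;
echo $naiveCodePoint, "\n";
```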
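Answer 0 (score: 2)
Solved!
The problem was in the function mb_detect_order(), which does not work the way I expected. I thought it returned an ordered list of all supported encodings, mainly intended to speed up detection.
But I found that this function returns only 2 encodings: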
//print_r(mb_detect_order());
Array
(
[0] => ASCII
[1] => UTF-8
)
In my case, that is almost completely useless.
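This also explains why the import code never converted the broken records. The sketch below assumes the detect order is exactly ASCII and UTF-8 as printed above, and that the fragment holds the single byte 0xA0. Strict detection finds no matching candidate and returns false; because false != "" evaluates to false under loose comparison, the if branch is skipped and the bytes are stored unchanged.

```php
<?php
// Assumed reconstruction of the broken fragment (single byte 0xA0).
$source = "RESIDENCE \xA0PRINCIPE";

// Same call as in the question, with the default order written out explicitly.
$encoding = mb_detect_encoding($source, ['ASCII', 'UTF-8'], true);

// Same condition as in the question.
$wouldConvert = ($encoding != "" && $encoding != "UTF8");

var_dump($encoding);      // neither candidate matches in strict mode
var_dump($wouldConvert);  // so the iconv() branch is not taken
```

Comparing with a strict check such as $encoding === false would at least make the "could not detect" case visible instead of silently passing the data through.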
The mbstring functions can detect many more character sets.
You can get the full list by running mb_list_encodings():
//print_r(mb_list_encodings());
Array
(
[0] => pass
[1] => auto
[2] => wchar
[3] => byte2be
[4] => byte2le
[5] => byte4be
[6] => byte4le
[7] => BASE64
[8] => UUENCODE
[9] => HTML-ENTITIES
[10] => Quoted-Printable
[11] => 7bit
[12] => 8bit
[13] => UCS-4
[14] => UCS-4BE
[15] => UCS-4LE
[16] => UCS-2
[17] => UCS-2BE
[18] => UCS-2LE
[19] => UTF-32
[20] => UTF-32BE
[21] => UTF-32LE
[22] => UTF-16
[23] => UTF-16BE
[24] => UTF-16LE
[25] => UTF-8
[26] => UTF-7
[27] => UTF7-IMAP
[28] => ASCII
[29] => EUC-JP
[30] => SJIS
[31] => eucJP-win
[32] => EUC-JP-2004
[33] => SJIS-win
[34] => SJIS-Mobile#DOCOMO
[35] => SJIS-Mobile#KDDI
[36] => SJIS-Mobile#SOFTBANK
[37] => SJIS-mac
[38] => SJIS-2004
[39] => UTF-8-Mobile#DOCOMO
[40] => UTF-8-Mobile#KDDI-A
[41] => UTF-8-Mobile#KDDI-B
[42] => UTF-8-Mobile#SOFTBANK
[43] => CP932
[44] => CP51932
[45] => JIS
[46] => ISO-2022-JP
[47] => ISO-2022-JP-MS
[48] => GB18030
[49] => Windows-1252
[50] => Windows-1254
[51] => ISO-8859-1
[52] => ISO-8859-2
[53] => ISO-8859-3
[54] => ISO-8859-4
[55] => ISO-8859-5
[56] => ISO-8859-6
[57] => ISO-8859-7
[58] => ISO-8859-8
[59] => ISO-8859-9
[60] => ISO-8859-10
[61] => ISO-8859-13
[62] => ISO-8859-14
[63] => ISO-8859-15
[64] => ISO-8859-16
[65] => EUC-CN
[66] => CP936
[67] => HZ
[68] => EUC-TW
[69] => BIG-5
[70] => CP950
[71] => EUC-KR
[72] => UHC
[73] => ISO-2022-KR
[74] => Windows-1251
[75] => CP866
[76] => KOI8-R
[77] => KOI8-U
[78] => ArmSCII-8
[79] => CP850
[80] => JIS-ms
[81] => ISO-2022-JP-2004
[82] => ISO-2022-JP-MOBILE#KDDI
[83] => CP50220
[84] => CP50220raw
[85] => CP50221
[86] => CP50222
)
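I was wrong to think that mb_detect_order was just an ordered version of this list. With its default value, mb_detect_order is just... useless for this purpose. To detect the source encoding properly before converting to UTF-8, use the following code: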
// candidate encodings, tried in this order
$my_encoding_list = [
    "UTF-8",
    "UTF-7",
    "UTF-16",
    "UTF-32",
    "ISO-8859-16",
    "ISO-8859-15",
    "ISO-8859-10",
    "ISO-8859-1",
    "Windows-1254",
    "Windows-1252",
    "Windows-1251",
    "ASCII",
    // add your preferred encodings here
];
// remove encodings that this mbstring build does not support
$encoding_list = array_intersect($my_encoding_list, mb_list_encodings());
// finally, detect the encoding (strict mode)
$this->encoding = mb_detect_encoding($source, $encoding_list, true);
This solved my problem with the badly encoded data that was being saved in the database.
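One way to package this is to validate first and only fall back to detection when the input is not already UTF-8. The following is a sketch under stated assumptions: the function name toUtf8 and the candidate list are hypothetical, and the fragment is assumed to contain the single byte 0xA0. Keep in mind that detection among single-byte encodings is a guess, since almost any byte sequence is valid in several of them.

```php
<?php
/**
 * Hypothetical helper: return $source as valid UTF-8.
 * $candidates lists the legacy encodings to try when $source is not UTF-8.
 */
function toUtf8(string $source, array $candidates): string
{
    // Already valid UTF-8 (plain ASCII included): nothing to do.
    if (mb_check_encoding($source, 'UTF-8')) {
        return $source;
    }

    // Keep only the encodings this mbstring build supports.
    $supported = array_values(array_intersect($candidates, mb_list_encodings()));
    if ($supported === []) {
        throw new RuntimeException('No supported candidate encoding given');
    }

    // Strict detection: the first candidate in which the bytes are valid.
    $detected = mb_detect_encoding($source, $supported, true);
    if ($detected === false) {
        throw new RuntimeException('Could not detect the source encoding');
    }

    return mb_convert_encoding($source, 'UTF-8', $detected);
}

// Assumed reconstruction of the broken fragment.
$clean = toUtf8("RESIDENCE \xA0PRINCIPE", ['Windows-1252', 'ISO-8859-1']);
$json  = json_encode(['address' => $clean]);
```

Throwing on an undetectable input is a design choice for the sketch; an importer might prefer to log the record and skip it.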
Answer 1 (score: 1)
You can pass the UTF-8//IGNORE charset to iconv to filter out these unknown characters.
$this->encoding = mb_detect_encoding($source, mb_detect_order(), true);
if ($this->encoding != "" && $this->encoding != "UTF8") {
    $source = iconv($this->encoding, "UTF-8//IGNORE", $source);
}
Appending //IGNORE to the target charset silently discards every character that cannot be represented in that charset.
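Note that //IGNORE only helps when something is actually unconvertible: if the source charset is a single-byte encoding, every byte maps to a character and nothing is dropped. The sketch below instead treats the input as UTF-8 so that the malformed bytes are the ones discarded. The function name scrubUtf8 is hypothetical, and the fallback is there because some iconv builds are known to return false with a notice on invalid input even when //IGNORE is given.

```php
<?php
/**
 * Hypothetical helper: drop or replace bytes that are not valid UTF-8.
 */
function scrubUtf8(string $source): string
{
    // Treat the input as UTF-8 and ask iconv to drop what it cannot convert.
    // "@" hides the notice that some iconv builds raise on invalid input.
    $cleaned = @iconv('UTF-8', 'UTF-8//IGNORE', $source);

    if ($cleaned === false) {
        // Fallback: mbstring replaces invalid sequences with its substitute
        // character instead of dropping them.
        $cleaned = mb_convert_encoding($source, 'UTF-8', 'UTF-8');
    }

    return $cleaned;
}

// Assumed reconstruction of the broken fragment.
$cleaned = scrubUtf8("RESIDENCE \xA0PRINCIPE");
$json    = json_encode(['address' => $cleaned]);
```

Either path loses the original character, so this is best kept as a last resort for records whose real source encoding cannot be determined.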