Illegal Unicode characters

Time: 2013-06-24 12:56:35

Tags: php json unicode amazon-web-services amazon-cloudsearch

I am trying to send a document.sdf (JSON) to Amazon Cloud Search. Everything works until the text contains certain special characters:

Found Unicode characters that are not legal for Cloud Search:\n Illegal Unicode character '\u0002'\n Illegal Unicode character '\u0010'\n Illegal Unicode character '\u0001'\n Illegal Unicode character '\b'

The error is triggered by this text:

...sadad<br \/>\n;color:G\u0002% k\u0010>\u0001\b? X_? p>", ...

This comes from the document.sdf, which is generated by a PHP script and passed through json_encode.

The original text of the above:

;color:G% k>? X_? p>

1 answer:

Answer 0 (score: 1)

It may be worth using a regular expression to strip all invalid characters from the text:

[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\uD800\uDC00-\uDBFF\uDFFF]
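As a minimal sketch of how that pattern could be applied in PHP before calling json_encode: PHP's PCRE does not accept `\uXXXX` escapes, so the version below uses `\x{...}` with the `/u` modifier. The surrogate-pair portion of the pattern is expressed here as the code point range U+10000 to U+10FFFF, which is my reading of its intent rather than something stated in the pattern itself. The function name `strip_illegal_chars` and the sample string are hypothetical.

```php
<?php
// Hypothetical helper: removes every character outside the ranges that the
// pattern above allows. Assumes $text is valid UTF-8.
function strip_illegal_chars(string $text): string
{
    // \x{...} plus the /u modifier replaces the \uXXXX notation, which PCRE rejects.
    // \x{10000}-\x{10FFFF} stands in for the surrogate-pair part (an assumption).
    $pattern = '/[^\x{0009}\x{000A}\x{000D}\x{0020}-\x{D7FF}'
             . '\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u';

    $clean = preg_replace($pattern, '', $text);
    if ($clean === null) {
        // preg_replace returns null on failure, e.g. when the input is not valid UTF-8
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $clean;
}

// Sample input modelled on the text in the question, with the control
// characters written as explicit escapes.
$raw = ";color:G\x02% k\x10>\x01\x08? X_? p>";
echo json_encode(strip_illegal_chars($raw)), "\n";
```

Tab, line feed, and carriage return are kept, since the pattern allows them; every other control character below U+0020 is dropped.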

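However, when I ran into a similar problem, the cause turned out to be simply that I was not explicitly specifying the character encoding when making the POST, for example: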

$curl = curl_init($cloudsearch_url);
curl_setopt($curl, CURLOPT_HTTPHEADER, 
            array('Content-Type: application/json; charset=UTF-8')); //Defaults to ISO10646 (I think) without this
curl_setopt($curl, CURLOPT_POST, true);
curl_setopt($curl, CURLOPT_POSTFIELDS, $post_data);
curl_exec($curl);