Question

I'm having trouble converting a Windows-1257 file to UTF-8. The first line of the original file is shown below. I then try to convert the file with PHP code.

<?xml version="1.0" encoding="windows-1257"?>

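It saves the file as UTF-8, but when I open that file I still get an error:

XML Parsing Error: Input is not proper UTF-8, indicate encoding! Bytes: 0x04 0x50 0x72 0x65

Is there a proper way to convert it into readable UTF-8, or does this mean the file still contains some symbols that are not valid UTF-8?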

Answer 1

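You are converting from UTF-8 to UTF-8//IGNORE, which is why you get that error. The first argument of iconv is in_charset, the encoding of the input string (see iconv on PHP.net). Change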

$unicode_xml = iconv("UTF-8", "UTF-8//IGNORE", $baltic_xml);

to

$unicode_xml = iconv("CP1257", "UTF-8//IGNORE", $baltic_xml);
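
As a minimal sketch of why in_charset matters, the snippet below converts a short hand-made byte string and checks the result. The sample bytes are an assumption chosen for illustration (the Lithuanian word "ačiū" in Windows-1257), and it assumes the platform's iconv knows the name CP1257.

```php
<?php
// Sample input (assumed for illustration): "ačiū" encoded as Windows-1257.
$baltic = "a\xE8i\xFB";

// The raw bytes are not valid UTF-8, which is what the XML parser rejects.
$raw_is_utf8 = mb_check_encoding($baltic, 'UTF-8');

// Naming the real source charset as in_charset produces valid UTF-8.
$utf8 = iconv('CP1257', 'UTF-8//IGNORE', $baltic);
$converted_is_utf8 = mb_check_encoding($utf8, 'UTF-8');

// Compare against the expected text, written with Unicode escapes.
var_dump($raw_is_utf8, $converted_is_utf8, $utf8 === "a\u{010D}i\u{016B}");
```

Running the same check on your own output file is a quick way to confirm that the conversion, and not the parser, was the problem.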

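However, I personally recommend the mb_* functions: iconv depends heavily on the operating system's iconv implementation and can behave differently from one system to another, whereas mb_* is a pure PHP extension and behaves consistently. Rewritten with mb_*, your code becomes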

ini_set('mbstring.substitute_character', 'none'); // drop unmappable characters, the counterpart of //IGNORE in iconv
$baltic_xml = file_get_contents($remote_file);
$unicode_xml = mb_convert_encoding($baltic_xml, 'UTF-8', 'ISO-8859-13'); // mbstring has no CP1257, so use the closest supported charset
$unicode_xml = preg_replace('/[^\PC\s]/u', '', $unicode_xml); // strip control characters other than whitespace, if there are any
file_put_contents('data/rmtools/import/utf8/' . $files_single, $unicode_xml);

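According to the list of encodings supported by mbstring, CP-1257 is not among them. You can use ISO-8859-13 instead, but note that the two differ in some graphic characters (although, according to Wikipedia, the letters used by the languages appear to be the same).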
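
One possible way to combine the two approaches is sketched below: try iconv with CP1257 first and fall back to mbstring with ISO-8859-13 only if that fails. The function name baltic_xml_to_utf8 and the sample document are hypothetical. The sketch also rewrites the XML declaration, on the assumption that a file which still declares windows-1257 after conversion would be decoded incorrectly by a parser that honors the declaration.

```php
<?php
// Hypothetical helper: convert Windows-1257 XML text to UTF-8.
function baltic_xml_to_utf8(string $xml): string
{
    // Prefer iconv with the exact charset; @ silences the notice raised on failure.
    $utf8 = function_exists('iconv') ? @iconv('CP1257', 'UTF-8', $xml) : false;

    if ($utf8 === false) {
        // Fallback: ISO-8859-13 matches CP1257 for the letters, not for every symbol.
        $utf8 = mb_convert_encoding($xml, 'UTF-8', 'ISO-8859-13');
    }

    // Make the XML declaration agree with the new encoding.
    $fixed = preg_replace(
        '/^(<\?xml[^>]*encoding=["\'])[^"\']+/i',
        '${1}UTF-8',
        $utf8,
        1
    );

    return $fixed ?? $utf8;
}

// Sample document (assumed for illustration) with Windows-1257 bytes in the body.
$sample = "<?xml version=\"1.0\" encoding=\"windows-1257\"?>\n<w>a\xE8i\xFB</w>";
$result = baltic_xml_to_utf8($sample);
echo $result, "\n";
```

Trying iconv first keeps the exact CP1257 mapping where it is available, so the graphic characters that differ in ISO-8859-13 are only affected on systems where the fallback is actually used.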
