Converting a UCS-2 file to UTF-8 with PHP

Time: 2012-06-08 03:59:59

Tags: php encoding

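I have a CSV file supplied by a client that I have to parse with PHP and insert into a database.

Before inserting the data into the database I want to convert it to UTF-8, but I can't seem to get the conversion right.

This is what I get when I try to detect the file's encoding: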

$ enca -d -L zh ./artigos.txt 
    ./artigos.txt: Universal character set 2 bytes; UCS-2; BMP
    CRLF line terminators
    Byte order reversed in pairs (1,2 -> 2,1)

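I tried the iconv function, but it garbles the conversion and outputs characters that differ from the originals.

The first line of the file (base64-encoded):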

IgAwADMAMQAxADkAIgAsACIANwAzADEAMwA0ADYAMgA2ADQAMAAwADEANQAiACwAIgBBAGcAcgBhAGYAYQBkAG8AcgAgAFIAYQBwAGkAZAAgADkAIABIAGUAYQB2AHkAIABEAHUAdAB5ACIALAAiAEEAZwByAGEAZgBvACAAOQAvADgALAAgADkALwAxADAALAAgADkALwAxADIALAAgADkALwAxADQAIgAsACIAMQAxADAAZgBsAHMAIgAsACIAIgAsACIAIgAsACIAIgAsACIAMAAzADEAMQA5AC4AagBwAGcAIgAsACIAIgAsACIAMQAsADIAMAAiACwAIgA1ADkALAA5ADAAIgAsACIAMgAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIARgBhAGwAcwBlACIADQAK

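3 answers:

Answer 0: (score: 3)

Microsoft Excel CSV files are usually little-endian encoded (it took me a long time to find that out). If you want to use them with, for example, fgetcsv, you should convert the file to UTF-8 first. I did the following: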

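        // Read the whole file as raw bytes.
        $str = file_get_contents($file);
        // Convert from little-endian UCS-2 to UTF-8.
        $str = mb_convert_encoding($str, 'UTF-8', 'UCS-2LE');
        // Save under a "converted_" prefix (assumes $file is a bare filename).
        file_put_contents("converted_".$file, $str);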
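If the file starts with a byte order mark, the conversion above carries it into the UTF-8 output, where it can end up in the first CSV field. Here is a minimal sketch of stripping it and then parsing with fgetcsv, assuming the mbstring extension is available. The helper names `utf16le_to_utf8` and `parse_csv_text` are made up for illustration, and the sample input is built in memory rather than read from the real file.

```php
<?php
// Hypothetical helper (not from the answer): UTF-16LE/UCS-2LE bytes to UTF-8.
function utf16le_to_utf8(string $bytes): string
{
    // Drop a leading byte order mark (FF FE) so it does not leak into the first field.
    if (strncmp($bytes, "\xFF\xFE", 2) === 0) {
        $bytes = substr($bytes, 2);
    }
    return mb_convert_encoding($bytes, 'UTF-8', 'UTF-16LE');
}

// Hypothetical helper: parse UTF-8 CSV text with fgetcsv via a temporary stream.
function parse_csv_text(string $utf8): array
{
    $fh = fopen('php://temp', 'r+');
    fwrite($fh, $utf8);
    rewind($fh);
    $rows = [];
    while (($row = fgetcsv($fh, 0, ',', '"', '\\')) !== false) {
        $rows[] = $row;
    }
    fclose($fh);
    return $rows;
}

// Sample input built in memory: BOM, then "a","ç" and CRLF as UTF-16LE.
$raw = "\xFF\xFE" . mb_convert_encoding("\"a\",\"\xC3\xA7\"\r\n", 'UTF-16LE', 'UTF-8');
$rows = parse_csv_text(utf16le_to_utf8($raw));
```

For a real file, `$raw` would come from `file_get_contents($file)` instead of the in-memory sample.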

Answer 1: (score: 2)

This seems to work (little-endian), although your sample doesn't include any non-ASCII characters:

$s='IgAwADMAMQAxADkAIgAsACIANwAzADEAMwA0ADYAMgA2ADQAMAAwADEANQAiACwAIgBBAGcAcgBhAGYAYQBkAG8AcgAgAFIAYQBwAGkAZAAgADkAIABIAGUAYQB2AHkAIABEAHUAdAB5ACIALAAiAEEAZwByAGEAZgBvACAAOQAvADgALAAgADkALwAxADAALAAgADkALwAxADIALAAgADkALwAxADQAIgAsACIAMQAxADAAZgBsAHMAIgAsACIAIgAsACIAIgAsACIAIgAsACIAMAAzADEAMQA5AC4AagBwAGcAIgAsACIAIgAsACIAMQAsADIAMAAiACwAIgA1ADkALAA5ADAAIgAsACIAMgAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIAMAAiACwAIgAwACIALAAiADAAIgAsACIARgBhAGwAcwBlACIADQAK';
$t=base64_decode($s);
echo iconv('UCS-2LE', 'UTF-8', substr($t, 0, -1)); // drop the last byte: a lone 0x0A, only half of a 2-byte code unit
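The sample ends with the bytes 0D 00 0A, which suggests the line was cut right after the 0x0A byte of a little-endian line feed (0A 00), leaving the 0x00 half behind. Here is a small sketch of that effect and one way around it (convert the whole buffer first, split into lines afterwards), assuming mbstring is available. The two-line input is invented for illustration.

```php
<?php
// Two CSV lines encoded as UTF-16LE, built in memory for illustration.
$utf16 = mb_convert_encoding("\"1\",\"a\"\r\n\"2\",\"b\"\r\n", 'UTF-16LE', 'UTF-8');

// Reading one "line" of raw bytes stops right after the 0x0A byte,
// leaving its 0x00 partner at the start of the next line.
$fh = fopen('php://memory', 'r+');
fwrite($fh, $utf16);
rewind($fh);
$rawLine = fgets($fh);
fclose($fh);
$hasOrphanByte = strlen($rawLine) % 2 === 1;

// Converting the whole buffer first keeps every 2-byte code unit intact.
$utf8  = mb_convert_encoding($utf16, 'UTF-8', 'UTF-16LE');
$lines = explode("\r\n", rtrim($utf8, "\r\n"));
```

Splitting on "\r\n" after conversion is only safe when no quoted field contains a line break. Otherwise a CSV parser should handle the splitting.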

Answer 2: (score: 0)

Python:

One way to encode is:

text -> utf-16-be -> hex

To convert back:

hex -> binary, then decode from utf-16-be to text

Note: ucs-2be is deprecated; use utf-16-be instead.
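The answer describes these steps for Python. Since the question is about PHP, here is a rough PHP equivalent of the same round trip, assuming mbstring is available. The function names `text_to_hex` and `hex_to_text` are illustrative and not taken from the answer.

```php
<?php
// Hypothetical helpers mirroring the steps above; names are illustrative.
function text_to_hex(string $utf8): string
{
    // text -> UTF-16BE bytes -> hex string
    return bin2hex(mb_convert_encoding($utf8, 'UTF-16BE', 'UTF-8'));
}

function hex_to_text(string $hex): string
{
    // hex string -> binary -> decode UTF-16BE back to UTF-8 text
    return mb_convert_encoding(hex2bin($hex), 'UTF-8', 'UTF-16BE');
}

$hex  = text_to_hex("A\xC3\xA7");   // "A" followed by "ç"
$text = hex_to_text($hex);
```

Note that this uses big-endian byte order, while the file in the question is little-endian. For that file the encoding name would be 'UTF-16LE'.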
