PHP: detecting the possible output encodings for UTF-8 characters

Asked: 2014-02-23 11:14:19

Tags: php encoding utf-8 output

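I am trying to convert a PHP string from UTF-8 to a required encoding (ISO-8859-2). The problem is that the UTF-8 string contains characters that do not fit into ISO-8859-2; they were converted to UTF-8 from Windows-1251 (even though they look exactly like native ISO-8859-2 characters). Those characters show up as "?" in the output.

If I try to convert the same string to Windows-1251 instead, those characters are displayed, but then the characters native to ISO-8859-2 (such as "ä", "ö", etc.) are the ones that go missing.

I fetch the strings from a MySQL database and need to convert them to a non-Unicode character set and store them in an SQLite database file, because the program that will use them does not support Unicode.

So my question is: is there a way to find out which non-Unicode encodings can represent a given UTF-8 character? I am currently iterating over the whole UTF-8 string and trying to convert each character one by one, but the Windows-1251 characters are still missing.

The code is as follows: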


$string = "various charset input";

$str = str_split_unicode($string,1); // The function from the php-str_split manual page, splits utf string into an array

$handler = "";

foreach($str as $value):
    $currentChar = iconv("utf-8", "iso-8859-2", $value) or "%no%";

    if($currentChar == "%no%" ):
        $currentChar = ""; 
        $currentChar = iconv("utf-8", "windows-1251", $value) or "%no%";
    endif;

    if($currentChar != "%no%"):

        $handler .= $currentChar;

    else:

        $handler .= $value;

    endif;

endforeach;

$string = $handler;

But the question marks are still there.
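The per-character check described above could be sketched roughly as follows. This is only an illustration, assuming the mbstring extension is available; `encodings_for_char` is a hypothetical helper name, not something from the original code. It round-trips each character through a candidate encoding instead of relying on `iconv(...) or "%no%"`: in that expression the assignment binds more tightly than `or`, so the variable never receives "%no%" and the Windows-1251 fallback is never reached.

```php
<?php
// Hypothetical helper (not from the original post): returns the subset of
// $candidates that can represent the single UTF-8 character $char.
function encodings_for_char(string $char, array $candidates): array
{
    $supported = [];
    foreach ($candidates as $encoding) {
        // Round trip: UTF-8 -> candidate -> UTF-8. mbstring substitutes
        // characters it cannot represent, so a lossy conversion will not
        // reproduce the original character.
        $encoded = mb_convert_encoding($char, $encoding, 'UTF-8');
        $decoded = mb_convert_encoding($encoded, 'UTF-8', $encoding);
        if ($decoded === $char) {
            $supported[] = $encoding;
        }
    }
    return $supported;
}

// Report, for each character, which of the candidate encodings can hold it.
$candidates = ['ISO-8859-2', 'Windows-1251'];
foreach (preg_split('//u', 'öиa', -1, PREG_SPLIT_NO_EMPTY) as $char) {
    echo $char, ': ', implode(', ', encodings_for_char($char, $candidates)), PHP_EOL;
}
```

Keep in mind that a string mixing bytes from several single-byte encodings carries no marker of which encoding each byte came from, so the consuming program has to know that by some other means.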

  

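Update

Thanks to CertaiN. I edited the function you provided (though it may have become less readable) so that it converts the characters back to an appropriate encoding.

Function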



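    function utf8_to_multicharset($str, $encoding, $htmSupportedOutput = "iso-8859-15") {
        // Split the UTF-8 string into characters and convert a copy to the primary encoding.
        $utf8 = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
        $out = $utf8;
        mb_convert_variables($encoding, 'UTF-8', $out);

        // Accept the fallback encodings as an array or as a comma-separated string.
        is_array($htmSupportedOutput) or $htmSupportedOutput = explode(",", $htmSupportedOutput);

        // The table selector and the quote flags are separate arguments.
        $table = get_html_translation_table(HTML_SPECIALCHARS, ENT_QUOTES);

        foreach ($out as $i => &$char) {
            if ($char === '?' && $utf8[$i] !== '?') {
                // Not representable in $encoding: fall back to a numeric entity.
                $char = mb_convert_encoding($utf8[$i], 'HTML-ENTITIES', 'UTF-8');
            } elseif (isset($table[$char])) {
                $char = $table[$char];
            }

            // Try to decode the entity into each fallback encoding in turn.
            foreach ($htmSupportedOutput as $o) {
                // ENT_NOQUOTES is an explicit flag value; quote entities stay escaped.
                $char = html_entity_decode($char, ENT_NOQUOTES, $o);
            }
        }
        unset($char); // drop the reference left behind by the foreach

        return implode('', $out);
    }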

It now checks the specified list of encodings and converts each character to an encoding that supports it, like this:

Example

PHP usage:


    <?php
       $string = "vatiöus charset иnput";
       $result = utf8_to_multicharset($string,"iso-8859-2","cp1252,cp1251,koi8r");
    ?>

1 answer:

Answer 0: (score: 0)

Do you need HTML entity encoding?

Function

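function utf8_to_escaped_another($str, $encoding) {
    // Split into UTF-8 characters and convert a copy to the target encoding.
    $utf8 = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    $out = $utf8;
    mb_convert_variables($encoding, 'UTF-8', $out);
    // The table selector and the quote flags are separate arguments.
    $table = get_html_translation_table(HTML_SPECIALCHARS, ENT_QUOTES);
    foreach ($out as $i => &$char) {
        if ($char === '?' && $utf8[$i] !== '?') {
            // Not representable in $encoding: emit a numeric entity instead.
            $char = mb_convert_encoding($utf8[$i], 'HTML-ENTITIES', 'UTF-8');
        } elseif (isset($table[$char])) {
            // Escape HTML special characters such as <, >, & and quotes.
            $char = $table[$char];
        }
    }
    unset($char); // drop the reference left behind by the foreach
    return implode('', $out);
}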

Example

PHP source code

<?php

function utf8_to_escaped_another($str, $encoding) {
    $utf8 = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    $out = $utf8;
    mb_convert_variables($encoding, 'UTF-8', $out);
    // The table selector and the quote flags are separate arguments.
    $table = get_html_translation_table(HTML_SPECIALCHARS, ENT_QUOTES);
    foreach ($out as $i => &$char) {
        if ($char === '?' && $utf8[$i] !== '?') {
            $char = mb_convert_encoding($utf8[$i], 'HTML-ENTITIES', 'UTF-8');
        } elseif (isset($table[$char])) {
            $char = $table[$char];
        }
    }
    unset($char); // drop the reference left behind by the foreach
    return implode('', $out);
}

header('Content-Type: text/html; charset=ISO-8859-2');

$text = <<<EOD
English: Good Morning
Arabic: صباح الخير
Japanese: おはよう
EOD;

echo '<pre>';
echo utf8_to_escaped_another($text, 'ISO-8859-2');
echo '</pre>';

HTML view

English: Good Morning
Arabic: صباح الخير
Japanese: おはよう

HTML source

<pre>English: Good Morning
Arabic: &#1589;&#1576;&#1575;&#1581; &#1575;&#1604;&#1582;&#1610;&#1585;
Japanese: &#12362;&#12399;&#12424;&#12358;</pre>
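
The function in this answer relies on mbstring's 'HTML-ENTITIES' pseudo-encoding, which is deprecated as of PHP 8.2. A possible variant that builds the numeric entities itself is sketched below. It is an illustration only: `utf8_to_escaped_numeric` is a hypothetical name, the sketch assumes PHP 7.2 or later (for `mb_ord`), and it assumes the target is a single-byte, ASCII-compatible encoding such as ISO-8859-2.

```php
<?php
// Hypothetical variant (name and approach are illustrative, not from the answer).
// Assumes $encoding is a single-byte, ASCII-compatible encoding.
function utf8_to_escaped_numeric(string $str, string $encoding): string
{
    $out = '';
    foreach (preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        // Round trip through the target encoding to see whether the
        // character survives the conversion unchanged.
        $encoded = mb_convert_encoding($char, $encoding, 'UTF-8');
        $roundTrip = mb_convert_encoding($encoded, 'UTF-8', $encoding);
        if ($roundTrip === $char) {
            // Representable: escape HTML special characters, keep other bytes.
            // 'ISO-8859-1' makes htmlspecialchars treat the input as single bytes.
            $out .= htmlspecialchars($encoded, ENT_QUOTES, 'ISO-8859-1');
        } else {
            // Not representable: emit a decimal numeric character reference.
            $out .= '&#' . mb_ord($char, 'UTF-8') . ';';
        }
    }
    return $out;
}

echo utf8_to_escaped_numeric('Japanese: おはよう', 'ISO-8859-2'), PHP_EOL;
```

Unlike the comparison against '?', the round-trip check does not depend on which substitute character mbstring is configured to use.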