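How to detect UTF-16 encoding

Time: 2014-04-12 07:19:27

Tags: php character-encoding utf-16 utf

I have to read a file and identify its encoding. I used mb_detect_encoding() to detect UTF-16, but the result is wrong. How can I detect UTF-16 encoding in PHP?

The PHP file is UTF-16, and my header is windows-1256 (because of Arabic):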

header('Content-Type: text/html; charset=windows-1256');

$delimiter = '\t';
$f = file($fileName);

$matchesz = array(); // collected fields, one entry per line of the file

foreach ($f as $dailystatmet)
{
    $transactionData = str_replace("'", '', $dailystatmet);
    preg_match_all("/('?\d+,\d+\.\d+)?([a-zA-Z]|[0-9]|)[^".$delimiter."]+/", $transactionData, $matches);

    array_push($matchesz, $matches[0]);
}

$searchKeywords = array ("apple", "orange", 'mango');

$rowCount = count($matchesz);

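// Starts at 1, presumably to skip a header row; "<" keeps $row inside the array bounds.
for ($row = 1; $row < $rowCount; $row++) {
    $myRow = $row;
    $cell = $matchesz[$row];

    foreach ($searchKeywords as $val) {
        // $c_description (the index of the description column) is not defined in this snippet
        if (partialArraySearch($cell[$c_description], $val)) {

        }
    }
}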



function partialArraySearch($cell, $searchword)
{

    if (strpos(strtoupper($cell), strtoupper($searchword)) !== false) {

        return true;
    }

    return false;
}

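The code above searches within an uploaded file. If the file is in UTF-8, matches are found, but when the same file is in UTF-16 or UTF-32, no results are returned.

So how can I get the encoding of the uploaded file?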

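2 answers:

Answer 0: (score: 1)

In case anyone is still looking for a solution, I hacked together something similar in the "voku/portable-utf8" repo on GitHub => "UTF8::file_get_contents()"

This "file_get_contents" wrapper detects the current encoding via "UTF8::str_detect_encoding()" and automatically converts the file content to UTF-8.

For example, from the PHPUnit tests: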

$testString = UTF8::file_get_contents(dirname(__FILE__) . '/test1Utf16pe.txt');
$this->assertContains('<p>Today’s Internet users are not the same users who were online a decade ago. There are better connections.', $testString);

$testString = UTF8::file_get_contents(dirname(__FILE__) . '/test1Utf16le.txt');
$this->assertContains('<p>Today’s Internet users are not the same users who were online a decade ago. There are better connections.', $testString);
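
For readers who would rather not pull in the library, checking for a byte order mark (BOM) is one way to approach the same idea. The following is only a minimal sketch, not the library's actual implementation: `bomTable`, `detectBomEncoding` and `bytesToUtf8` are hypothetical names, and it assumes the file starts with a BOM and that the mbstring extension is available.

```php
<?php
// Hypothetical helper: encodings and their byte order marks.
// UTF-32 comes first because the UTF-32LE BOM begins with the UTF-16LE BOM.
function bomTable(): array
{
    return array(
        'UTF-32BE' => "\x00\x00\xFE\xFF",
        'UTF-32LE' => "\xFF\xFE\x00\x00",
        'UTF-8'    => "\xEF\xBB\xBF",
        'UTF-16BE' => "\xFE\xFF",
        'UTF-16LE' => "\xFF\xFE",
    );
}

// Hypothetical helper: returns the encoding named by a leading BOM,
// or null when the data does not start with one.
function detectBomEncoding(string $bytes): ?string
{
    foreach (bomTable() as $encoding => $bom) {
        if (strncmp($bytes, $bom, strlen($bom)) === 0) {
            return $encoding;
        }
    }
    return null;
}

// Hypothetical helper: strips the BOM and converts the rest to UTF-8.
// Data without a BOM is returned unchanged, since its encoding is unknown.
function bytesToUtf8(string $bytes): string
{
    $encoding = detectBomEncoding($bytes);
    if ($encoding === null) {
        return $bytes;
    }
    $body = (string) substr($bytes, strlen(bomTable()[$encoding]));
    if ($encoding === 'UTF-8') {
        return $body;
    }
    return mb_convert_encoding($body, 'UTF-8', $encoding);
}
```

Under these assumptions, `bytesToUtf8(file_get_contents($fileName))` would give a UTF-8 string that functions such as strpos() can search. Files saved without a BOM are not recognized by this sketch and need a heuristic instead.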

Answer 1: (score: 1)

My solution is to detect UTF-16 and convert the text to ISO-8859-15 (Latin-9):

  preg_match_all('/\x00/',$content,$count);
  if(count($count[0])/strlen($content)>0.4) {
     $content = iconv('UTF-16', 'ISO-8859-15', $content);
  }

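In other words, I check how frequent the byte 0x00 is. If its share is above 0.4, the text probably consists of characters from the basic set encoded in UTF-16: each such character takes two bytes, and one of them (the second, in little-endian order) is usually 00.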
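
The same counting idea can be extended to guess the byte order, because how a bare 'UTF-16' is interpreted when there is no BOM may depend on the iconv implementation. This is a sketch under the same assumption as above (mostly ASCII or Latin text); `guessUtf16ByteOrder` is a hypothetical name, and the 0.4 default is simply the threshold from the snippet above.

```php
<?php
// Hypothetical helper: guess the byte order of BOM-less UTF-16 text by
// comparing how many NUL bytes sit at even versus odd offsets.
// Returns 'UTF-16LE', 'UTF-16BE', or null when the data does not look like UTF-16.
function guessUtf16ByteOrder(string $content, float $threshold = 0.4): ?string
{
    $length = strlen($content);
    if ($length < 2 || $length % 2 !== 0) {
        return null; // UTF-16 data always has an even number of bytes
    }

    $evenNuls = 0;
    $oddNuls = 0;
    for ($i = 0; $i < $length; $i++) {
        if ($content[$i] === "\x00") {
            if ($i % 2 === 0) {
                $evenNuls++;
            } else {
                $oddNuls++;
            }
        }
    }

    // Same test as the snippet above: overall share of NUL bytes.
    if (($evenNuls + $oddNuls) / $length <= $threshold) {
        return null;
    }
    if ($evenNuls === $oddNuls) {
        return null; // no clear winner, do not guess
    }
    // Little-endian ASCII looks like "h\x00" (NUL at odd offsets),
    // big-endian ASCII looks like "\x00h" (NUL at even offsets).
    return $oddNuls > $evenNuls ? 'UTF-16LE' : 'UTF-16BE';
}
```

When the function returns a non-null value, that value can be passed to iconv() as the source encoding instead of the bare 'UTF-16'. Text made up mostly of non-Latin characters (Arabic, for example) contains few NUL bytes, so this heuristic would not recognize it.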