How to read non-English characters, such as Arabic words, in PHP

Asked: 2014-11-05 21:30:33

Tags: php utf-8 utf8-decode

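I am converting a CSV file with a single column into a PHP array. The column contains some non-English characters, such as Arabic. PHP reads the English characters correctly, but the Arabic characters are changed into what look like random numbers and symbols. Here is my code: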

    <?php

    function str_to_csv( $row )
    {
        if( $row=='' )
        {
            return array();
        }
        $a = array();
        $src = explode(',', $row );
        do{
            $p = array_shift($src);
            while( mb_substr_count($p,'"') % 2 != 0 )
            {
                if( count($src)==0 ){ return false; }
                $p .= ','.array_shift($src);
            }
            $match = null;
            if( preg_match('/^"(.+)"[\n]*$/', $p, $match ))
            {
                $p = $match[1];
            }
            $a[] = str_replace('""','"',$p);
        }while( count($src) > 0 );
        return $a;
    }


    function file_getcsv( $f )
    {
        $line = fgets( $f );
        while( ($a = str_to_csv($line))===false )
        {
            if( feof($f) ){ return false; }
            $line .= "\n".fgets( $f );
        }
        return $a;
    }


    function file_to_csv( $filename )
    {
        ini_set("auto_detect_line_endings", true);
        $a = array();
        $f = fopen($filename,'r');
        while( !feof($f) )
        {
            $rec = file_getcsv($f);
            if( $rec===false ){ return false; }
            if( !empty($rec) )
            {
                $a[] = $rec;
            }
        }
        fclose($f);
        return $a;
    }

    $data = file_to_csv('club3.csv');

    echo '<pre>';print_r($data);
    ?>

Here is a sample of my Excel sheet; it has only one column:

....
    Royal Kings
    Mere Cats
    Spin Doctors
    رأس العين
....

When I dump the array, it looks like this:

...
Royal Kings
        )

    [32935] => Array
        (
            [0] => 
Mere Cats
        )

    [32936] => Array
        (
            [0] => 
Spin Doctors
        )

    [32937] => Array
        (
            [0] => 
1#3 'D9JF
        )
...
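
A note on the output above: `1#3 'D9JF` is what remains of رأس العين when only the low byte of each UTF-16 code unit is shown (for example, ر is U+0631 and byte 0x31 is `1`). That suggests, as an assumption rather than a confirmed fact, that the file was exported as UTF-16 instead of UTF-8. A minimal sketch under that assumption converts the raw bytes to UTF-8 before parsing. The helper name `utf16_csv_to_rows` is hypothetical, the mbstring extension is assumed to be installed, and little-endian byte order is assumed when no BOM is present.

```php
<?php
// Hypothetical helper: decode UTF-16 CSV text to UTF-8, then parse each line.
// Assumes the mbstring extension is available.
function utf16_csv_to_rows($raw)
{
    // Pick the byte order from the BOM; fall back to little-endian (assumption).
    $encoding = 'UTF-16LE';
    if (strncmp($raw, "\xFF\xFE", 2) === 0) {
        $raw = substr($raw, 2);
    } elseif (strncmp($raw, "\xFE\xFF", 2) === 0) {
        $encoding = 'UTF-16BE';
        $raw = substr($raw, 2);
    }

    $utf8 = mb_convert_encoding($raw, 'UTF-8', $encoding);

    $rows = array();
    // Split on any newline style. Simplification: quoted fields that span
    // several lines are not handled here.
    foreach (preg_split('/\r\n|\r|\n/', $utf8) as $line) {
        if ($line === '') {
            continue;
        }
        $rows[] = str_getcsv($line, ',', '"', '\\');
    }
    return $rows;
}

// Possible usage with the file from the question:
// $data = utf16_csv_to_rows(file_get_contents('club3.csv'));
```

If the file turns out to be in another encoding (for example Windows-1256), the same approach would apply with a different source encoding passed to `mb_convert_encoding`.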

1 Answer:

Answer 0 (score: 0)

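Arabic poses a bigger problem. The issue is not only reading the letters: each Arabic letter has several forms, depending on whether it stands alone or appears at the beginning, in the middle, or at the end of a word. Example: ﺏ, ﺑ, ﺒ, ﺐ. To fix this you can use Unicode. Use this function to fix all of these problems.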
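
To make the positional forms concrete, here is a small sketch that decodes the four presentation-form entities for ب (the same code points that appear in the function's table) with `html_entity_decode`. The variable names are illustrative only.

```php
<?php
// Presentation-form code points for the letter beh, written as HTML entities.
$beh_forms = array(
    'isolated'  => '&#xFE8F;',
    'end'       => '&#xFE90;',
    'beginning' => '&#xFE91;',
    'middle'    => '&#xFE92;',
);

foreach ($beh_forms as $position => $entity) {
    // Decode the numeric entity into a UTF-8 encoded character.
    $glyph = html_entity_decode($entity, ENT_QUOTES, 'UTF-8');
    echo $position, ': ', $glyph, ' (', strlen($glyph), " bytes in UTF-8)\n";
}
```

Each of these presentation forms takes three bytes in UTF-8, whereas the base letter ب takes two.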

Note: this function was created by Abd AL-Latif. See it on GitHub -> https://goo.gl/m8pkGx

The function:

function fixArabicCharactersAndCreateProbablyString($word)
{

    $new_word = array();
    $char_type = array();
    $isolated_chars = array('ا', 'د', 'ذ', 'أ', 'آ', 'ر', 'ؤ', 'ء', 'ز', 'و', 'ى', 'ة', 'إ');
    $alef = array('أ','ا','إ','آ');
    $lam = array('ل');
    $al_char = array();
    $all_chars = array
        (
            'ا' => array(
                'la_beg'        =>   '&#xFEFB;',
                'la_end'        =>   '&#xFEFC;',
                'middle'        =>   '&#xFE8E;',

                'isolated'      =>   '&#xFE8D;'
                ),
            'إ' => array(
                'la_beg'        =>   '&#xFEF9;',
                'la_end'        =>   '&#xFEFA;',
                'middle'        =>   '&#xFE88;',

                'isolated'      =>   '&#xFE87;'
                ),

            'ؤ' => array(
                // U+FE85 is the isolated form and U+FE86 the final (joined) form
                'middle'        =>   '&#xFE86;',

                'isolated'      =>   '&#xFE85;'
                ),
            'ء' => array(
                'middle'        =>   '&#xFE80;',
                'isolated'      =>   '&#xFE80;'
                ),
            'أ' => array(
                'la_beg'        =>   '&#xFEF7;',
                'la_end'        =>   '&#xFEF8;',
                'middle'        =>   '&#xFE84;',

                'isolated'      =>   '&#xFE83;'
                ),
            'آ' => array(
                'la_beg'        =>   '&#xFEF5;',
                'la_end'        =>   '&#xFEF6;',
                'middle'        =>   '&#xFE82;',

                'isolated'      =>   '&#xFE81;'
                ),
            'ى' => array(

                'middle'        =>   '&#xFEF0;',

                'isolated'      =>   '&#xFEEF;'
                ),
            'ب' => array(
                'beginning'     =>   '&#xFE91;',
                'middle'        =>   '&#xFE92;',
                'end'           =>   '&#xFE90;',
                'isolated'      =>   '&#xFE8F;'
                ),
            'ت' => array(
                'beginning'     =>   '&#xFE97;',
                'middle'        =>   '&#xFE98;',
                'end'           =>   '&#xFE96;',
                'isolated'      =>   '&#xFE95;'
                ),
            'ث' => array(
                'beginning'     =>   '&#xFE9B;',
                'middle'        =>   '&#xFE9C;',
                'end'           =>   '&#xFE9A;',
                'isolated'      =>   '&#xFE99;'
                ),
            'ج' => array(
                'beginning'     =>   '&#xFE9F;',
                'middle'        =>   '&#xFEA0;',
                'end'           =>   '&#xFE9E;',
                'isolated'      =>   '&#xFE9D;'
                ),
            'ح' => array(
                'beginning'     =>   '&#xFEA3;',
                'middle'        =>   '&#xFEA4;',
                'end'           =>   '&#xFEA2;',
                'isolated'      =>   '&#xFEA1;'
                ),
            'خ' => array(
                'beginning'     =>   '&#xFEA7;',
                'middle'        =>   '&#xFEA8;',
                'end'           =>   '&#xFEA6;',
                'isolated'      =>   '&#xFEA5;'
                ),
            'د' => array(
                'middle'        =>   '&#xFEAA;',
                'isolated'      =>   '&#xFEA9;'
                ),
            'ذ' => array(
                'middle'        =>   '&#xFEAC;',
                'isolated'      =>   '&#xFEAB;'
                ),
            'ر' => array(
                'middle'        =>   '&#xFEAE;',
                'isolated'      =>   '&#xFEAD;'
                ),
            'ز' => array(
                'middle'        =>   '&#xFEB0;',
                'isolated'      =>   '&#xFEAF;'
                ),
            'س' => array(
                'beginning'     =>   '&#xFEB3;',
                'middle'        =>   '&#xFEB4;',
                'end'           =>   '&#xFEB2;',
                'isolated'      =>   '&#xFEB1;'
                ),
            'ش' => array(
                'beginning'     =>   '&#xFEB7;',
                'middle'        =>   '&#xFEB8;',
                'end'           =>   '&#xFEB6;',
                'isolated'      =>   '&#xFEB5;'
                ),
            'ص' => array(
                'beginning'     =>   '&#xFEBB;',
                'middle'        =>   '&#xFEBC;',
                'end'           =>   '&#xFEBA;',
                'isolated'      =>   '&#xFEB9;'
                ),
            'ض' => array(
                'beginning'     =>   '&#xFEBF;',
                'middle'        =>   '&#xFEC0;',
                'end'           =>   '&#xFEBE;',
                'isolated'      =>   '&#xFEBD;'
                ),
            'ط' => array(
                'beginning'     =>   '&#xFEC3;',
                'middle'        =>   '&#xFEC4;',
                'end'           =>   '&#xFEC2;',
                'isolated'      =>   '&#xFEC1;'
                ),
            'ظ' => array(
                'beginning'     =>   '&#xFEC7;',
                'middle'        =>   '&#xFEC8;',
                'end'           =>   '&#xFEC6;',
                'isolated'      =>   '&#xFEC5;'
                ),
            'ع' => array(
                'beginning'     =>   '&#xFECB;',
                'middle'        =>   '&#xFECC;',
                'end'           =>   '&#xFECA;',
                'isolated'      =>   '&#xFEC9;'
                ),
            'غ' => array(
                'beginning'     =>   '&#xFECF;',
                'middle'        =>   '&#xFED0;',
                'end'           =>   '&#xFECE;',
                'isolated'      =>   '&#xFECD;'
                ),
            'ف' => array(
                'beginning'     =>   '&#xFED3;',
                'middle'        =>   '&#xFED4;',
                'end'           =>   '&#xFED2;',
                'isolated'      =>   '&#xFED1;'
                ),
            'ق' => array(
                'beginning'     =>   '&#xFED7;',
                'middle'        =>   '&#xFED8;',
                'end'           =>   '&#xFED6;',
                'isolated'      =>   '&#xFED5;'
                ),
            'ك' => array(
                'beginning'     =>   '&#xFEDB;',
                'middle'        =>   '&#xFEDC;',
                'end'           =>   '&#xFEDA;',
                'isolated'      =>   '&#xFED9;'
                ),
            'ل' => array(
                'beginning'     =>   '&#xFEDF;',
                'middle'        =>   '&#xFEE0;',
                'end'           =>   '&#xFEDE;',
                'isolated'      =>   '&#xFEDD;'
                ),
            'م' => array(
                'beginning'     =>   '&#xFEE3;',
                'middle'        =>   '&#xFEE4;',
                'end'           =>   '&#xFEE2;',
                'isolated'      =>   '&#xFEE1;'
                ),
            'ن' => array(
                'beginning'     =>   '&#xFEE7;',
                'middle'        =>   '&#xFEE8;',
                'end'           =>   '&#xFEE6;',
                'isolated'      =>   '&#xFEE5;'
                ),
            'ه' => array(
                'beginning'     =>   '&#xFEEB;',
                'middle'        =>   '&#xFEEC;',
                'end'           =>   '&#xFEEA;',
                'isolated'      =>   '&#xFEE9;'
                ),
            'و' => array(
                'middle'        =>   '&#xFEEE;',
                'isolated'      =>   '&#xFEED;'
                ),
            'ي' => array(
                'beginning'     =>   '&#xFEF3;',
                'middle'        =>   '&#xFEF4;',
                'end'           =>   '&#xFEF2;',
                'isolated'      =>   '&#xFEF1;'
                ),
            'ئ' => array(
                'beginning'     =>   '&#xFE8B;',
                'middle'        =>   '&#xFE8C;',
                'end'           =>   '&#xFE8A;',
                'isolated'      =>   '&#xFE89;'
                ),
            'ة' => array(
                'middle'        =>   '&#xFE94;',
                'isolated'      =>   '&#xFE93;'
                )
        );

    if(in_array($word[0].$word[1], $isolated_chars))
    {
        $new_word[] = $all_chars[$word[0].$word[1]]['isolated'];
        $char_type[] = 'not_normal';
        $al_char[] = false;
    }
    else
    {
        if(in_array($word[0].$word[1], $lam) AND in_array($word[2].$word[3], $alef))
        {
            $new_word[] = $all_chars [$word[2].$word[3]]['la_beg'];
            $char_type[] = 'not_normal';

            $al_char[] = true;
        }
        else
        {

            $new_word[] = $all_chars[$word[0].$word[1]]['beginning'];
            $char_type[] = 'normal';
            $al_char[] = false;
        }

    }

    if(strlen($word) > 4)
    {
        if($char_type[0] == 'not_normal')

        {
            if(in_array($word[2].$word[3], $isolated_chars))
            {
                if($al_char[count($al_char)-1] == false)
                {
                    $new_word[] = $all_chars[$word[2].$word[3]]['isolated'];
                    $char_type[] = 'not_normal';

                }
                $al_char[] = false;

            }
            else
            {
                if(in_array($word[2].$word[3], $lam) AND in_array($word[4].$word[5], $alef))
                {
                    $new_word[] = $all_chars[$word[4].$word[5]]['la_beg'];
                    $char_type[] = 'not_normal';
                    $al_char[] = true;
                }
                else
                {
                    $new_word[] = $all_chars[$word[2].$word[3]]['beginning'];
                    $char_type[] = 'normal';
                    $al_char[] = false;
                }

            }
        }
        else
        {
            if(in_array($word[2].$word[3], $lam) AND in_array($word[4].$word[5], $alef))
            {

                $new_word[] = $all_chars[$word[4].$word[5]]['la_end'];
                $char_type[] = 'not_normal';
                $al_char[] = true;
            }
            else
            {
                $new_word[] = $all_chars[$word[2].$word[3]]['middle'];
                if(in_array($word[2].$word[3], $isolated_chars))
                {
                    $char_type[] = 'not_normal';
                    $al_char[] = false;
                }
                else
                {
                    $char_type[] = 'normal';
                    $al_char[] = false;
                }
            }

        }
        $x = 4;
    }
    else
    {
        $x = 2; 
    }

    for($x=4;$x< (strlen($word)-4) ;$x++)
    {
        if($char_type[count($char_type)-1] == 'not_normal' AND $x %2 == 0)
        {
            if(in_array($word[$x].$word[$x+1], $isolated_chars))
            {
                if($al_char[count($al_char)-1] == false)
                {
                    $new_word[] = $all_chars[$word[$x].$word[$x+1]]['isolated'];
                    $char_type[] = 'not_normal';

                }
                $al_char[] = false;
            }
            elseif(in_array($word[$x].$word[$x+1], $lam) AND in_array($word[$x+2].$word[$x+3], $alef))
            {

                $new_word[] = $all_chars[$word[$x+2].$word[$x+3]]['la_beg'];
                $char_type[] = 'not_normal';
                $al_char[] = true;
            }
            else
            {

                $new_word[] = $all_chars[$word[$x].$word[$x+1]]['beginning'];
                $char_type[] = 'normal';
                $al_char[] = false;
            }
        }
        elseif($char_type[count($char_type)-1] == 'normal' AND $x %2 == 0)
        {

            if(in_array($word[$x].$word[$x+1], $isolated_chars))
            {
                if($al_char[count($al_char)-1] == false)
                {
                    $new_word[] = $all_chars[$word[$x].$word[$x+1]]['middle'];
                    $char_type[] = 'not_normal';
                }
                $al_char[] = false;
            }
            elseif(in_array($word[$x].$word[$x+1], $lam) AND in_array($word[$x+2].$word[$x+3], $alef))
            {

                $new_word[] = $all_chars[$word[$x+2].$word[$x+3]]['la_end'];
                $char_type[] = 'not_normal';
                $al_char[] = true;
            }
            else
            {

                $new_word[] = $all_chars[$word[$x].$word[$x+1]]['middle'];
                $char_type[] = 'normal';
                $al_char[] = false;
            }
        }

    }
    if(strlen($word)>6)
    {
        if($char_type[count($char_type)-1] == 'not_normal')
        {
            if(in_array($word[$x].$word[$x+1], $isolated_chars))
            {
                if($al_char[count($al_char)-1] == false)
                {
                    $new_word[] = $all_chars[$word[$x].$word[$x+1]]['isolated'];
                    $char_type[] = 'not_normal';
                }
                $al_char[] = false;
            }
            else
            {

                if($word[strlen($word)-2].$word[strlen($word)-1] == 'ء')
                {
                    if($al_char[count($al_char)-1] == true)
                    {
                        $new_word[] = $all_chars[$word[$x].$word[$x+1]]['isolated'];
                        $char_type[] = 'normal';
                    }
                    $al_char[] = false;
                }
                elseif(in_array($word[$x].$word[$x+1], $lam) AND in_array($word[$x+2].$word[$x+3], $alef))
                {

                    $new_word[] = $all_chars[$word[$x+2].$word[$x+3]]['la_end'];
                    $char_type[] = 'not_normal';
                    $al_char[] = true;
                }
                else
                {
                    $new_word[] = $all_chars[$word[$x].$word[$x+1]]['beginning'];
                    $char_type[] = 'normal';
                    $al_char[] = false;
                }

            }

            $x += 2;
        }
        elseif($char_type[count($char_type)-1] == 'normal' AND $al_char[count($al_char)-1] == false)
        {

            if(in_array($word[$x].$word[$x+1], $isolated_chars))
            {
                if($al_char[count($al_char)-1] == false)
                {
                    $new_word[] = $all_chars[$word[$x].$word[$x+1]]['middle'];
                    $char_type[] = 'not_normal';
                }
                $al_char[] = false;
            }
            elseif(in_array($word[$x].$word[$x+1], $lam) AND in_array($word[$x+2].$word[$x+3], $alef))
            {

                $new_word[] = $all_chars[$word[$x+2].$word[$x+3]]['la_end'];
                $char_type[] = 'not_normal';
                $al_char[] = true;
            }
            else
            {

                $new_word[] = $all_chars[$word[$x].$word[$x+1]]['middle'];
                $char_type[] = 'normal';
                $al_char[] = false;
            }

            $x += 2;
        }


    }

    if($char_type[count($char_type)-1] == 'not_normal')
    {

        if(in_array($word[$x].$word[$x+1], $isolated_chars))
        {       
            if($al_char[count($al_char)-1] == false)
            {
                $new_word[] = $all_chars[$word[$x].$word[$x+1]]['isolated'];
            }

        }
        else
        {
            $new_word[] = $all_chars[$word[$x].$word[$x+1]]['isolated'];

        }

    }
    else
    {
        if(in_array($word[$x].$word[$x+1], $isolated_chars))
        {

            $new_word[] = $all_chars[$word[$x].$word[$x+1]]['middle'];

        }
        else
        {

            $new_word[] = $all_chars[$word[$x].$word[$x+1]]['end'];

        }
    }

    return implode('',array_reverse($new_word));
}

Using the function:

$word = 'لا اله الا الله محمد رسول الله , اللهم لا علم لي الا ما علمتني انك انت العليم الحكيم';
echo fixArabicCharactersAndCreateProbablyString($word);
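
The function reads each letter as `$word[$x].$word[$x+1]`, that is, two bytes at a time. That matches how the basic Arabic letters are encoded in UTF-8, but not spaces or commas, which take one byte each. A hedged sketch of splitting a UTF-8 string into characters rather than bytes, using `preg_split` with the `u` modifier; `utf8_chars` is a hypothetical helper name, not part of the function above.

```php
<?php
// Hypothetical helper: split a UTF-8 string into an array of characters.
function utf8_chars($text)
{
    // The /u modifier makes PCRE treat the pattern and subject as UTF-8.
    return preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
}

$sample = 'علم';
echo strlen($sample), "\n";            // byte count: 6 when this file is saved as UTF-8
echo count(utf8_chars($sample)), "\n"; // character count: 3
```

Because spaces and commas are single bytes, a sentence like the one above would probably need to be split into words before each word is passed to the function.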