Question

I wrote the following function to return a specific number of words from a text:

function brief_text($text, $num_words = 50) {
    $words = str_word_count($text, 1);
    $required_words = array_slice($words, 0, $num_words);
    return implode(" ", $required_words);
}

It works well with English, but when I try to use it with Arabic it fails and does not return the words as expected. For example:

$text_en = "Cairo is the capital of Egypt and Paris is the capital of France";
echo brief_text($text_en, 10);

will output Cairo is the capital of Egypt and Paris is the

$text_ar = "القاهرة هى عاصمة مصر وباريس هى عاصمة فرنسا";
echo brief_text($text_ar, 10);

will output � � � � � � � � � �.

I know the problem is in the str_word_count function, but I don't know how to fix it.
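For background on why this fails: str_word_count() works on bytes and on the current locale's idea of a letter, and the PHP manual notes that multibyte locales are not supported. The minimal sketch below, which assumes the file is saved as UTF-8 and the mbstring extension is loaded, shows that each Arabic letter occupies two bytes, so a byte-oriented function cannot see whole letters:

```php
<?php
// Assumption: this file is saved as UTF-8 and mbstring is available.
$word = "مصر"; // three Arabic letters

$bytes = strlen($word);             // counts bytes
$chars = mb_strlen($word, 'UTF-8'); // counts characters

echo $bytes, PHP_EOL; // 6
echo $chars, PHP_EOL; // 3
```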

Update

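I have written another function that works with both English and Arabic, but I am still looking for a solution to the problem caused by the str_word_count() function. Anyway, here is my other function: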

    function brief_text($string, $number_of_required_words = 50) {
        $string = trim(preg_replace('/\s+/', ' ', $string));
        $words = explode(" ", $string);
        $required_words = array_slice($words, 0, $number_of_required_words); // get a specific number of elements from the array
        return implode(" ", $required_words);
    }

Answer 1

Try using this function to count the words:

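// You can name the function as you like
if (!function_exists('mb_str_word_count'))
{
    // $charlist is kept for signature compatibility but is not used.
    function mb_str_word_count($string, $format = 0, $charlist = '[]') {
        mb_internal_encoding('UTF-8');
        mb_regex_encoding('UTF-8');

        // Split on runs of characters outside the Arabic block (U+0600 to U+06FF)
        // and drop the empty strings left by leading or trailing separators.
        $words = array_values(array_filter(
            mb_split('[^\x{0600}-\x{06FF}]+', $string),
            'strlen'
        ));
        switch ($format) {
            case 0:
                return count($words);
            case 1:
            case 2: // note: unlike str_word_count(), positions are not used as keys
            default:
                return $words;
        }
    }
}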



echo mb_str_word_count("القاهرة هى عاصمة مصر وباريس هى عاصمة فرنسا") . PHP_EOL;
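
One way the same splitting idea could be plugged back into the question's brief_text() is sketched below. The names arabic_words() and brief_text_ar() are hypothetical, and preg_split() with the /u modifier is swapped in for mb_split() so that empty fragments can be dropped with PREG_SPLIT_NO_EMPTY. Like the function above, it keeps only runs of characters from the Arabic block, so Latin words are discarded.

```php
<?php
// Hypothetical helper: split a UTF-8 string into runs of characters from the
// Arabic block (U+0600 to U+06FF), discarding empty fragments.
function arabic_words(string $text): array
{
    // preg_split() returns false on invalid UTF-8; fall back to an empty list.
    return preg_split('/[^\x{0600}-\x{06FF}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];
}

// Hypothetical variant of brief_text() that keeps the first $num_words Arabic words.
function brief_text_ar(string $text, int $num_words = 50): string
{
    return implode(' ', array_slice(arabic_words($text), 0, $num_words));
}

echo brief_text_ar("القاهرة هى عاصمة مصر وباريس هى عاصمة فرنسا", 3), PHP_EOL;
// prints: القاهرة هى عاصمة
```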

Resources

Unicode list for Arabic
A Rule-Based Arabic Stemming Algorithm
A Rule and Template Based Stemming Algorithm for Arabic Language (seems more complete)

Recommendations

Use the <meta charset="UTF-8"/> tag in your HTML files
Always send the Content-type: text/html; charset=utf-8 header when serving pages
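
The two recommendations above might look like this at the top of a PHP page. This is a minimal sketch that assumes the script runs before any output has been sent; the mb_internal_encoding() call is an extra assumption so that the mb_* functions default to UTF-8 as well.

```php
<?php
// Assumption: this runs at the top of a page script, before any output.
header('Content-Type: text/html; charset=utf-8'); // HTTP-level declaration
mb_internal_encoding('UTF-8');                    // default encoding for mb_* functions

// Document-level declaration through the meta tag.
echo '<!DOCTYPE html><html lang="ar" dir="rtl"><head><meta charset="UTF-8"/></head>';
echo '<body>القاهرة هى عاصمة مصر</body></html>';
```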

Answer 2

To accept ASCII characters as well:

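if (!function_exists('mb_str_word_count'))
{
    // $charlist is kept for signature compatibility but is not used.
    function mb_str_word_count($string, $format = 0, $charlist = '[]') {
        $string = trim($string);
        if ($string === '') // empty() would also treat the string "0" as empty
            $words = array();
        else
            // Split on anything that is not a letter, a digit or an apostrophe;
            // PREG_SPLIT_NO_EMPTY drops fragments left by leading or trailing punctuation.
            $words = preg_split('~[^\p{L}\p{N}\']+~u', $string, -1, PREG_SPLIT_NO_EMPTY);
        switch ($format) {
            case 0:
                return count($words);
            case 1:
            case 2:
            default:
                return $words;
        }
    }
}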
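
A short usage sketch of the same pattern on mixed Arabic and English input follows. The helper name first_words() is hypothetical, and the pattern adds \p{M} as an assumption beyond the answer's code, so that Arabic diacritics (combining marks, which \p{L} does not cover) stay attached to their words.

```php
<?php
// Hypothetical helper built on the same idea: split on anything that is not
// a letter (\p{L}), combining mark (\p{M}), digit (\p{N}) or apostrophe.
// \p{M} is an addition assumed here so that Arabic diacritics stay attached.
function first_words(string $text, int $limit = 50): string
{
    // preg_split() returns false on invalid UTF-8; fall back to an empty list.
    $words = preg_split('~[^\p{L}\p{M}\p{N}\']+~u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    return implode(' ', array_slice($words, 0, $limit));
}

echo first_words("Cairo القاهرة is the capital of مصر, and Paris...", 4), PHP_EOL;
// prints: Cairo القاهرة is the
```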

The str_word_count() function does not handle Arabic correctly
