Question

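I want to get the most frequently used words from an array. The only problem is that the Swedish characters (Å, Ä, and Ö) only show up as �.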

$string = 'This is just a test post with the Swedish characters Å, Ä, and Ö. Also as lower cased characters: å, ä, and ö.';
echo '<pre>';
print_r(array_count_values(str_word_count($string, 1, 'àáãâçêéíîóõôúÀÁÃÂÇÊÉÍÎÓÕÔÚ')));
echo '</pre>';

The code outputs the following:

Array
(
    [This] => 1
    [is] => 1
    [just] => 1
    [a] => 1
    [test] => 1
    [post] => 1
    [with] => 1
    [the] => 1
    [Swedish] => 1
    [characters] => 2
    [�] => 1
    [�] => 1
    [and] => 2
    [�] => 1
    [Also] => 1
    [as] => 1
    [lower] => 1
    [cased] => 1
    [�] => 1
    [�] => 1
    [�] => 1
)

How can I make it "see" Swedish characters and other special characters?

Answer 1

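All of this operates under the assumption that you are using UTF-8.

You can take a naive approach with preg_split() and split the string on any separator, punctuation, or control character.

`preg_split` example: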

$split = preg_split('/[\pZ\pP\pC]/u', $string, -1, PREG_SPLIT_NO_EMPTY);
print_r(array_count_values($split));

Output:

Array
(
    [This] => 1
    [is] => 1
    [just] => 1
    [a] => 1
    [test] => 1
    [post] => 1
    [with] => 1
    [the] => 1
    [Swedish] => 1
    [characters] => 2
    [Å] => 1
    [Ä] => 1
    [and] => 2
    [Ö] => 1
    [Also] => 1
    [as] => 1
    [lower] => 1
    [cased] => 1
    [å] => 1
    [ä] => 1
    [ö] => 1
)

This works for your given string, but it does not necessarily split words in a locale-aware way. For example, a contraction such as "isn't" would be broken up into "isn" and "t".
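To make that limitation concrete, here is a minimal sketch that runs the same pattern over a sentence containing a contraction and a hyphenated word. The sample sentence and the variable names are made up for illustration.

```php
<?php
// Same naive pattern as above: split on separators (\pZ),
// punctuation (\pP) and control characters (\pC).
$sample = "This isn't a locale-aware split";
$parts  = preg_split('/[\pZ\pP\pC]/u', $sample, -1, PREG_SPLIT_NO_EMPTY);

// The apostrophe and the hyphen are both punctuation, so "isn't" and
// "locale-aware" are each cut in two.
print_r($parts);
```

The apostrophe and hyphen fall under \pP, which is why both words are split at those characters.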

Thankfully, the Intl extension adds a great deal of functionality in PHP 7 for dealing with things like this.

The plan is to:

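* Normalize the input with Normalizer::normalize() to ensure glyph data is encoded in a consistent manner. For instance, ä might be encoded, and therefore counted, in several ways:
  - U+00E4 'LATIN SMALL LETTER A WITH DIAERESIS', or
  - U+0061 'LATIN SMALL LETTER A' followed by U+0308 'COMBINING DIAERESIS'
* Obtain an IntlBreakIterator for breaking words in a locale-dependent way via IntlBreakIterator::createWordInstance(). This understands what constitutes a "word" for a given locale, including handling contractions such as "isn't".
* Obtain an IntlPartsIterator via IntlBreakIterator::getPartsIterator() for easy foreach iteration over the text fragments.
* Filter out unwanted fragments via IntlChar::ispunct() and IntlChar::isspace().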

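(Note that you will probably want to normalize no matter which method you use to break the string apart; it is appropriate to do it before the preg_split above or before whatever approach you settle on.)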
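As a rough illustration of why normalization matters for counting, the sketch below counts the two encodings of ä before and after normalization. It assumes the intl extension is loaded; the variable names are hypothetical.

```php
<?php
// Requires the intl extension for the Normalizer class.
$composed   = "\u{00E4}";   // U+00E4, precomposed ä
$decomposed = "a\u{0308}";  // U+0061 followed by U+0308

// The two spellings are different byte sequences,
// so they end up under two separate keys.
$raw = array_count_values([$composed, $decomposed]);

// Normalizer::normalize() defaults to NFC, which maps both
// spellings to the precomposed form, giving a single key.
$normalized = array_count_values([
    Normalizer::normalize($composed),
    Normalizer::normalize($decomposed),
]);

var_dump(count($raw), count($normalized));
```

Without normalization, the same visible word could be split across several counters depending on how its author typed it.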

Intl example:

$string = Normalizer::normalize($string);

$iter = IntlBreakIterator::createWordInstance("sv_SE");
$iter->setText($string);
$words = $iter->getPartsIterator();

$split = [];
foreach ($words as $word) {
    // skip text fragments consisting only of a space or punctuation character
    if (IntlChar::isspace($word) || IntlChar::ispunct($word)) {
        continue;
    }
    $split[] = $word;
}

print_r(array_count_values($split));

Output:

Array
(
    [This] => 1
    [is] => 1
    [just] => 1
    [a] => 1
    [test] => 1
    [post] => 1
    [with] => 1
    [the] => 1
    [Swedish] => 1
    [characters] => 2
    [Å] => 1
    [Ä] => 1
    [and] => 2
    [Ö] => 1
    [Also] => 1
    [as] => 1
    [lower] => 1
    [cased] => 1
    [å] => 1
    [ä] => 1
    [ö] => 1
)

This is more verbose, but it may be worthwhile if you would rather have ICU (the library backing the Intl extension) decide what constitutes a word.
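Since the question asks for the most used words, the counts from either example still need to be ranked. One possible way to do this, sketched with a small hard-coded array standing in for the result of array_count_values($split), is:

```php
<?php
// $counts stands in for array_count_values($split) from either example above.
$counts = ['This' => 1, 'characters' => 2, 'and' => 2, 'Å' => 1];

// Sort by count, highest first, keeping the word => count association.
arsort($counts);

// Keep the two most frequent entries; 2 is an arbitrary cut-off.
$top = array_slice($counts, 0, 2, true);

print_r($top);
```

The relative order of words with equal counts may differ between PHP versions, so treat ties as unordered.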

Answer 2

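Here is a solution that uses a regular expression with Unicode-aware punctuation matching to split the "words", followed by a plain array count of occurrences.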

array_count_values(preg_split('/[[:punct:]\s]+/u', $string, -1, PREG_SPLIT_NO_EMPTY));

Produces:

Array
(
    [This] => 1
    [is] => 1
    [just] => 1
    [a] => 1
    [test] => 1
    [post] => 1
    [with] => 1
    [the] => 1
    [Swedish] => 1
    [characters] => 2
    [Å] => 1
    [Ä] => 1
    [and] => 2
    [Ö] => 1
    [Also] => 1
    [as] => 1
    [lower] => 1
    [cased] => 1
    [å] => 1
    [ä] => 1
    [ö] => 1
)

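This was tested in a Unicode console; if you are using a browser, you may have to play around with the encoding. Either add a <meta> tag, set the encoding in the browser, or send a PHP header.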
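For the browser case, a minimal sketch of the header and <meta> options might look like the following. It assumes the script file itself is saved as UTF-8; the shortened sample string is made up for illustration.

```php
<?php
// Declare the response as UTF-8 before any output is sent.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

$string = 'Å, Ä, and Ö';
$counts = array_count_values(
    preg_split('/[[:punct:]\s]+/u', $string, -1, PREG_SPLIT_NO_EMPTY)
);

// Alternative to the header: declare the encoding in the markup.
echo '<meta charset="utf-8">';
echo '<pre>';
print_r($counts);
echo '</pre>';
```

Either declaration alone is usually enough; what matters is that the declared encoding matches the bytes the script actually emits.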

Answer 3

I managed to get rid of the � marks by adding ÅåÄäÖö to àáãâçêéíîóõôúÀÁÃÂÇÊÉÍÎÓÕÔÚ.
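Applied to the code from the question, that change could look like the sketch below. It assumes the file and the string are UTF-8 encoded; the shortened sample string and the $extra variable are made up for illustration.

```php
<?php
// Assumes this file and $string are both UTF-8 encoded.
$string = 'Swedish characters Å, Ä, and Ö. Lower cased: å, ä, and ö.';

// Original character list from the question, plus the Swedish letters.
$extra = 'àáãâçêéíîóõôúÀÁÃÂÇÊÉÍÎÓÕÔÚ' . 'ÅåÄäÖö';

$counts = array_count_values(str_word_count($string, 1, $extra));
print_r($counts);
```

Keep in mind that str_word_count() does not support multibyte encodings. The extra characters are treated as a set of allowed bytes, not as whole letters, so this works for the listed letters but is not a general Unicode solution.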

Get most used words with special characters
