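I want to get the most frequently used words from an array. The only problem is that the Swedish characters (Å, Ä, and Ö) show up only as �.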
$string = 'This is just a test post with the Swedish characters Å, Ä, and Ö. Also as lower cased characters: å, ä, and ö.';
echo '<pre>';
print_r(array_count_values(str_word_count($string, 1, 'àáãâçêéíîóõôúÀÁÃÂÇÊÉÍÎÓÕÔÚ')));
echo '</pre>';
This code outputs the following:
Array
(
[This] => 1
[is] => 1
[just] => 1
[a] => 1
[test] => 1
[post] => 1
[with] => 1
[the] => 1
[Swedish] => 1
[characters] => 2
[�] => 1
[�] => 1
[and] => 2
[�] => 1
[Also] => 1
[as] => 1
[lower] => 1
[cased] => 1
[�] => 1
[�] => 1
[�] => 1
)
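How can I make it "see" the Swedish characters and other special characters?
Answer 0 (score: 4)
All of this works under the assumption that you are using UTF-8.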
You can take a naive approach and use preg_split() to split the string on any separator, punctuation, or control character.
Example:
$split = preg_split('/[\pZ\pP\pC]/u', $string, -1, PREG_SPLIT_NO_EMPTY);
print_r(array_count_values($split));
Array
(
[This] => 1
[is] => 1
[just] => 1
[a] => 1
[test] => 1
[post] => 1
[with] => 1
[the] => 1
[Swedish] => 1
[characters] => 2
[Å] => 1
[Ä] => 1
[and] => 2
[Ö] => 1
[Also] => 1
[as] => 1
[lower] => 1
[cased] => 1
[å] => 1
[ä] => 1
[ö] => 1
)
This works for your given string, but it does not necessarily split words in a locale-aware way. For example, a contraction such as "isn't" would be broken up into "isn" and "t" by this.
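To see the limitation concretely, here is a minimal sketch that runs the same pattern over a short sentence containing a contraction. The sample sentence is made up for illustration; the pattern is the one from the example above.

```php
// Hypothetical sample input containing a contraction.
$sample = "This isn't a test";

// Same pattern as above: split on separators (\pZ), punctuation (\pP)
// and control characters (\pC). The apostrophe counts as punctuation.
$parts = preg_split('/[\pZ\pP\pC]/u', $sample, -1, PREG_SPLIT_NO_EMPTY);

print_r($parts);
```

The resulting array holds "isn" and "t" as two separate entries, so they would also be counted as two separate words.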
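Thankfully, the Intl extension added a great deal of functionality in PHP 7 for handling things like this.
The plan is to:
1. Normalize* the input with Normalizer::normalize() to ensure that the character data is encoded in a consistent way. For example, ä can be encoded, and would therefore be counted, in several ways.
2. Get an IntlBreakIterator that breaks up words in a locale-dependent way with IntlBreakIterator::createWordInstance(). It understands what constitutes a "word" for a given locale, including the handling of contractions such as "isn't".
3. Get an IntlPartsIterator with IntlBreakIterator::getPartsIterator(), which makes it easy to iterate over the text fragments.
(*Note that you will probably want to normalize no matter which method you use to break up the string. It would be appropriate to do it before the preg_split above, or before whatever you decide to go with.)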
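As a small illustration of why the normalization step matters, the following sketch counts the same letter written in two different encodings, first without and then with Normalizer::normalize(). The variable names are made up for this example, and it assumes the intl extension is loaded.

```php
// Requires the intl extension. "\u{...}" escapes need PHP 7 or later.
$precomposed = "\u{00E4}";   // ä as a single code point
$decomposed  = "a\u{0308}";  // a followed by COMBINING DIAERESIS

$raw = [$precomposed, $decomposed];

// Without normalization the two byte sequences are counted separately.
$rawCounts = array_count_values($raw);

// Normalizer::normalize() defaults to form C, which composes both to U+00E4.
$normalized = array_map(function ($s) {
    return Normalizer::normalize($s);
}, $raw);
$normalizedCounts = array_count_values($normalized);

var_dump(count($rawCounts));        // int(2)
var_dump(count($normalizedCounts)); // int(1)
```

Both strings render as ä, but only the normalized version is counted as one word occurring twice.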
$string = Normalizer::normalize($string);
$iter = IntlBreakIterator::createWordInstance("sv_SE");
$iter->setText($string);
$words = $iter->getPartsIterator();
$split = [];
foreach ($words as $word) {
    // skip text fragments consisting only of a space or punctuation character
    if (IntlChar::isspace($word) || IntlChar::ispunct($word)) {
        continue;
    }
    $split[] = $word;
}
print_r(array_count_values($split));
Array
(
[This] => 1
[is] => 1
[just] => 1
[a] => 1
[test] => 1
[post] => 1
[with] => 1
[the] => 1
[Swedish] => 1
[characters] => 2
[Å] => 1
[Ä] => 1
[and] => 2
[Ö] => 1
[Also] => 1
[as] => 1
[lower] => 1
[cased] => 1
[å] => 1
[ä] => 1
[ö] => 1
)
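This is more verbose, but it may be worth it if you would rather let ICU (the library that backs the Intl extension) do the work of understanding what constitutes a word.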
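One caveat, as far as I can tell: IntlChar::isspace() and IntlChar::ispunct() expect a single code point, so a fragment made up of several characters (for example a run of spaces, which some ICU versions return as one fragment) may slip through the check. A possible variant, sketched here on the assumption that the break iterator's rule status describes the fragment the parts iterator has just yielded, filters on IntlBreakIterator::getRuleStatus() instead. The sample text is hypothetical.

```php
// Requires the intl extension.
$text = "This isn't  a test: Å, ä."; // hypothetical sample, note the double space

$iter = IntlBreakIterator::createWordInstance('sv_SE');
$iter->setText(Normalizer::normalize($text));

$split = [];
foreach ($iter->getPartsIterator() as $part) {
    $status = $iter->getRuleStatus();
    // Statuses in [WORD_NONE, WORD_NONE_LIMIT) mark spaces and punctuation.
    if ($status >= IntlBreakIterator::WORD_NONE
        && $status < IntlBreakIterator::WORD_NONE_LIMIT) {
        continue;
    }
    $split[] = $part;
}

print_r($split);
```

The result can be passed to array_count_values() exactly as before.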
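Answer 1 (score: 1)
Here is a solution that uses a regular expression with Unicode-aware punctuation matching to split out the "words", followed by an ordinary count of array occurrences.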
print_r(array_count_values(preg_split('/[[:punct:]\s]+/u', $string, -1, PREG_SPLIT_NO_EMPTY)));
Produces:
Array
(
[This] => 1
[is] => 1
[just] => 1
[a] => 1
[test] => 1
[post] => 1
[with] => 1
[the] => 1
[Swedish] => 1
[characters] => 2
[Å] => 1
[Ä] => 1
[and] => 2
[Ö] => 1
[Also] => 1
[as] => 1
[lower] => 1
[cased] => 1
[å] => 1
[ä] => 1
[ö] => 1
)
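Since the question asks for the most frequently used words, the counts can then be sorted and trimmed. The following is only a sketch: the function name most_common_words() is made up, and lower-casing the text first (so that "Å" and "å" count as the same word) is an assumption that goes beyond the output above, which is case-sensitive.

```php
// Requires the mbstring extension for mb_strtolower().
function most_common_words(string $text, int $limit): array
{
    // Lower-case first so that "Å" and "å" are counted as the same word.
    $text = mb_strtolower($text, 'UTF-8');
    $words = preg_split('/[[:punct:]\s]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

    $counts = array_count_values($words);
    arsort($counts); // highest count first, keys preserved

    return array_slice($counts, 0, $limit, true);
}

$top = most_common_words('Å å och Ö och å', 2);
print_r($top);
```

For this sample input the two most common words are "å" (3 occurrences) and "och" (2 occurrences).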
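This was tested in a Unicode console; if you are using a browser, you may need to deal with the encoding. Either add a <meta> tag, set the encoding in the browser, or send a header from PHP.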
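A minimal sketch of the header and <meta> options, assuming the script itself is saved as UTF-8. The sample array is made up for illustration.

```php
// Tell the browser that the response body is UTF-8 before any output is sent.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

// The <meta> tag states the same encoding inside the document itself.
$page = '<!DOCTYPE html><html><head><meta charset="utf-8"></head><body><pre>'
      . htmlspecialchars(print_r(['Å' => 1, 'ä' => 1], true), ENT_QUOTES, 'UTF-8')
      . '</pre></body></html>';

echo $page;
```

Either one is usually enough; sending both keeps the page readable even if it is later saved to disk and opened without the HTTP header.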
Answer 2 (score: 0)
I managed to get rid of the � marks by adding ÅåÄäÖö to àáãâçêéíîóõôúÀÁÃÂÇÊÉÍÎÓÕÔÚ.
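In code, that change might look like the sketch below. The sample string is made up, and it assumes both the source file and the input are UTF-8. Keep in mind that str_word_count() works on bytes rather than characters, so the extra character list effectively whitelists the individual bytes of those UTF-8 sequences.

```php
// Assumes this source file and the input string are both saved as UTF-8.
$charlist = 'àáãâçêéíîóõôúÀÁÃÂÇÊÉÍÎÓÕÔÚ' . 'ÅåÄäÖö';

// Hypothetical sample input.
$words = str_word_count('Å, Ä och Ö', 1, $charlist);

print_r($words);
```

With the Swedish letters in the list, "Å", "Ä" and "Ö" come back as intact words instead of broken byte sequences.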