I have a PHP array:
$excerpts = array(
'I love cheap red apples',
'Cheap red apples are what I love',
'Do you sell cheap red apples?',
'I want red apples',
'Give me my red apples',
'OK now where are my apples?'
);
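I want to find all the n-grams in these lines to get a result like this:
I tried imploding the array and then parsing the result, but that is a poor approach: concatenating the strings produces spurious n-grams, because it joins words from excerpts that have nothing to do with each other.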
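To illustrate the problem, here is a small sketch (the variable names are illustrative only). Joining two of the excerpts with a space creates the sequence "apples cheap", which does not occur in either excerpt on its own:

```php
<?php
$excerpts = array(
    'I love cheap red apples',
    'Cheap red apples are what I love',
);

// Naive approach: glue everything into one string.
$joined = strtolower(implode(' ', $excerpts));

// "apples cheap" only exists across the boundary between the two excerpts.
$foundInJoined = strpos($joined, 'apples cheap') !== false;

$foundInAnyExcerpt = false;
foreach ($excerpts as $excerpt) {
    if (strpos(strtolower($excerpt), 'apples cheap') !== false) {
        $foundInAnyExcerpt = true;
    }
}

var_dump($foundInJoined);     // bool(true)
var_dump($foundInAnyExcerpt); // bool(false)
```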
How would you proceed?
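Answer 0 (score: 3)
I want to find groups of words without knowing them in advance. With your function, I would have to supply them before anything else.
Try this: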
mb_internal_encoding('UTF-8');
// Join with ".\n" so that every excerpt ends at a sentence boundary.
$joinedExcerpts = implode(".\n", $excerpts);
// Split into sentences on any character that is neither whitespace nor a letter.
$sentences = preg_split('/[^\s\pL]/u', $joinedExcerpts, -1, PREG_SPLIT_NO_EMPTY);
$wordsSequencesCount = array();
foreach ($sentences as $sentence) {
    // Lowercased words of the current sentence.
    $words = array_map('mb_strtolower',
        preg_split('/[^\pL]+/u', $sentence, -1, PREG_SPLIT_NO_EMPTY));
    foreach ($words as $index => $word) {
        // Count every word sequence that starts at this word.
        $wordsSequence = '';
        foreach (array_slice($words, $index) as $nextWord) {
            $wordsSequence .= $wordsSequence ? (' ' . $nextWord) : $nextWord;
            if (!isset($wordsSequencesCount[$wordsSequence])) {
                $wordsSequencesCount[$wordsSequence] = 0;
            }
            ++$wordsSequencesCount[$wordsSequence];
        }
    }
}
// Keep only the sequences that occur more than once.
$ngramsCount = array_filter($wordsSequencesCount,
    function ($count) { return $count > 1; });
I am assuming you only want groups of words that are repeated.
The output of var_dump($ngramsCount); is:
array (size=11)
  'i' => int 3
  'i love' => int 2
  'love' => int 2
  'cheap' => int 3
  'cheap red' => int 3
  'cheap red apples' => int 3
  'red' => int 5
  'red apples' => int 5
  'apples' => int 6
  'are' => int 2
  'my' => int 2
The code could be optimized, for example to use less memory.
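One possible way to cut memory use, as a rough sketch rather than a tuned implementation: cap the n-gram length so that each starting word contributes at most a few sequences instead of every suffix of the sentence, and process the excerpts one by one instead of imploding them. The function name countNgrams and the parameters $maxWords and $minCount are hypothetical and not part of the code above.

```php
<?php
// Hypothetical helper: counts word sequences of at most $maxWords words
// and returns those that occur at least $minCount times.
function countNgrams(array $excerpts, int $maxWords = 3, int $minCount = 2): array
{
    $counts = array();
    foreach ($excerpts as $excerpt) {
        // Sentences are separated by anything that is not whitespace or a letter.
        $sentences = preg_split('/[^\s\pL]+/u', $excerpt, -1, PREG_SPLIT_NO_EMPTY);
        foreach ($sentences as $sentence) {
            $words = preg_split('/[^\pL]+/u', mb_strtolower($sentence, 'UTF-8'), -1, PREG_SPLIT_NO_EMPTY);
            $total = count($words);
            for ($start = 0; $start < $total; $start++) {
                $sequence = '';
                // Never extend a sequence beyond $maxWords words.
                $end = min($total, $start + $maxWords);
                for ($i = $start; $i < $end; $i++) {
                    $sequence = $sequence === '' ? $words[$i] : $sequence . ' ' . $words[$i];
                    $counts[$sequence] = ($counts[$sequence] ?? 0) + 1;
                }
            }
        }
    }
    return array_filter($counts, function ($count) use ($minCount) {
        return $count >= $minCount;
    });
}

$excerpts = array(
    'I love cheap red apples',
    'Cheap red apples are what I love',
    'Do you sell cheap red apples?',
    'I want red apples',
    'Give me my red apples',
    'OK now where are my apples?'
);
$result = countNgrams($excerpts);
print_r($result);
```

With the default cap of three words, the sample excerpts give the same eleven repeated sequences as above, because none of the repeated sequences is longer than three words. Longer inputs would need a larger $maxWords to catch longer phrases.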
Answer 1 (score: 1)
The code provided by Pedro Amaral Couto is very good. Since I use it for French, I modified the regular expression as follows:
// The hyphen is escaped so that it is read literally rather than as a range.
$sentences = preg_split('/[^\s\pL\'’\-]/u', $joinedExcerpts, -1, PREG_SPLIT_NO_EMPTY);
This way, we can analyze words that contain hyphens and apostrophes ("est-ce que", "j'ai", etc.).
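Note that this expression only changes how sentences are split. The word-level split in the original code (the `[^\pL+]` pattern) would still break words at hyphens and apostrophes. A minimal sketch, assuming the same characters should also be kept inside words; the sample sentence and variable names are illustrative only:

```php
<?php
mb_internal_encoding('UTF-8');

// Illustrative input, not taken from the original post.
$sentence = "Est-ce que j'ai raison";

// Split on runs of anything that is not a letter, an apostrophe or a hyphen.
// The hyphen is escaped so that it is read literally, not as a range.
$words = array_map(
    'mb_strtolower',
    preg_split('/[^\pL\'’\-]+/u', $sentence, -1, PREG_SPLIT_NO_EMPTY)
);

print_r($words);
```

Here "est-ce" and "j'ai" each stay a single word instead of being split into two.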
Answer 2 (score: 0)
Try this (using implode, since you mentioned having tried it):
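// The n-grams to look for must be known in advance.
$ngrams = array(
    'cheap red apples',
    'red apples',
    'apples',
);
$joinedExcerpts = implode("\n", $excerpts);
$nGramsCount = array_fill_keys($ngrams, 0);
var_dump($ngrams, $joinedExcerpts); // inspect the inputs
foreach ($ngrams as $ngram) {
    // Match the n-gram only when it is neither preceded nor followed by a letter.
    // Lookarounds do not consume the neighbouring character, so adjacent matches are not missed.
    $regex = '/(?<!\pL)' . preg_quote($ngram, '/') . '(?!\pL)/ui';
    $nGramsCount[$ngram] = preg_match_all($regex, $joinedExcerpts);
}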
Answer 3 (score: -1)
Assuming you just want to count the number of occurrences of a string:
$cheapRedAppleCount = 0;
$redAppleCount = 0;
$appleCount = 0;
for ($i = 0; $i < count($excerpts); $i++)
{
    // Patterns need delimiters; the i flag also matches "Cheap red apples".
    $cheapRedAppleCount += preg_match_all('/cheap red apples/i', $excerpts[$i]);
    $redAppleCount += preg_match_all('/red apples/i', $excerpts[$i]);
    $appleCount += preg_match_all('/apples/i', $excerpts[$i]);
}
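preg_match_all returns the number of matches in the given string, so you only need to add that number to the counter.
See the preg_match_all documentation for more information.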
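The same idea can be written without one variable per phrase. This is only a sketch; the $phrases and $counts names are illustrative, and preg_quote is used on the assumption that the phrases are literal text rather than patterns:

```php
<?php
$excerpts = array(
    'I love cheap red apples',
    'Cheap red apples are what I love',
    'Do you sell cheap red apples?',
    'I want red apples',
    'Give me my red apples',
    'OK now where are my apples?'
);

// Phrases to count; they must be known in advance.
$phrases = array('cheap red apples', 'red apples', 'apples');
$counts = array_fill_keys($phrases, 0);

foreach ($phrases as $phrase) {
    // Delimited, case-insensitive pattern built from the literal phrase.
    $pattern = '/' . preg_quote($phrase, '/') . '/i';
    foreach ($excerpts as $excerpt) {
        // preg_match_all returns the number of matches, or false on error.
        $counts[$phrase] += (int) preg_match_all($pattern, $excerpt);
    }
}

print_r($counts);
```

Because the patterns have no word boundaries, a phrase is also counted when it appears inside a longer word.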
Apologies if I have misunderstood.