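I have a text in which I want to count the occurrences of the phrase "lorem ipsum dolor".
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ipsum lorem dolor Curabitur ac risus nunc. Dolor ipsum lorem.
The algorithm should count an occurrence even when the words of the search phrase are written in a different order. I have highlighted the expected results. Is there a better way to achieve this than using a regular expression?
In this case the result should be 3.
The phrase will be about 3-4 words long, and the string will be the content of a web page.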
Answer 0 (score: 2)
You could try a regular expression:
/(?:(?:(?:lorem|ipsum|dolor)\s?)+)/gi
with preg_match_all
and then count the number of matches. From your example, you should get 3 matches.
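A minimal sketch of how that could look in PHP. PHP's PCRE functions have no `g` modifier (preg_match_all already finds every match), so only `i` is kept here. The variable names are illustrative, and the sample text is the one from the question.

```php
<?php
$text = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. '
      . 'Ipsum lorem dolor Curabitur ac risus nunc. Dolor ipsum lorem.';

// Same pattern as above, minus the "g" flag, which PHP does not accept.
$pattern = '/(?:(?:(?:lorem|ipsum|dolor)\s?)+)/i';

// preg_match_all returns the number of full-pattern matches it found.
$count = preg_match_all($pattern, $text, $matches);

echo $count, PHP_EOL;
```

Keep in mind that this pattern counts any run of the three words, so a lone "lorem" or a repeated "lorem lorem" would also count as one match. Whether that is acceptable depends on the page content.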
I'm not good at algorithms, nor at PHP, but here is an attempt...
<?php
$string = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ipsum lorem dolor Curabitur ac risus nunc. Dolor ipsum lorem.';
$lower_string = strtolower($string);
$text = array('lorem', 'ipsum', 'dolor');
$perms = AllPermutations($text);
$result = 0;
foreach ($perms as $piece) {
    $phrase = join(' ', $piece);
    $result += substr_count($lower_string, $phrase);
}

# From http://stackoverflow.com/a/12749950/1578604
function AllPermutations($InArray, $InProcessedArray = array())
{
    $ReturnArray = array();
    foreach ($InArray as $Key => $value)
    {
        $CopyArray = $InProcessedArray;
        $CopyArray[$Key] = $value;
        $TempArray = array_diff_key($InArray, $CopyArray);
        if (count($TempArray) == 0)
        {
            $ReturnArray[] = $CopyArray;
        }
        else
        {
            $ReturnArray = array_merge($ReturnArray, AllPermutations($TempArray, $CopyArray));
        }
    }
    return $ReturnArray;
}

echo $result;
?>
Answer 1 (score: 2)
$haystack = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ipsum lorem dolor Curabitur ac risus nunc. Dolor ipsum lorem.';
$needle = 'Lorem ipsum dolor';

$hayWords = str_word_count(
    strtolower($haystack),
    1
);
$needleWords = str_word_count(
    strtolower($needle),
    1
);
$needleWordsCount = count($needleWords);

$foundWords = array_intersect(
    $hayWords,
    $needleWords
);

$count = array_reduce(
    array_keys($foundWords),
    function($counter, $item) use ($foundWords, $needleWordsCount) {
        for ($i = $item; $i < $item + $needleWordsCount; ++$i) {
            if (!isset($foundWords[$i]))
                return $counter;
        }
        return ++$counter;
    },
    0
);

var_dump($count);
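The code above counts every run of consecutive haystack words that all appear in the needle, so a text such as "lorem lorem lorem" would also be counted once. If that matters, a stricter variant could slide a window over the words and compare sorted copies. This is only a sketch: countPhraseAnyOrder is a hypothetical helper name, not part of the answer, and like the original it relies on str_word_count, which ignores punctuation between words.

```php
<?php
// Hypothetical helper: counts windows of words that are a reordering of $needle.
function countPhraseAnyOrder(string $haystack, string $needle): int
{
    $hayWords    = str_word_count(strtolower($haystack), 1);
    $needleWords = str_word_count(strtolower($needle), 1);
    $size        = count($needleWords);
    if ($size === 0) {
        return 0;
    }
    // Sorting both sides makes the comparison independent of word order.
    sort($needleWords);

    $count = 0;
    for ($i = 0; $i + $size <= count($hayWords); ++$i) {
        $window = array_slice($hayWords, $i, $size);
        sort($window);
        if ($window === $needleWords) {
            ++$count;
        }
    }
    return $count;
}

$haystack = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. '
          . 'Ipsum lorem dolor Curabitur ac risus nunc. Dolor ipsum lorem.';

echo countPhraseAnyOrder($haystack, 'Lorem ipsum dolor'), PHP_EOL;
```

For a 3-4 word phrase the repeated sorting is cheap. For much longer phrases, a word-frequency map updated as the window moves would probably be a better fit.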
Answer 2 (score: 1)
I think you are looking for: http://nl1.php.net/substr_count
$text = 'This is a test';
echo strlen($text); // 14
echo substr_count($text, 'is'); // 2
// the string is reduced to 's is a test', so it prints 1
echo substr_count($text, 'is', 3);
// the text is reduced to 's i', so it prints 0
echo substr_count($text, 'is', 3, 3);
// generates a warning because 5+10 > 14
echo substr_count($text, 'is', 5, 10);
// prints only 1, because it doesn't count overlapped substrings
$text2 = 'gcdgcdgcd';
echo substr_count($text2, 'gcdgcd');
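Applied to the text from the question, substr_count on its own only finds the single ordering it is given, and it is case-sensitive. A short sketch (variable names are illustrative):

```php
<?php
$text = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. '
      . 'Ipsum lorem dolor Curabitur ac risus nunc. Dolor ipsum lorem.';

// Lowercase the text first, because substr_count compares case-sensitively.
$exact = substr_count(strtolower($text), 'lorem ipsum dolor');

// Without lowercasing, the capitalised "Lorem" does not match the needle.
$caseSensitive = substr_count($text, 'lorem ipsum dolor');

echo $exact, PHP_EOL;
echo $caseSensitive, PHP_EOL;
```

To cover the reordered occurrences as well, it would have to be called once per ordering of the words and the results summed.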
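Answer 3 (score: 1)
Good evening, everyone. I came up with another technique. It takes a different approach from what Mark Baker did, which I very much appreciate. Also, take a look at the memory usage.
In short, you take the base string (lorem ipsum dolor) and shuffle it into all of its possible orderings (in this case 3! = 6).
All 6 of these strings are then added to an array that is used for matching with substr_count. I am also using shuffle(), in_array and array_push.
The code is self-explanatory; if you are curious, here is my IDEONE. Here is Mark Baker's solution on IDEONE. They both take about the same time and memory, and my solution is 4 lines shorter, if not more elegant :)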
<?php
$string = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ipsum lorem dolor Curabitur ac risus nunc. Dolor ipsum lorem.';
// Convert the main string to lowercase to have an even playing field
$string2 = strtolower($string);
$substring = 'lorem ipsum dolor';
$wordArray = explode(" ", $substring);
// Add the first "lorem ipsum dolor" to launch the array
$arr = array($substring);
// Number of orderings, i.e. 6 (factorial of 3); assumes the words are all different
$total = array_product(range(1, count($wordArray)));
// Run until the array is full with all possible orderings
while (count($arr) < $total) {
    shuffle($wordArray);
    $randString = implode(" ", $wordArray);
    // Only push the random string if it is not in the array yet
    if (!in_array($randString, $arr)) {
        array_push($arr, $randString);
    }
}
//var_dump($arr);
// Here we do the matching: sum the occurrences of every ordering
$sum = 0;
$n = sizeof($arr);
for ($q = 0; $q < $n; $q++) {
    $sum += substr_count($string2, $arr[$q]);
}
echo "Total occurrences: " . $sum;
?>
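As you may have seen, Mark's code beats mine by a factor of about 2, but given the nature of this program and the amount of data involved, the difference is negligible. With a more complex program the difference could obviously become significant, but that is the nature of it.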