Question

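I have a text in which I want to count the occurrences of the phrase "lorem ipsum dolor".

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ipsum lorem dolor Curabitur ac risus nunc. Dolor ipsum lorem.

The algorithm should count an occurrence even when the words of the search phrase appear in a different order. I have highlighted the expected results. Is there a better way to achieve this than using a regular expression?

In this case the result should equal 3: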

Lorem ipsum dolor
Ipsum lorem dolor
Dolor ipsum lorem

The phrase will have about 3-4 words, and the string will be the content of a web page.

Answer 1

You can try a regular expression:

/(?:(?:(?:lorem|ipsum|dolor)\s?)+)/gi

Use preg_match_all and then count the number of matches. With your example, you should get 3 matches.
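As a minimal sketch of that suggestion (the variable names are illustrative, not from the answer): PHP's PCRE functions do not accept the `g` flag, so it is dropped here and `preg_match_all` does the global matching itself.

```php
<?php
$text = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ipsum lorem dolor Curabitur ac risus nunc. Dolor ipsum lorem.';

// Same pattern as above, minus the "g" flag, which PHP's preg_* functions reject.
$pattern = '/(?:(?:(?:lorem|ipsum|dolor)\s?)+)/i';

// preg_match_all returns the number of full-pattern matches.
$count = preg_match_all($pattern, $text, $matches);

echo $count, "\n"; // 3 for this sample text
```

Note that this pattern also matches a single stray word such as a lone "lorem", so the count is only right when the three words never appear outside the phrase.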

I'm not good at algorithms, nor at PHP, but here is an attempt...

<?php

$string = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ipsum lorem dolor Curabitur ac risus nunc. Dolor ipsum lorem.';

$lower_string = strtolower($string);

$text = array('lorem', 'ipsum', 'dolor');

$perms = AllPermutations($text);
$result = 0;
foreach ($perms as $piece) {
    $phrase = join(' ', $piece);
    $result += substr_count($lower_string, $phrase);
}

# From http://stackoverflow.com/a/12749950/1578604
function AllPermutations($InArray, $InProcessedArray = array())
{
    $ReturnArray = array();
    foreach($InArray as $Key=>$value)
    {
        $CopyArray = $InProcessedArray;
        $CopyArray[$Key] = $value;
        $TempArray = array_diff_key($InArray, $CopyArray);
        if (count($TempArray) == 0)
        {
            $ReturnArray[] = $CopyArray;
        }
        else
        {
            $ReturnArray = array_merge($ReturnArray, AllPermutations($TempArray, $CopyArray));
        }
    }
    return $ReturnArray;
}

echo $result;
?>


Answer 2

$haystack = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ipsum lorem dolor Curabitur ac risus nunc. Dolor ipsum lorem.';
$needle = 'Lorem ipsum dolor';

$hayWords = str_word_count(
    strtolower($haystack), 
    1
);
$needleWords = str_word_count(
    strtolower($needle), 
    1
);
$needleWordsCount = count($needleWords);

$foundWords = array_intersect(
    $hayWords, 
    $needleWords
);

$count = array_reduce(
    array_keys($foundWords),
    function($counter, $item) use ($foundWords, $needleWordsCount) {
        for($i = $item; $i < $item + $needleWordsCount; ++$i) {
            if (!isset($foundWords[$i]))
                return $counter;
        }
        return ++$counter;
    },
    0
);

var_dump($count);

Answer 3

I think you are looking for: http://nl1.php.net/substr_count

$text = 'This is a test';
echo strlen($text); // 14

echo substr_count($text, 'is'); // 2

// the string is reduced to 's is a test', so it prints 1
echo substr_count($text, 'is', 3);

// the text is reduced to 's i', so it prints 0
echo substr_count($text, 'is', 3, 3);

// generates a warning because 5+10 > 14
echo substr_count($text, 'is', 5, 10);


// prints only 1, because it doesn't count overlapped substrings
$text2 = 'gcdgcdgcd';
echo substr_count($text2, 'gcdgcd');
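Applied to the text from the question, a short sketch like the following suggests that `substr_count` on its own only finds the phrase in the exact word order given, so it would need to be combined with something that enumerates the other orderings. The variable names here are illustrative.

```php
<?php
$text = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ipsum lorem dolor Curabitur ac risus nunc. Dolor ipsum lorem.';

// substr_count is case-sensitive, so lowercase the text first.
$exact = substr_count(strtolower($text), 'lorem ipsum dolor');

echo $exact, "\n"; // 1: only the first sentence uses this exact order
```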

Answer 4

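Note: this can also be used with "Lorem ipsum dolor dolor".

Good evening, everyone. I came up with another technique. It takes a different approach from what Mark Baker did, which I much appreciate. Also, see the memory usage section further down.

In short, take the base string (lorem ipsum dolor) and shuffle it into all of its possible orderings (3! = 6 in this case).

All 6 of these strings are then added to an array that is used for matching with substr_count. I am also using shuffle(), in_array and array_push.

The code is self-explanatory, and if you are curious, here is my IDEONE. Here is Mark Baker's solution on IDEONE. Both take the same time and memory, and my solution is 4 lines shorter, if not more elegant :)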

<?php

    $string = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ipsum lorem dolor Curabitur ac risus nunc. Dolor ipsum lorem.';

//convert main string to lowercase to have an even playing field
    $string2 = strtolower($string);
    $substring = 'lorem ipsum dolor';

//add the first lorem ipsum dolor to launch the array
    $arr = array($substring);

//shuffle repeatedly to collect the possible orderings, i.e. 6 (factorial of 3)
//note: random shuffling is not guaranteed to produce every ordering in 21 attempts
    for ($i = 0; $i <= 20; $i++) {
        $wordArray = explode(" ", $substring);
        shuffle($wordArray);
        $randString = implode(" ", $wordArray);

//push the shuffled string only if it isn't in the array yet
        if (!in_array($randString, $arr)) {
            array_push($arr, $randString);
        }

    }

//var_dump($arr);

//count the occurrences of each collected ordering in the lowercased text
    $sum = 0;
    $n = sizeof($arr);
    for ($q = 0; $q < $n; $q++) {
        $sum += substr_count($string2, $arr[$q]);
    }

    echo "Total occurrences: ".$sum;

?>
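Because the loop above relies on random shuffles, it can occasionally miss one of the orderings. A deterministic variant might look like the sketch below. `permutations()` and `countPhraseAnyOrder()` are hypothetical helper names introduced here for illustration and are not part of the answer.

```php
<?php
// Hypothetical helper: returns every ordering of $words as a list of arrays.
function permutations(array $words): array
{
    if (count($words) <= 1) {
        return [$words];
    }
    $result = [];
    foreach ($words as $i => $word) {
        // Remove the chosen word and permute the remainder.
        $rest = $words;
        unset($rest[$i]);
        foreach (permutations(array_values($rest)) as $tail) {
            $result[] = array_merge([$word], $tail);
        }
    }
    return $result;
}

// Hypothetical helper: counts the phrase in any word order, case-insensitively.
function countPhraseAnyOrder(string $text, string $phrase): int
{
    $haystack = strtolower($text);
    $phrases = [];
    foreach (permutations(explode(' ', strtolower($phrase))) as $perm) {
        $phrases[] = implode(' ', $perm);
    }
    $total = 0;
    // array_unique drops duplicate orderings caused by repeated words.
    foreach (array_unique($phrases) as $candidate) {
        $total += substr_count($haystack, $candidate);
    }
    return $total;
}

$string = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ipsum lorem dolor Curabitur ac risus nunc. Dolor ipsum lorem.';
echo countPhraseAnyOrder($string, 'lorem ipsum dolor'), "\n"; // 3
```

The number of orderings grows factorially with the phrase length, which stays manageable for the 3-4 word phrases mentioned in the question.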

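Memory usage

As the comparison shows, Mark's code edges mine out (by +2), but given the nature of the program and the data complexity involved, the difference is negligible. With a more complex program the difference could obviously be significant, but that is simply how it is.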
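For anyone wanting to reproduce a comparison like this locally, a rough sketch using PHP's built-in `microtime` and `memory_get_peak_usage` could look like the following. This is an assumption about how one might measure it, not how the figures above were obtained, and the exact-order `substr_count` call merely stands in for whichever solution is being measured.

```php
<?php
$string = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ipsum lorem dolor Curabitur ac risus nunc. Dolor ipsum lorem.';

$start = microtime(true);

// Code under test goes here; a plain exact-order count is used as a stand-in.
$count = substr_count(strtolower($string), 'lorem ipsum dolor');

$elapsed = microtime(true) - $start; // elapsed wall-clock time in seconds
$peak = memory_get_peak_usage();     // peak memory of the script so far, in bytes

printf("count=%d time=%.6fs peak=%d bytes\n", $count, $elapsed, $peak);
```

For inputs this small, run-to-run noise is likely to outweigh any real difference between the two solutions.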


Counting occurrences of words in a text
