Count the occurrences of words in a text

Time: 2014-01-06 18:29:08

Tags: php regex algorithm

I have a text in which I want to count the occurrences of the phrase "lorem ipsum dolor".

  

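Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ipsum lorem dolor Curabitur ac risus nunc. Dolor ipsum lorem

The algorithm should count an occurrence even when the words of the search phrase are written in a different order. I have highlighted the expected results. Is there a better way to achieve this than using a regular expression?

In this case, the result should be 3: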

  • Lorem ipsum dolor
  • Ipsum lorem dolor
  • Dolor ipsum lorem

The phrase will have about 3-4 words, and the string will be the content of a web page.

4 answers:

Answer 0: (score: 2)

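You can try a regular expression:

/(?:(?:(?:lorem|ipsum|dolor)\s?)+)/i

Use preg_match_all (it already searches globally, so PHP needs no g flag) and then count the matches. With your example you should get 3 matches.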
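As a minimal sketch of that suggestion, the snippet below runs the pattern through preg_match_all and prints the number of matches. The variable names are illustrative, and the sample text is the one used in the other answers of this thread.

```php
<?php
// Sample text taken from the answers in this thread.
$text = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ipsum lorem dolor Curabitur ac risus nunc. Dolor ipsum lorem.';

// preg_match_all() already finds every match, so only the "i"
// (case-insensitive) modifier is needed; PHP has no "g" modifier.
$pattern = '/(?:(?:(?:lorem|ipsum|dolor)\s?)+)/i';

// Returns the number of full-pattern matches, or false on error.
$regexCount = preg_match_all($pattern, $text, $matches);

echo $regexCount; // 3 for this sample
```

Note that this pattern counts any run of the three words, so a lone "lorem", a repeated "lorem lorem", or the "dolor" inside a longer word would each count as a match as well. It fits the sample text, but it is not a strict check for the full phrase.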


I am not great at algorithms, nor at PHP, but here is an attempt...

<?php

$string = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ipsum lorem dolor Curabitur ac risus nunc. Dolor ipsum lorem.';

$lower_string = strtolower($string);

$text = array('lorem', 'ipsum', 'dolor');

$perms = AllPermutations($text);
$result = 0;
foreach ($perms as $piece) {
    $phrase = join(' ', $piece);
    $result += substr_count($lower_string, $phrase);
}

# From http://stackoverflow.com/a/12749950/1578604
function AllPermutations($InArray, $InProcessedArray = array())
{
    $ReturnArray = array();
    foreach($InArray as $Key=>$value)
    {
        $CopyArray = $InProcessedArray;
        $CopyArray[$Key] = $value;
        $TempArray = array_diff_key($InArray, $CopyArray);
        if (count($TempArray) == 0)
        {
            $ReturnArray[] = $CopyArray;
        }
        else
        {
            $ReturnArray = array_merge($ReturnArray, AllPermutations($TempArray, $CopyArray));
        }
    }
    return $ReturnArray;
}

echo $result;
?>


Answer 1: (score: 2)

$haystack = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ipsum lorem dolor Curabitur ac risus nunc. Dolor ipsum lorem.';
$needle = 'Lorem ipsum dolor';

$hayWords = str_word_count(
    strtolower($haystack), 
    1
);
$needleWords = str_word_count(
    strtolower($needle), 
    1
);
$needleWordsCount = count($needleWords);

$foundWords = array_intersect(
    $hayWords, 
    $needleWords
);

$count = array_reduce(
    array_keys($foundWords),
    function($counter, $item) use ($foundWords, $needleWordsCount) {
        for($i = $item; $i < $item + $needleWordsCount; ++$i) {
            if (!isset($foundWords[$i]))
                return $counter;
        }
        return ++$counter;
    },
    0
);

var_dump($count);
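
The code above counts every position where the next N words all belong to the needle, so a window such as "lorem lorem lorem" would also count. As a hedged variation on the same sliding-window idea (not part of the original answer), the sketch below sorts each window and compares it with the sorted needle, so a window counts only when it holds exactly the needle's words. The function name countPhraseAnyOrder is hypothetical.

```php
<?php
// Hypothetical helper, not from the original answer: counts windows of
// consecutive words that are a reordering of the needle's words.
function countPhraseAnyOrder(string $haystack, string $needle): int
{
    $hayWords = str_word_count(strtolower($haystack), 1);
    $needleWords = str_word_count(strtolower($needle), 1);
    $size = count($needleWords);
    if ($size === 0) {
        return 0;
    }
    sort($needleWords);

    $count = 0;
    $limit = count($hayWords) - $size;
    for ($i = 0; $i <= $limit; ++$i) {
        // Take $size consecutive words and compare them order-independently.
        $window = array_slice($hayWords, $i, $size);
        sort($window);
        if ($window === $needleWords) {
            ++$count;
        }
    }
    return $count;
}

$sample = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ipsum lorem dolor Curabitur ac risus nunc. Dolor ipsum lorem.';
echo countPhraseAnyOrder($sample, 'Lorem ipsum dolor'); // 3
```

Overlapping windows are counted separately here, so "lorem ipsum dolor lorem" yields 2. Whether that is desirable depends on the use case.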

Answer 2: (score: 1)

I think you are looking for: http://nl1.php.net/substr_count

$text = 'This is a test';
echo strlen($text); // 14

echo substr_count($text, 'is'); // 2

// the string is reduced to 's is a test', so it prints 1
echo substr_count($text, 'is', 3);

// the text is reduced to 's i', so it prints 0
echo substr_count($text, 'is', 3, 3);

// generates a warning because 5+10 > 14 (PHP 8.0+ throws a ValueError instead)
echo substr_count($text, 'is', 5, 10);


// prints only 1, because it doesn't count overlapped substrings
$text2 = 'gcdgcdgcd';
echo substr_count($text2, 'gcdgcd');
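
Applied to the question's text, substr_count on its own only finds the phrase in the exact word order given. A small sketch, using the sample text from the other answers and an illustrative variable name:

```php
<?php
// Lowercase the text first, because substr_count() is case-sensitive.
$lowerText = strtolower('Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ipsum lorem dolor Curabitur ac risus nunc. Dolor ipsum lorem.');

// Only the first of the three expected occurrences uses this word order.
$fixedOrderCount = substr_count($lowerText, 'lorem ipsum dolor');

echo $fixedOrderCount; // 1
```

To reach the expected result of 3, the call has to be repeated for every ordering of the words and the counts added up.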

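Answer 3: (score: 1)

Note: this can also be used with "Lorem ipsum dolor dolor".

Good evening, everyone. I came up with another technique. It takes a different approach from what Mark Baker did, which I very much appreciate. Also, please see the memory usage section.

In short, take the base string (lorem ipsum dolor) and shuffle it into all possible orderings (in this case 3! = 6).

Then all 6 of these strings are added to an array, which is used to do the matching with substr_count. I also use shuffle(), in_array and array_push.

The code is self-explanatory; if you are curious, here is my IDEONE. Here is Mark Baker's solution on IDEONE. Both take the same time and memory, and my solution is 4 lines shorter, if not more elegant :)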

<?php

    $string = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ipsum lorem dolor Curabitur ac risus nunc. Dolor ipsum lorem.';

//convert main string to lowercase to have an even playing field
    $string2 = strtolower($string);
    $substring = 'lorem ipsum dolor';

//add the first lorem ipsum dolor to launch the array 
    $arr = array($substring);

//shuffle a fixed number of times (21) to collect the possible orderings, i.e. 6 (factorial of 3);
//this is random, so a run can occasionally miss an ordering
    for ($i=0; $i<=20; $i++) {
        $wordArray = explode(" ",$substring);
        shuffle($wordArray);
        $randString= implode(" ",$wordArray);

//if random string isn't in the array, then only you push the new value
        if (! (in_array($randString,$arr)) ) {
            array_push($arr,$randString);
        }

    }

//var_dump($arr);

//here, we do the matching: sum the counts for every ordering collected above
    $sum = 0;
    $n = sizeof($arr);
    for ($q=0; $q<$n; $q++) {
        $sum += substr_count($string2,$arr[$q]);
    }

    echo "Total occurrences: ".$sum;

?>
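
Because shuffle() is random, a fixed number of shuffles does not guarantee that all 6 orderings are collected. As a hedged alternative (a sketch, not the author's code), the orderings can be enumerated deterministically and then passed to substr_count as before. The function name phrasePermutations is hypothetical.

```php
<?php
// Hypothetical helper: returns every distinct ordering of $words as a
// space-separated string. Repeated words are handled by array_unique().
function phrasePermutations(array $words): array
{
    if (count($words) <= 1) {
        return array(implode(' ', $words));
    }
    $result = array();
    foreach ($words as $i => $word) {
        $rest = $words;
        unset($rest[$i]);
        // Put $word first, then append every ordering of the remaining words.
        foreach (phrasePermutations(array_values($rest)) as $tail) {
            $result[] = $word . ' ' . $tail;
        }
    }
    return array_values(array_unique($result));
}

$lower = strtolower('Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ipsum lorem dolor Curabitur ac risus nunc. Dolor ipsum lorem.');
$phrases = phrasePermutations(explode(' ', 'lorem ipsum dolor'));

$total = 0;
foreach ($phrases as $phrase) {
    $total += substr_count($lower, $phrase);
}

echo count($phrases) . ' orderings, ' . $total . ' occurrences'; // 6 orderings, 3 occurrences
```

The number of orderings grows factorially with the phrase length, which is acceptable for the 3-4 words mentioned in the question but becomes costly for longer phrases.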

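Memory usage

As you can see, Mark's code edges mine out by 2, but given the nature of this program and the data complexity involved, the difference is negligible. With a more complex program the difference could obviously be significant, but that is the nature of it.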

[Image: memory usage]