PHP: Find repeated words in text, with and without spaces

Time: 2011-06-08 15:22:43

Tags: php text find words

I can find the repeated words in a text with this function:

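$str = 'bob is a good person. mary is a good person. who is the best? are you a good person? bob is the best?';
    function repeated($str)
    {
        $str=trim($str);
        // ereg_replace() was removed in PHP 7; preg_replace() does the same job here.
        $str=preg_replace('/\s+/', ' ',$str);
        $words=explode(' ',$str);
        $wordstats=array(); // word => number of occurrences
        foreach($words as $w)
        {
            // Initialise the counter first to avoid an undefined-key warning.
            if(!isset($wordstats[$w])) $wordstats[$w]=0;
            $wordstats[$w]++;
        }
        foreach($wordstats as $k=>$v)
        {
            if($v>=2)
            {
                print "$k"." , ";
            }
        }
    }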
This is my result:

bob , good , person , is , a , the , best?

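Q: How can I get both the repeated single words and the repeated multi-word phrases (words separated by spaces), so that the result looks like this: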

bob , good , person , is , a , the , best? , good person , is a , a good , is the , bob is

2 Answers:

Answer 0: (score: 3)

<?php
$str = 'bob is a good person. mary is a good person. who is the best? are you a good person? bob is the best?';

//all words:
$found = str_word_count(strtolower($str),1);
//get all words that occur more than once
$counts = array_count_values($found);
$repeated = array_keys(array_filter($counts,function($a){return $a > 1;}));
//begin results with the groups of 1 word.
$results = $repeated;
while($word = array_shift($found)){
    if(!in_array($word,$repeated)) continue;
    $additions = array();
    while($add = array_shift($found)){
        if(!in_array($add,$repeated)) break;
        $additions[] = $add;
        $count = preg_match_all('/'.preg_quote($word).'\W+'.implode('\W+',$additions).'/si',$str,$matches);
        if($count > 1){
            $newmatch = $word.' '.implode(' ',$additions);
            if(!in_array($newmatch,$results)) $results[] = $newmatch;
        } else {
            break;
        }
    }
    if(!empty($additions)) array_splice($found,0,0,$additions);
}
var_dump($results);

Yields:

array(17) {
  [0]=>
  string(3) "bob"
  [1]=>
  string(2) "is"
  [2]=>
  string(1) "a"
  [3]=>
  string(4) "good"
  [4]=>
  string(6) "person"
  [5]=>
  string(3) "the"
  [6]=>
  string(4) "best"
  [7]=>
  string(6) "bob is"
  [8]=>
  string(4) "is a"
  [9]=>
  string(9) "is a good"
  [10]=>
  string(16) "is a good person"
  [11]=>
  string(6) "a good"
  [12]=>
  string(13) "a good person"
  [13]=>
  string(11) "good person"
  [14]=>
  string(6) "is the"
  [15]=>
  string(11) "is the best"
  [16]=>
  string(8) "the best"
}
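
The first three statements of the code above do the single-word work: split the text into words, count them, and keep the ones seen more than once. As a minimal sketch on a shorter, made-up sentence (the helper name `repeated_words` is hypothetical and not part of the answer), this is what that stage computes:

```php
<?php
// Hypothetical helper, for illustration only: returns the words
// that occur more than once in $text.
function repeated_words(string $text): array
{
    // Format 1 makes str_word_count() return the words as a list;
    // punctuation such as "." and "?" is dropped.
    $words = str_word_count(strtolower($text), 1);
    // Map each word to the number of times it occurs.
    $counts = array_count_values($words);
    // Keep only the words seen more than once, in order of first appearance.
    return array_keys(array_filter($counts, function ($n) { return $n > 1; }));
}

print_r(repeated_words('Bob is good. Mary is good too.'));
// Array ( [0] => is [1] => good )
```

Because `str_word_count()` strips punctuation, "best?" and "person." are counted as "best" and "person", which is why the output above lists "best" rather than "best?".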

Answer 1: (score: 2)

Can't you just add the two-word pairs to the $wordstats array as well?

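$str = 'bob is a good person. mary is a good person. who is the best? are you a good person? bob is the best?';
function repeated($str)
{
    $str=trim($str);
    // ereg_replace() was removed in PHP 7; preg_replace() does the same job here.
    $str=preg_replace('/\s+/', ' ',$str);
    $words=explode(' ',$str);
    $wordstats=array(); // word or word pair => number of occurrences
    $lastWord = '';
    foreach($words as $w)
    {
        // Initialise each counter first to avoid an undefined-key warning.
        if(!isset($wordstats[$w])) $wordstats[$w]=0;
        $wordstats[$w]++;
        //skip the first loop because that is the only time it should be blank.
        if($lastWord!=''){
            $pair = $lastWord.' '.$w;
            if(!isset($wordstats[$pair])) $wordstats[$pair]=0;
            $wordstats[$pair]++;
        }
        $lastWord = $w;
    }
    foreach($wordstats as $k=>$v)
    {
        if($v>=2)
        {
            print "$k"." , ";
        }
    }
}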

I haven't tested this, but it should work, since it uses the same technique you used.
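
The word-pair idea can be generalised to phrases of any length with a sliding window. The following is only a sketch under a few assumptions that are not in the answer: the function name `repeated_phrases` and its `$maxLen` parameter are hypothetical, punctuation is stripped with `str_word_count()` instead of splitting on spaces, and phrases are allowed to run across sentence boundaries.

```php
<?php
// Hypothetical helper, for illustration only: returns every run of
// 1..$maxLen consecutive words that occurs at least twice in $text.
function repeated_phrases(string $text, int $maxLen = 2): array
{
    // Lowercase and split into words; punctuation is dropped.
    $words = str_word_count(strtolower($text), 1);
    $total = count($words);
    $counts = [];
    for ($len = 1; $len <= $maxLen; $len++) {
        // Slide a window of $len words across the word list.
        for ($i = 0; $i + $len <= $total; $i++) {
            $phrase = implode(' ', array_slice($words, $i, $len));
            $counts[$phrase] = ($counts[$phrase] ?? 0) + 1;
        }
    }
    // Keep the phrases seen at least twice, in order of first appearance.
    $repeated = array_filter($counts, function ($n) { return $n >= 2; });
    return array_map('strval', array_keys($repeated));
}

$str = 'bob is a good person. mary is a good person. who is the best? are you a good person? bob is the best?';
echo implode(' , ', repeated_phrases($str));
// bob , is , a , good , person , the , best , bob is , is a , a good , good person , is the , the best
```

Calling `repeated_phrases($str, 4)` would also count three- and four-word phrases. Since the window ignores sentence ends, a pair such as "person mary" is counted too; it just does not repeat in this text.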