Counting phrases in a string in PHP

Time: 2017-03-15 16:26:42

Tags: php

I'm currently scraping HTML pages and counting the individual words on each page like this:

$page_content = file_get_html($url)->plaintext;

$word_array = array_count_values(str_word_count(strip_tags(strtolower($page_content)), 1));

This works well for counting single words.

But I'm trying to count phrases of up to about 3 words.

For example:

$string = 'the best stack post';

The count would return:

the = 1
best = 1
stack = 1
post = 1

I need to extract phrases from the string, so the three-word phrases in that string would be:

the best stack = 1
best stack post = 1

I hope that makes sense!

I've searched, but couldn't find any way to do this in PHP.

Any ideas?

3 answers:

Answer 0 (score: 0)

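What I would do is take the page content and strip the HTML tags. Then split the text on a typical phrase separator, such as the period (.). Now you have an array of individual phrases, and you can count the words in each one: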

$page_content = file_get_html($url)->plaintext;
$text = strip_tags(strtolower($page_content));

$phrases = explode(".", $text);

$count = 0;
foreach ($phrases as $phrase) {
    if (str_word_count($phrase) >= 3) {
        $count++;
    }
}
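
To make concrete what this loop counts, here is a minimal self-contained run on a hard-coded string (an assumed sample standing in for the scraped page). As far as I can tell, it tallies sentences that contain at least three words, not the individual three-word phrases:

```php
<?php
// Assumed sample text standing in for the scraped page content.
$text = strtolower('The best stack post. Hello world. One two three.');

$count = 0;
foreach (explode('.', $text) as $phrase) {
    // Without a format argument, str_word_count() returns the number of words.
    if (str_word_count($phrase) >= 3) {
        $count++;
    }
}

// Number of sentences with at least three words.
echo $count, "\n";
```

Here "Hello world" is skipped because it only has two words, and the empty piece after the final period is skipped as well.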

Answer 1 (score: 0)

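This solution has two steps.

  1. A helper function extracts all three-word phrases from a string (ignoring any periods).
  2. The main code applies that helper to each sentence (sentences are terminated by a period).

Here is the code: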

    function threeWords($string) {
          // Split on runs of non-word characters and drop empty pieces.
          // Probably not ideal, since it counts "non-hyphenated" as 2 words.
          $words = preg_split('/\W+/', $string, -1, PREG_SPLIT_NO_EMPTY);
          if (count($words) < 3) { return []; }
          $phrases = [];
          for ($i = 2;$i < count($words);$i++) {
               $phrases[] = $words[$i-2]." ".$words[$i-1]." ".$words[$i];
          }
          return $phrases;
    }
    
    $page_content = file_get_html($url)->plaintext;
    $text = strip_tags(strtolower($page_content));
    $sentences = explode(".",$text);
    $phrases = [];
    foreach ($sentences as $sentence) {
       $phrases = array_merge($phrases,threeWords(trim($sentence)));
    }
    $count = array_count_values($phrases);
    print_r($count);
    

Answer 2 (score: 0)

// Collect phrase => count pairs here
$phrases = [];

// Split the string into sentences on the appropriate punctuation marks
// and loop over the sentences
foreach (preg_split('/[?.!]/', $string) as $sentence) {

    // split the sentences into words (remove any empty strings with array_filter)
    $words = array_filter(explode(' ', $sentence));

    // take the first set of three words from the sentence, then remove the first word,
    // until the sentence is gone.
    while ($words) {
        $phrase = array_slice($words, 0, 3);

        // check that the phrase is the correct length            
        if (count($phrase) == 3) {

            // convert it back to a string
            $phrase = implode(' ', $phrase);

            // increment the count for that phrase in your result
            if (!isset($phrases[$phrase])) $phrases[$phrase] = 0;
            $phrases[$phrase]++;
        }
        array_shift($words);
    }
}
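
For a quick check against the example in the question, the loop can be wrapped in a function and run on that string. This is just a sketch: the function name `countThreeWordPhrases` is my own invention, and the body is the same loop as above.

```php
<?php
// Hypothetical wrapper around the sliding-window loop shown above.
function countThreeWordPhrases(string $string): array
{
    $phrases = [];
    foreach (preg_split('/[?.!]/', $string) as $sentence) {
        // Note: array_filter() also drops the word "0", which is falsy in PHP.
        $words = array_filter(explode(' ', $sentence));
        while ($words) {
            $phrase = array_slice($words, 0, 3);
            if (count($phrase) == 3) {
                $phrase = implode(' ', $phrase);
                if (!isset($phrases[$phrase])) $phrases[$phrase] = 0;
                $phrases[$phrase]++;
            }
            // Slide the window forward by one word.
            array_shift($words);
        }
    }
    return $phrases;
}

// Sample string from the question.
$result = countThreeWordPhrases('the best stack post');
foreach ($result as $phrase => $n) {
    echo "$phrase = $n\n";
}
// the best stack = 1
// best stack post = 1
```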