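Extract common words between two paragraphs?

Time: 2010-03-22 17:12:32

Tags: php string

How can I extract the common words between two or more paragraphs in PHP 5? I would like to summarize each text, perhaps by building a list of its highest-ranked words, and then compare the lists.

5 answers: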

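Answer 0: (score: 5)

I suppose the most basic approach would be:

  • Split each paragraph into an array of words, using explode or preg_split
    • The first is probably a bit faster
    • The second gives you more options
  • Possibly apply some filtering to the word lists
    • Clean up each word
      • Remove special characters, such as accented letters
      • Convert everything to upper or lower case, to make the later comparison easier
    • Remove words that are too common
    • Remove words that are too short
    • array_filter may help here
  • Then use something like array_intersect to get the list of words that appear in both arrays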
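
The steps above could be sketched roughly as follows. This is only an illustration: the helper name `extractWords`, the stop-word list, and the minimum length of 3 are assumptions, not part of the answer, and accent handling is left out (the sketch only keeps the letters a-z).

```php
<?php
// Hypothetical helper: split a paragraph into cleaned, filtered, distinct words.
function extractWords($paragraph, array $stopWords, $minLength = 3) {
    // Lowercase, then split on any run of characters other than a-z;
    // PREG_SPLIT_NO_EMPTY drops the empty pieces.
    $words = preg_split('/[^a-z]+/', strtolower($paragraph), -1, PREG_SPLIT_NO_EMPTY);

    // Keep only words that are long enough and not in the stop list.
    $words = array_filter($words, function ($word) use ($stopWords, $minLength) {
        return strlen($word) >= $minLength && !in_array($word, $stopWords, true);
    });

    // Each word once, re-indexed from 0.
    return array_values(array_unique($words));
}

$stopWords = array('the', 'and', 'for', 'with');   // illustrative list only

$a = extractWords('The quick brown fox jumps over the lazy dog.', $stopWords);
$b = extractWords('A lazy dog sleeps, and the quick cat watches.', $stopWords);

// Words present in both paragraphs, in the order they occur in $a.
$common = array_values(array_intersect($a, $b));
print_r($common);   // quick, lazy, dog
```

array_intersect keeps the keys of its first argument, so array_values is used to get a clean 0-based list.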

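Answer 1: (score: 4)

There is probably a faster way, but you could strip out punctuation such as ! ? - . / \ @ $ % ^ & *, then explode both paragraphs into arrays, and then try array_intersect() on the two arrays. Anything in array 1 that is also in array 2 should be returned as a match.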

http://php.net/manual/en/function.array-intersect.php

In theory you should get back an array of the matching words. From there, the ranking is up to you and however you choose to do it.
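
A minimal sketch of that idea, assuming a fixed punctuation list and a made-up helper name `splitWords` (neither comes from the answer itself):

```php
<?php
// Hypothetical helper: strip punctuation, then split on spaces.
function splitWords($text) {
    $punctuation = array('!', '?', '-', '.', ',', '/', '\\', '@', '$', '%', '^', '&', '*');
    $clean = strtolower(str_replace($punctuation, ' ', $text));
    // explode() yields empty strings for repeated spaces; array_filter drops them.
    return array_filter(explode(' ', $clean), 'strlen');
}

$first  = splitWords('Red apples, green pears!');
$second = splitWords('Green apples & yellow plums.');

// Values of $first that also occur in $second.
$matches = array_values(array_intersect($first, $second));
print_r($matches);   // apples, green
```

Lowercasing is an extra step added here so that "Green" and "green" match; array_intersect itself compares the strings exactly.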

Answer 2: (score: 2)

Something like this might work...

<?php
  $paragraph = "hello this is some sample text. Sample text is usually used to test a program. For example, this sample text will be used to test the script below.";
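  $words = array();
  // Collect every run of word characters (letters, digits, underscore).
  preg_match_all('/\w+/', $paragraph, $matches);
  foreach($matches[0] as $w){
    // Lowercase so that "Sample" and "sample" are counted together.
    $w = strtolower($w);
    if(!array_key_exists($w, $words)){
      $words[$w] = 0;
    }
    $words[$w]++;
  }
  // Sort by count, ascending, keeping the word keys.
  asort($words);
  print_r($words);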

  /* Output (words with equal counts may come out in a different order
     on PHP versions before 8.0, where sorting is not stable)
  Array (
      [hello] => 1
      [some] => 1
      [usually] => 1
      [a] => 1
      [program] => 1
      [for] => 1
      [example] => 1
      [will] => 1
      [be] => 1
      [the] => 1
      [script] => 1
      [below] => 1
      [this] => 2
      [is] => 2
      [used] => 2
      [to] => 2
      [test] => 2
      [sample] => 3
      [text] => 3
  ) */

?>
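
To compare two paragraphs rather than count one, the same counting idea could be extended along these lines. This is a sketch only: `wordCounts` is a made-up helper, and scoring a shared word by the sum of its counts in both paragraphs is just one possible ranking, not something the answer prescribes.

```php
<?php
// Hypothetical helper: word => number of occurrences for one paragraph.
function wordCounts($paragraph) {
    preg_match_all('/\w+/', strtolower($paragraph), $matches);
    return array_count_values($matches[0]);
}

$countsA = wordCounts('Sample text is used to test. Sample text again.');
$countsB = wordCounts('This text is a test of the sample text.');

// Keep only the words of A that also appear in B (compared by key).
$shared = array_intersect_key($countsA, $countsB);

// Score each shared word by its total count across both paragraphs.
foreach ($shared as $word => $count) {
    $shared[$word] = $count + $countsB[$word];
}

arsort($shared);   // highest score first
print_r($shared);  // text => 4, sample => 3, then "is" and "test" with 2 each
```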

Answer 3: (score: 2)

<?php
/**
 * Gets the distinct words of a given block of text as an array
 *
 * @param string $paragraph The paragraph in question
 * @return string[] Words found
 */
function getWords($paragraph) {
   //only lowercase
   $paragraph = strtolower($paragraph);
   //replace everything except the letters a-z (digits, punctuation, line breaks)
   //with spaces, so that periods won't get attached to our words
   $paragraph = preg_replace("/[^a-z]/", " ", $paragraph);
   $paragraph = explode(" ", $paragraph);
   //flipping turns the words into keys, which removes duplicates and lets us
   //drop the empty word; flipping back restores a list of words
   $paragraph = array_flip($paragraph);
   unset($paragraph[""]);
   $paragraph = array_flip($paragraph);
   return $paragraph;
}

$paragraph1 = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Quisque sit amet ante
nisl. Morbi tempor varius semper. Suspendisse vel nisi dui. Sed tristique consectetur imperdiet.
Morbi nulla diam, lobortis non eleifend eget, ullamcorper nec tortor. Duis quis lectus felis.
In vulputate varius luctus. Maecenas gravida laoreet massa quis faucibus. Duis dictum, dui sit
amet pharetra laoreet, tortor nisi mattis tortor, et ornare purus dolor vitae ligula. Sed id
orci ut dolor fermentum imperdiet. Nulla non justo urna, in suscipit nunc. Donec ut nibh risus,
ut tempus mi. Proin fringilla pretium urna sed faucibus. Proin et porttitor sem. Nulla eros
arcu, sodales et aliquam in, pharetra et mauris. Duis placerat blandit justo at tincidunt.
Etiam eu rutrum arcu.";

$paragraph2 = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam sit amet leo id
arcu feugiat tempus quis a risus. Proin non nisi augue. Cras ultricies dignissim augue vel gravida.
Vivamus sed orci sed leo sollicitudin aliquet non at dui. Nulla facilisi. Suspendisse nunc nibh,
sollicitudin vitae tincidunt eget, aliquet vitae magna. Aliquam vehicula cursus ante, vitae rhoncus
orci egestas et. Fusce condimentum metus at metus auctor pellentesque. Suspendisse potenti. Morbi
blandit, leo sed eleifend pretium, augue dui interdum eros, vel faucibus felis dolor id elit. Nam
condimentum, odio at mattis consequat, sem eros molestie risus, a tempus dolor arcu sit amet justo.";

$common = array_intersect(getWords($paragraph1), getWords($paragraph2));
sort($common);
var_dump($common);
?>

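Answer 4: (score: -1)

  1. Split each paragraph on spaces
  2. Pick a token from paragraph A; if it also occurs in paragraph B, put it in a "matches" array.
  3. Repeat step 2 until there are no more tokens left in paragraph A.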
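
Written out literally, those three steps might look like the sketch below. The sample paragraphs are made up, and the check that skips tokens already recorded is an addition so that each match is listed only once.

```php
<?php
$paragraphA = 'the cat sat on the mat';
$paragraphB = 'the dog sat on the log';

// Step 1: split each paragraph on spaces.
$tokensA = explode(' ', $paragraphA);
$tokensB = explode(' ', $paragraphB);

// Steps 2 and 3: check every token of A against B.
$matches = array();
foreach ($tokensA as $token) {
    // Record the token if B contains it and it has not been recorded yet.
    if (in_array($token, $tokensB, true) && !in_array($token, $matches, true)) {
        $matches[] = $token;
    }
}
print_r($matches);   // the, sat, on
```

Each in_array call scans a whole array, so this does far more comparisons than necessary on long texts; array_intersect gives the same matches in a single call.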