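How can I extract the words that two or more paragraphs have in common in PHP 5? I was thinking of summarizing each text, perhaps by building a list of its highest-ranked words, and then comparing those lists.
Answer 0 (score: 5)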
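I suppose the most basic approach would be:
explode or preg_split, to split each paragraph into an array of words
array_filter, to drop the entries you do not want (empty strings, for example)
and, here, array_intersect might help to find the words the arrays share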
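The three steps above could be sketched roughly as follows. This is only an illustration: the helper names `splitWords` and `isInteresting`, the sample sentences, and the "longer than three characters" rule are assumptions made for the example, not part of the answer.

```php
<?php
// Hypothetical helper: split a paragraph into lowercase words.
function splitWords($text) {
    // Split on runs of non-word characters and skip empty pieces.
    return preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

// Hypothetical filter rule: keep only words longer than three characters.
function isInteresting($word) {
    return strlen($word) > 3;
}

$a = array_filter(splitWords("The quick brown fox jumps over the lazy dog."), 'isInteresting');
$b = array_filter(splitWords("A lazy cat watches the quick mouse."), 'isInteresting');

// array_intersect keeps the values of $a that also occur in $b and preserves
// their keys, so array_unique + array_values tidy the result into a plain list.
$common = array_values(array_unique(array_intersect($a, $b)));

print_r($common); // quick, lazy
```

Named callbacks are used instead of closures so the sketch also runs on older PHP 5 releases.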
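Answer 1 (score: 4)
There may be a faster way, but you could strip out punctuation such as ! ? - . / \ @ $ % ^ & * , then explode both paragraphs into arrays and try array_intersect() on the two arrays. Anything in array 1 that is also in array 2 should be returned as a match.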
http://php.net/manual/en/function.array-intersect.php
In theory you should get back an array of the matching words. From there, the ranking is up to you and however you choose to do it.
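One possible way to do the ranking, as a rough sketch: count how often each word occurs in each paragraph and score the shared words by their combined count. The helper name `wordsWithoutPunctuation`, the sample sentences, and the scoring rule are assumptions for illustration. The sketch also splits on whitespace with preg_split rather than explode, so that repeated spaces do not produce empty entries.

```php
<?php
// Hypothetical helper: replace some punctuation with spaces, lowercase,
// and split on whitespace.
function wordsWithoutPunctuation($text) {
    $punctuation = array('!', '?', '-', '.', ',', '/', '\\', '@', '#', '$', '%', '^', '&', '*');
    $clean = str_replace($punctuation, ' ', strtolower($text));
    return preg_split('/\s+/', $clean, -1, PREG_SPLIT_NO_EMPTY);
}

$first  = wordsWithoutPunctuation("Apples are sweet. Apples are red, and apples keep well.");
$second = wordsWithoutPunctuation("Red apples? Sweet pears! Apples again.");

// How often each word occurs in each paragraph.
$countsFirst  = array_count_values($first);
$countsSecond = array_count_values($second);

// Words present in both paragraphs, scored by their combined frequency.
$ranked = array();
foreach (array_intersect_key($countsFirst, $countsSecond) as $word => $count) {
    $ranked[$word] = $count + $countsSecond[$word];
}
arsort($ranked); // highest score first, word => score association kept

print_r($ranked);
```

Here "apples" comes out on top with a score of 5, while "sweet" and "red" tie at 2. Other scoring rules (the minimum of the two counts, for instance) would fit the same loop.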
Answer 2 (score: 2)
Something like this might work...
<?php
$paragraph = "hello this is some sample text. Sample text is usually used to test a program. For example, this sample text will be used to test the script below.";
$words = array();
// Collect every run of word characters (letters, digits, underscore).
preg_match_all('/\w+/', $paragraph, $matches);
foreach ($matches[0] as $w) {
    // Count case-insensitively, so "Sample" and "sample" are the same word.
    $w = strtolower($w);
    if (!array_key_exists($w, $words)) {
        $words[$w] = 0;
    }
    $words[$w]++;
}
// Sort by count, ascending, keeping the word => count association.
asort($words);
print_r($words);
/* Output (the order of words with equal counts may vary between PHP versions)
Array (
[hello] => 1
[will] => 1
[example] => 1
[a] => 1
[program] => 1
[usually] => 1
[script] => 1
[below] => 1
[some] => 1
[the] => 1
[be] => 1
[for] => 1
[to] => 2
[is] => 2
[test] => 2
[used] => 2
[this] => 2
[sample] => 3
[text] => 3
) */
?>
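To get from per-paragraph counts to the comparison the question asks about, one option is to keep only the top few words of each text and intersect those lists. The following is a minimal sketch under that assumption; `topWords`, the sample texts, and the cut-off of three are made up for the example.

```php
<?php
// Hypothetical helper: the $limit most frequent words of a text.
function topWords($text, $limit) {
    preg_match_all('/\w+/', strtolower($text), $matches);
    $counts = array_count_values($matches[0]);
    arsort($counts); // most frequent first
    // Keep the first $limit entries (preserving keys) and return the words.
    return array_keys(array_slice($counts, 0, $limit, true));
}

$one = "PHP arrays are flexible. PHP arrays can act as lists or maps, and arrays are everywhere.";
$two = "In PHP, arrays are ordered maps. PHP arrays are also lists, so arrays are handy.";

// Words that rank among the top three of both texts.
$shared = array_intersect(topWords($one, 3), topWords($two, 3));
print_r($shared);
```

In both sample texts the three most frequent words are "arrays", "php" and "are", so all three are reported as shared. When several words tie at the cut-off, which of them makes the list depends on the sort order, so a real script may want an explicit tie-breaking rule.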
Answer 3 (score: 2)
<?php
/**
 * Gets the distinct words of a given block of text as an array
 *
 * @param string $paragraph The paragraph in question
 * @return string[] Distinct lowercase words found
 */
function getWords($paragraph) {
    // lowercase only, so the comparison is case-insensitive
    $paragraph = strtolower($paragraph);
    // split on every run of non-letter characters (this way periods, commas
    // and line breaks won't end up inside our words) and skip empty pieces
    $words = preg_split("/[^a-z]+/", $paragraph, -1, PREG_SPLIT_NO_EMPTY);
    // keep each word only once
    return array_values(array_unique($words));
}
$paragraph1 = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Quisque sit amet ante
nisl. Morbi tempor varius semper. Suspendisse vel nisi dui. Sed tristique consectetur imperdiet.
Morbi nulla diam, lobortis non eleifend eget, ullamcorper nec tortor. Duis quis lectus felis.
In vulputate varius luctus. Maecenas gravida laoreet massa quis faucibus. Duis dictum, dui sit
amet pharetra laoreet, tortor nisi mattis tortor, et ornare purus dolor vitae ligula. Sed id
orci ut dolor fermentum imperdiet. Nulla non justo urna, in suscipit nunc. Donec ut nibh risus,
ut tempus mi. Proin fringilla pretium urna sed faucibus. Proin et porttitor sem. Nulla eros
arcu, sodales et aliquam in, pharetra et mauris. Duis placerat blandit justo at tincidunt.
Etiam eu rutrum arcu.";
$paragraph2 = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam sit amet leo id
arcu feugiat tempus quis a risus. Proin non nisi augue. Cras ultricies dignissim augue vel gravida.
Vivamus sed orci sed leo sollicitudin aliquet non at dui. Nulla facilisi. Suspendisse nunc nibh,
sollicitudin vitae tincidunt eget, aliquet vitae magna. Aliquam vehicula cursus ante, vitae rhoncus
orci egestas et. Fusce condimentum metus at metus auctor pellentesque. Suspendisse potenti. Morbi
blandit, leo sed eleifend pretium, augue dui interdum eros, vel faucibus felis dolor id elit. Nam
condimentum, odio at mattis consequat, sem eros molestie risus, a tempus dolor arcu sit amet justo.";
$common = array_intersect(getWords($paragraph1), getWords($paragraph2));
sort($common);
var_dump($common);
?>
Answer 4 (score: -1)