I am currently scraping HTML pages and counting the individual words on each page with the following:
$page_content = file_get_html($url)->plaintext;
$word_array = array_count_values(str_word_count(strip_tags(strtolower($page_content)), 1));
This works very well for counting single words.
But I am trying to count phrases of up to about 3 words.
For example:
$string = 'the best stack post';
The count would return:
the = 1
best = 1
stack = 1
post = 1
I need to extract phrases from the string, so the possible three-word phrases in this string would be:
the best stack = 1
best stack post = 1
I hope this makes sense!
I have searched but could not find any way to do this in PHP.
Any ideas?
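Answer 0 (score: 0)
What I would do is fetch the page content and strip the HTML tags. Then split the text on a typical phrase delimiter, i.e. the period (.). You then have an array of individual phrases and can count the words in each one: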
$page_content = file_get_html($url)->plaintext;
$text = strip_tags(strtolower($page_content));
$phrases = explode(".", $text);
$count = 0;
foreach ($phrases as $phrase) {
    if (str_word_count($phrase) >= 3) {
        $count++;
    }
}
Answer 1 (score: 0)
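This solution has two steps: split the text into sentences, then split each sentence into overlapping three-word phrases. (It assumes sentences are terminated by a period.) Here is the code: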
function threeWords($string) {
    // Split on non-word characters. Probably not ideal, since it will count "non-hyphenated" as 2 words.
    $words = array_values(array_filter(preg_split("!\W!", $string)));
    if (count($words) < 3) {
        return [];
    }
    $phrases = [];
    for ($i = 2; $i < count($words); $i++) {
        $phrases[] = $words[$i - 2] . " " . $words[$i - 1] . " " . $words[$i];
    }
    return $phrases;
}
$page_content = file_get_html($url)->plaintext;
$text = strip_tags(strtolower($page_content));
$sentences = explode(".",$text);
$phrases = [];
foreach ($sentences as $sentence) {
    $phrases = array_merge($phrases, threeWords(trim($sentence)));
}
$count = array_count_values($phrases);
print_r($count);
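Since the question asks for phrases of up to about 3 words, the same idea could be generalized to any phrase length. The following is only a minimal sketch under some assumptions: `nWordPhrases` is a hypothetical helper name that is not part of the answer above, the sample text stands in for the scraped page content, and sentences are still assumed to end with a period.

```php
<?php
// Hypothetical helper (not from the answer above): returns every run of
// $n consecutive words in $sentence as a space-joined phrase.
function nWordPhrases(string $sentence, int $n): array
{
    // Split on runs of non-word characters and drop empty pieces.
    $words = preg_split('/\W+/', $sentence, -1, PREG_SPLIT_NO_EMPTY);
    $phrases = [];
    for ($i = 0; $i + $n <= count($words); $i++) {
        $phrases[] = implode(' ', array_slice($words, $i, $n));
    }
    return $phrases;
}

// Sample text standing in for the scraped page content (assumption).
$text = strtolower('The best stack post. The best stack answer.');

$phrases = [];
foreach (explode('.', $text) as $sentence) {
    // Collect phrases of one, two, and three words.
    for ($n = 1; $n <= 3; $n++) {
        $phrases = array_merge($phrases, nWordPhrases($sentence, $n));
    }
}

$counts = array_count_values($phrases);
arsort($counts); // most frequent phrases first
print_r($counts);
```

With this sample text, "the best stack" is counted twice and "best stack post" once. Phrases never span a sentence boundary, because each sentence is processed separately.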
Answer 2 (score: 0)
// Holds the count for each three-word phrase
$phrases = [];
// Split the string into sentences on the appropriate punctuation marks
// and loop over the sentences
foreach (preg_split('/[?.!]/', $string) as $sentence) {
    // split the sentences into words (remove any empty strings with array_filter)
    $words = array_filter(explode(' ', $sentence));
    // take the first set of three words from the sentence, then remove the first word,
    // until the sentence is gone.
    while ($words) {
        $phrase = array_slice($words, 0, 3);
        // check that the phrase is the correct length
        if (count($phrase) == 3) {
            // convert it back to a string
            $phrase = implode(' ', $phrase);
            // increment the count for that phrase in your result
            if (!isset($phrases[$phrase])) $phrases[$phrase] = 0;
            $phrases[$phrase]++;
        }
        array_shift($words);
    }
}
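To try the loop above against the example string from the question, it could be wrapped in a function. This is a sketch rather than a definitive version: `countThreeWordPhrases` is a hypothetical name, and as a variation it splits words with `preg_split` on whitespace instead of `explode(' ')`, so that tabs and newlines in scraped text also separate words.

```php
<?php
// Hypothetical wrapper name; the logic follows the sliding-window loop above.
function countThreeWordPhrases(string $string): array
{
    $phrases = [];
    // Split into sentences on ?, . and !
    foreach (preg_split('/[?.!]/', $string) as $sentence) {
        // Split on any run of whitespace and drop empty pieces.
        $words = preg_split('/\s+/', $sentence, -1, PREG_SPLIT_NO_EMPTY);
        // Slide a three-word window across the sentence.
        while (count($words) >= 3) {
            $phrase = implode(' ', array_slice($words, 0, 3));
            $phrases[$phrase] = ($phrases[$phrase] ?? 0) + 1;
            array_shift($words);
        }
    }
    return $phrases;
}

$result = countThreeWordPhrases('the best stack post');
print_r($result);
```

For the question's string this yields the two phrases asked for, "the best stack" and "best stack post", each with a count of 1.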