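I hope you can help me.
I created the following script. Its purpose is to dynamically build the meta description tag and the meta keywords tag for a page.
I still need to apply size limits to the meta description and the keywords, which are held in $str2 and $key respectively.
$key prints the meta keywords for me, and they are relevant to the page. My question is: how do I remove all stop words from the $key variable?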
<?php
$url = 'http://localhost/index.asp';
$url_content = file_get_contents($url);
//$url_content = strip_tags($url_content);
$str = $url_content;
/**
* Remove HTML tags, including invisible text such as style and
* script code, and embedded objects. Add line breaks around
* block-level tags to prevent word joining after tag removal.
*/
$str = preg_replace(
array(
// Remove invisible content
'@<head[^>]*?>.*?</head>@siu',
'@<style[^>]*?>.*?</style>@siu',
'@<script[^>]*?.*?</script>@siu',
'@<object[^>]*?.*?</object>@siu',
'@<embed[^>]*?.*?</embed>@siu',
'@<applet[^>]*?.*?</applet>@siu',
'@<noframes[^>]*?.*?</noframes>@siu',
'@<noscript[^>]*?.*?</noscript>@siu',
'@<noembed[^>]*?.*?</noembed>@siu',
'@<h1[^>]*?.*?</h1>@siu',
'@<a[^>]*?.*?</a>@siu',
// Add line breaks before and after blocks
'@</?((address)|(blockquote)|(center)|(del))@iu',
'@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
'@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
'@</?((table)|(th)|(td)|(caption))@iu',
'@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
'@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
'@</?((frameset)|(frame)|(iframe))@iu',
),
array(
' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
"\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",
"\n\$0", "\n\$0", "\n\$0", "\n\$0",
),
$str );
$str1 = strip_tags($str);
$str2 = strip_tags($str);
echo $str2.'<hr />';
$words = str_word_count(strtolower($str1),1);
$numWords = count($words);
//array_count_values()// returns an array using the values of the input array as keys and their frequency in input as values.
$word_count = (array_count_values($words));
arsort($word_count);
foreach ($word_count as $key=>$val) {
echo "$key, ";
}
?>
I found this script:
function del_stop_words($kw){
$kw = array_map('strtolower',array_diff($kw,array("")));
$sw = explode("\r\n",file_get_contents('http://localhost/stopwords.txt'));
return array_values(array_diff($kw,$sw));
}
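but I am not 100% sure how to integrate it with the script above. I have already created the stop-word file. I just need to remove the stop words from the $key variable.
Thanks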
Answer 0 (score: 0)
Try this:
$key = "I want to remove some bad words from my text, like sex racist etc...";
$swords = explode("\n", str_replace(array("\r\n", "\r"), "\n", file_get_contents('swords.txt')));
$key = str_replace($swords, "", $key );
echo $key; // echoes "I want to remove some bad words from my text, like   etc..." if swords.txt lists "sex" and "racist" (the surrounding spaces stay)
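Note that str_replace() matches substrings, so a stop word is also stripped from inside longer words. If that is a problem, here is a minimal sketch of whole-word removal. The helper name remove_stop_words() is made up for illustration, and the inline array stands in for the contents of swords.txt:

```php
<?php
// Hypothetical helper (not part of the original script): remove stop words
// from a string, matching whole words only.
function remove_stop_words(string $text, array $stopWords): string
{
    // Drop blank entries, e.g. from a trailing newline in the stop-word file.
    $stopWords = array_filter(array_map('trim', $stopWords), 'strlen');
    if (!$stopWords) {
        return $text;
    }
    // Escape each word so it is safe inside a regular expression.
    $escaped = array_map(function ($w) { return preg_quote($w, '/'); }, $stopWords);
    // The \b anchors keep "sex" from matching inside "Essex".
    $pattern = '/\b(?:' . implode('|', $escaped) . ')\b/iu';
    $text = preg_replace($pattern, '', $text);
    // Collapse the whitespace runs left behind by the removed words.
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Assumption: this inline list stands in for the lines of swords.txt.
$swords = array('sex', 'racist');
echo remove_stop_words('I want to remove some bad words from my text, like sex racist etc...', $swords);
```

Unlike the str_replace() version, this leaves words such as "Essex" untouched and does not leave double spaces behind.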
Your complete code would look like this:
<?php
$url = 'http://localhost/index.asp';
$url_content = file_get_contents($url);
//$url_content = strip_tags($url_content);
$str = $url_content;
/**
* Remove HTML tags, including invisible text such as style and
* script code, and embedded objects. Add line breaks around
* block-level tags to prevent word joining after tag removal.
*/
$str = preg_replace(
array(
// Remove invisible content
'@<head[^>]*?>.*?</head>@siu',
'@<style[^>]*?>.*?</style>@siu',
'@<script[^>]*?.*?</script>@siu',
'@<object[^>]*?.*?</object>@siu',
'@<embed[^>]*?.*?</embed>@siu',
'@<applet[^>]*?.*?</applet>@siu',
'@<noframes[^>]*?.*?</noframes>@siu',
'@<noscript[^>]*?.*?</noscript>@siu',
'@<noembed[^>]*?.*?</noembed>@siu',
'@<h1[^>]*?.*?</h1>@siu',
'@<a[^>]*?.*?</a>@siu',
// Add line breaks before and after blocks
'@</?((address)|(blockquote)|(center)|(del))@iu',
'@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
'@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
'@</?((table)|(th)|(td)|(caption))@iu',
'@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
'@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
'@</?((frameset)|(frame)|(iframe))@iu',
),
array(
' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
"\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",
"\n\$0", "\n\$0", "\n\$0", "\n\$0",
),
$str );
$str1 = strip_tags($str);
$str2 = strip_tags($str);
echo $str2.'<hr />';
$words = str_word_count(strtolower($str1),1);
$numWords = count($words);
//array_count_values()// returns an array using the values of the input array as keys and their frequency in input as values.
$word_count = (array_count_values($words));
arsort($word_count);
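// Load the stop words once, outside the loop, and trim stray whitespace from each line.
$swords = array_map('trim', explode("\n", str_replace(array("\r\n", "\r"), "\n", file_get_contents('swords.txt'))));
foreach ($word_count as $key=>$val) {
// Skip whole-word matches only; str_replace() would also strip stop words from inside longer words.
if (in_array((string)$key, $swords, true)) continue;
echo "$key, ";
}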
?>
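The question also mentions size limits for the description and the keywords. The sketch below shows one way that could be done, under some assumptions: the helper names top_keywords() and limit_description() are invented here, the limits of 10 keywords and 160 bytes are placeholders (the question gives no numbers), and the sample arrays stand in for $word_count and swords.txt.

```php
<?php
// Hypothetical helper: drop stop words from a word => count map and
// return at most $limit of the most frequent remaining words.
function top_keywords(array $wordCount, array $stopWords, int $limit = 10): array
{
    $stopWords = array_map('strtolower', array_filter(array_map('trim', $stopWords), 'strlen'));
    // Compare by key, so only exact whole-word matches are removed.
    $filtered = array_diff_key($wordCount, array_flip($stopWords));
    arsort($filtered);
    return array_slice(array_keys($filtered), 0, $limit);
}

// Hypothetical helper: cut a description to at most $max bytes, ending on a
// word boundary when possible. substr() counts bytes, not characters.
function limit_description(string $text, int $max = 160): string
{
    $text = trim(preg_replace('/\s+/', ' ', $text));
    if (strlen($text) <= $max) {
        return $text;
    }
    // Look one byte past the limit so a word that ends exactly at $max is kept.
    $cut = substr($text, 0, $max + 1);
    $lastSpace = strrpos($cut, ' ');
    return $lastSpace === false ? substr($text, 0, $max) : substr($text, 0, $lastSpace);
}

// Assumption: sample data standing in for $word_count and swords.txt.
$word_count = array('the' => 9, 'php' => 5, 'other' => 4, 'and' => 3, 'meta' => 2);
$swords = array('the', 'and');
echo implode(', ', top_keywords($word_count, $swords, 3)); // php, other, meta
```

In the full script you would pass $word_count and the loaded $swords to top_keywords(), and $str2 to limit_description(), before writing the meta tags.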