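Dynamically creating keywords and description from a URL - PHP stop words problem

Time: 2011-06-03 23:06:34

Tags: php preg-replace array-difference

I hope you can help me.

I created the following script. Its purpose is to dynamically generate the meta description tag and the meta keywords tag for a page.

I still need to apply size limits to the meta description and keywords tags, which are held in $str2 and $key respectively.

$key prints out the meta keywords that are relevant to the page. My question is: how do I remove all stop words from the $key variable?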

   <?php 
    $url = 'http://localhost/index.asp';
    $url_content = file_get_contents($url);
    //$url_content = strip_tags($url_content);
    $str = $url_content;

    /**
     * Remove HTML tags, including invisible text such as style and
     * script code, and embedded objects.  Add line breaks around
     * block-level tags to prevent word joining after tag removal.
     */
        $str = preg_replace(
            array(
              // Remove invisible content
                '@<head[^>]*?>.*?</head>@siu',
                '@<style[^>]*?>.*?</style>@siu',
                '@<script[^>]*?.*?</script>@siu',
                '@<object[^>]*?.*?</object>@siu',
                '@<embed[^>]*?.*?</embed>@siu',
                '@<applet[^>]*?.*?</applet>@siu',
                '@<noframes[^>]*?.*?</noframes>@siu',
                '@<noscript[^>]*?.*?</noscript>@siu',
                '@<noembed[^>]*?.*?</noembed>@siu',
                '@<h1[^>]*?.*?</h1>@siu',
                '@<a[^>]*?.*?</a>@siu',
              // Add line breaks before and after blocks
                '@</?((address)|(blockquote)|(center)|(del))@iu',
                '@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
                '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
                '@</?((table)|(th)|(td)|(caption))@iu',
                '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
                '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
                '@</?((frameset)|(frame)|(iframe))@iu',
            ),
            array(
                ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
                "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",
                "\n\$0", "\n\$0", "\n\$0", "\n\$0",
            ),
            $str );
        $str1 =  strip_tags($str);
        $str2 =  strip_tags($str);
        echo $str2.'<hr />';

    $words = str_word_count(strtolower($str1),1);
    $numWords = count($words);
    //array_count_values()// returns an array using the values of the input array as keys and their frequency in input as values.
    $word_count = (array_count_values($words));
    arsort($word_count);

    foreach ($word_count as $key=>$val) {
    echo "$key, ";
    }
    ?>

I found this script:

function del_stop_words($kw){
 $kw = array_map('strtolower',array_diff($kw,array("")));
 $sw = explode("\r\n",file_get_contents('http://localhost/stopwords.txt'));
 return array_values(array_diff($kw,$sw));
 }

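but I'm not 100% sure how to integrate it with the script above. I have already created the stop words file. I just need a way to remove the stop words from the $key variable.

Thanks

1 answer:

Answer 0: (score: 0)

Try this: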

$key = "I want to remove some bad words from my text, like sex racist etc...";
$swords = explode("\n", str_replace(array("\r\n", "\r"), "\n", file_get_contents('swords.txt')));
$key = str_replace($swords, "", $key );
echo $key; // with "sex" and "racist" listed in swords.txt, both are removed from the text (the spaces around them remain)
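
Note that str_replace() matches substrings, so a stop word such as "in" would also be stripped out of "print". Below is a minimal sketch of a whole-word variant. The helper name remove_stop_words() and the inline $stopWords list are assumptions made for illustration; in practice the list would be read from swords.txt as shown above.

```php
<?php
// Hypothetical stop-word list; in the snippet above it would come from swords.txt.
$stopWords = array('i', 'to', 'some', 'from', 'my', 'in');

/**
 * Remove whole-word stop words from a text (case-insensitive).
 * remove_stop_words() is an illustrative helper, not part of the original answer.
 */
function remove_stop_words($text, array $stopWords)
{
    // Format 1 makes str_word_count() return the words as an array.
    $words = str_word_count(strtolower($text), 1);
    // Keep only the words that are not in the stop list (whole-word comparison).
    $kept = array_diff($words, array_map('strtolower', $stopWords));
    return implode(' ', $kept);
}

echo remove_stop_words('I want to print some words from my text', $stopWords);
```

Because the comparison is done word by word, "print" survives even though "in" is in the stop list. Punctuation is dropped, since str_word_count() only returns words.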

Your complete code would look like this:

<?php 
    $url = 'http://localhost/index.asp';
    $url_content = file_get_contents($url);
    //$url_content = strip_tags($url_content);
    $str = $url_content;

    /**
     * Remove HTML tags, including invisible text such as style and
     * script code, and embedded objects.  Add line breaks around
     * block-level tags to prevent word joining after tag removal.
     */
        $str = preg_replace(
            array(
              // Remove invisible content
                '@<head[^>]*?>.*?</head>@siu',
                '@<style[^>]*?>.*?</style>@siu',
                '@<script[^>]*?.*?</script>@siu',
                '@<object[^>]*?.*?</object>@siu',
                '@<embed[^>]*?.*?</embed>@siu',
                '@<applet[^>]*?.*?</applet>@siu',
                '@<noframes[^>]*?.*?</noframes>@siu',
                '@<noscript[^>]*?.*?</noscript>@siu',
                '@<noembed[^>]*?.*?</noembed>@siu',
                '@<h1[^>]*?.*?</h1>@siu',
                '@<a[^>]*?.*?</a>@siu',
              // Add line breaks before and after blocks
                '@</?((address)|(blockquote)|(center)|(del))@iu',
                '@</?((div)|(h[1-9])|(ins)|(isindex)|(p)|(pre))@iu',
                '@</?((dir)|(dl)|(dt)|(dd)|(li)|(menu)|(ol)|(ul))@iu',
                '@</?((table)|(th)|(td)|(caption))@iu',
                '@</?((form)|(button)|(fieldset)|(legend)|(input))@iu',
                '@</?((label)|(select)|(optgroup)|(option)|(textarea))@iu',
                '@</?((frameset)|(frame)|(iframe))@iu',
            ),
            array(
                ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
                "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0", "\n\$0",
                "\n\$0", "\n\$0", "\n\$0", "\n\$0",
            ),
            $str );
        $str1 =  strip_tags($str);
        $str2 =  strip_tags($str);
        echo $str2.'<hr />';

    $words = str_word_count(strtolower($str1),1);
    $numWords = count($words);
    //array_count_values()// returns an array using the values of the input array as keys and their frequency in input as values.
    $word_count = (array_count_values($words));
    arsort($word_count);

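    // Load the stop words once, outside the loop; normalise line endings and drop empty lines.
    $swords = array_filter(array_map('trim', explode("\n", str_replace(array("\r\n", "\r"), "\n", file_get_contents('swords.txt')))), 'strlen');

    foreach ($word_count as $key=>$val) {
        // Skip whole-word matches; str_replace() would also cut stop words out of longer words.
        if (in_array((string) $key, $swords, true)) {
            continue;
        }
        echo "$key, ";
    }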
    ?>
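
The question also mentions size limits for the description and keywords tags. Here is a minimal sketch of one way to apply them. The limits of 10 keywords and 160 characters are hypothetical, and $word_count, $swords and $str2 are given stand-in values in place of what the script above produces.

```php
<?php
// Stand-ins for values produced by the script above (assumptions for illustration).
$word_count = array('the' => 9, 'php' => 5, 'and' => 4, 'keywords' => 3, 'meta' => 2);
$swords     = array('the', 'and', '');   // e.g. lines read from swords.txt
$str2       = "Dynamically   created meta description\nfor the page";

// Drop stop words by key so only whole words are removed; the frequency order is kept.
$filtered = array_diff_key($word_count, array_flip(array_filter($swords, 'strlen')));

// Hypothetical limits: at most 10 keywords and 160 bytes of description.
// substr() counts bytes; multibyte text would need a multibyte-aware function instead.
$keywords    = implode(', ', array_slice(array_keys($filtered), 0, 10));
$description = substr(trim(preg_replace('/\s+/', ' ', $str2)), 0, 160);

echo $keywords;
```

Filtering before slicing means the limit applies to real keywords only, so stop words never use up one of the 10 slots.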