Finding keywords in a text file with PHP arsort()

Date: 2014-04-18 17:13:43

Tags: php sorting search keyword meta

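I'm trying to use a script to search a text file and return words that meet certain criteria:

* The word is listed only once
* The word is not one of the words in the ignore list
* The word is among the top 10% of longest words
* The word does not repeat any letters
* The final list will be a random ten of the words that meet the criteria above
* If any of the above is false, the reported word will be null

I put the following together, but the script dies at arsort(), saying it expects an array. Can anyone suggest a change to make arsort() work? Or suggest an alternative (simpler) script for finding metadata? Note: I realize the second question may be better suited to another StackExchange site.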

<?php
  $fn="../story_link";
  $str=readfile($fn);
    function top_words($str, $limit=10, $ignore=""){
        if(!$ignore) $ignore = "the of to and a in for is The that on said with be was by"; 
        $ignore_arr = explode(" ", $ignore);
        $str = trim($str);
        $str = preg_replace("#[&].{2,7}[;]#sim", " ", $str);
        $str = preg_replace("#[()°^!\"§\$%&/{(\[)\]=}?´`,;.:\-_\#'~+*]#", " ", $str);
        $str = preg_replace("#\s+#sim", " ", $str);
        $arraw = explode(" ", $str);
        foreach($arraw as $v){
            $v = trim($v);
            if(strlen($v)<3 || in_array($v, $ignore_arr)) continue;
            $arr[$v]++;
        }
        arsort($arr);   
        return array_keys( array_slice($arr, 0, $limit) );
    }
    $meta_keywords = implode(", ", top_words( strip_tags( $html_content ) ) );
?>

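2 answers:

Answer 0 (score: 2)

The problem is that when your loop never increments $arr[$v], $arr is never defined. That is the cause of your error: arsort() is given null as its argument rather than an array.

The solution is to define $arr as an array before the loop, which covers the case where $arr[$v]++; is never executed.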

function top_words($str, $limit=10, $ignore=""){
    if(!$ignore) $ignore = "the of to and a in for is The that on said with be was by"; 
    $ignore_arr = explode(" ", $ignore);
    $str = trim($str);
    $str = preg_replace("#[&].{2,7}[;]#sim", " ", $str);
    $str = preg_replace("#[()°^!\"§\$%&/{(\[)\]=}?´`,;.:\-_\#'~+*]#", " ", $str);
    $str = preg_replace("#\s+#sim", " ", $str);
    $arraw = explode(" ", $str);
    $arr = array(); // Defined $arr here.
    foreach($arraw as $v){
        $v = trim($v);
        if(strlen($v)<3 || in_array($v, $ignore_arr)) continue;
        $arr[$v] = isset($arr[$v]) ? $arr[$v] + 1 : 1; // start at 1 on first sight; avoids an undefined-index notice
    }
    arsort($arr);   
    return array_keys( array_slice($arr, 0, $limit) );
}
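
As for why the loop may never count a word in the question's code as posted: readfile() writes the file to output and returns the number of bytes read, not the contents, and $html_content is not assigned anywhere in the snippet, so top_words() would receive an empty string. A minimal sketch of reading the file into a string instead, assuming the file holds HTML. The temporary file and its contents are made up for illustration and stand in for ../story_link:

```php
<?php
// Hypothetical stand-in for ../story_link so the sketch is self-contained.
$fn = tempnam(sys_get_temp_dir(), 'story');
file_put_contents($fn, "<p>The quick brown fox</p>");

// readfile() echoes the file and returns a byte count (int), not the text.
ob_start();                  // capture the echoed output for the demo
$bytes  = readfile($fn);
$echoed = ob_get_clean();

// file_get_contents() returns the contents as a string.
$html_content = file_get_contents($fn);
$plain        = strip_tags($html_content); // text with the markup removed

unlink($fn); // clean up the temporary file
```

With the text in $html_content, the call top_words(strip_tags($html_content)) from the question has actual words to count.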

Answer 1 (score: 0)

I came across a nice piece of code:

        <?php
    function extract_keywords($str, $minWordLen = 3, $minWordOccurrences = 2, $asArray = false, $maxWords = 5, $restrict = true)
    {
        $str = str_replace(array("?","!",";","(",")",":","[","]"), " ", $str);
        $str = str_replace(array("\n","\r","  "), " ", $str);
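        $str = strtolower($str); // strtolower() returns the result; it does not modify $str in place

        // Guard the nested declaration so a second call to extract_keywords()
        // does not fail with "Cannot redeclare keyword_count_sort()".
        if (!function_exists('keyword_count_sort')) {
            function keyword_count_sort($first, $sec)
            {
                return $sec[1] - $first[1];
            }
        }
        // The u modifier makes \p{L} match whole UTF-8 letters (assumes UTF-8 input)
        $str = preg_replace('/[^\p{L}0-9 ]/u', ' ', $str);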
        $str = trim(preg_replace('/\s+/', ' ', $str));

        $words = explode(' ', $str);

        // If we don't restrict tag usage, we'll remove common words from array
        if ($restrict == false) {
        $commonWords = array('a','able','about','above', 'get a list here http://www.wordfrequency.info','you\'ve','z','zero');
        $words = array_udiff($words, $commonWords,'strcasecmp');
        }

        // Restrict Keywords based on values in the $allowedWords array
        // Use if you want to limit available tags
        if ($restrict == true) {
        $allowedWords =  array('engine','boeing','electrical','pneumatic','ice','pressurisation');
        $words = array_uintersect($words, $allowedWords,'strcasecmp');
        }

        $keywords = array();

        while(($c_word = array_shift($words)) !== null)
        {
            if(strlen($c_word) < $minWordLen) continue;

            $c_word = strtolower($c_word);
            if(array_key_exists($c_word, $keywords)) $keywords[$c_word][1]++;
            else $keywords[$c_word] = array($c_word, 1);
        }
        usort($keywords, 'keyword_count_sort');

        $final_keywords = array();
        foreach($keywords as $keyword_det)
        {
            if($keyword_det[1] < $minWordOccurrences) break;
            array_push($final_keywords, $keyword_det[0]);
        }
        $final_keywords = array_slice($final_keywords, 0, $maxWords);
        return $asArray ? $final_keywords : implode(', ', $final_keywords);
    }


    $text = "Many systems that traditionally had a reliance on the pneumatic system have been transitioned to the electrical architecture. They include engine start, API start, wing ice protection, hydraulic pumps and cabin pressurisation. The only remaining bleed system on the 787 is the anti-ice system for the engine inlets. In fact, Boeing claims that the move to electrical systems has reduced the load on engines (from pneumatic hungry systems) by up to 35 percent (not unlike today’s electrically power flight simulators that use 20% of the electricity consumed by the older hydraulically actuated flight sims).";

    echo extract_keywords($text);

    // Advanced Usage
    // $exampletext = "The quick brown fox jumped over the lazy dog. The quick brown fox jumped over the lazy dog. The quick brown fox jumped over the lazy dog.";
    // echo extract_keywords($exampletext, 3, 1, false, 5, false);
    ?>
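
The criteria listed in the question could be sketched roughly as follows. This is one reading of them, not a definitive implementation: candidate_keywords() and its parameters are hypothetical names, "listed only once" is taken to mean de-duplicating the word list, only ASCII letters are treated as word characters, and the filters run in the order the question lists them (so the 10% cut happens before the repeated-letter check). An empty array is returned when nothing qualifies.

```php
<?php
// Sketch only: the function name, parameters and defaults are assumptions.
function candidate_keywords($text, array $ignore = array(), $limit = 10)
{
    // Lower-case first so "The" and "the" are the same word, then split on
    // anything that is not an ASCII letter.
    $words = preg_split('/[^a-z]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    // List each word once and drop the ignored words.
    $words = array_values(array_diff(array_unique($words), $ignore));
    if (!$words) {
        return array();
    }

    // Keep the longest 10% of the remaining words (always at least one).
    usort($words, function ($a, $b) {
        return strlen($b) - strlen($a);
    });
    $words = array_slice($words, 0, (int) ceil(count($words) * 0.1));

    // Drop words that repeat a letter.
    $words = array_filter($words, function ($w) {
        $letters = str_split($w);
        return count($letters) === count(array_unique($letters));
    });

    // Pick up to $limit of the survivors at random.
    $words = array_values($words);
    shuffle($words);
    return array_slice($words, 0, $limit);
}
```

Because long words often repeat a letter, applying the 10% cut first can leave very few candidates; running the repeated-letter filter before the cut is an equally defensible reading of the list.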