Cleaning up an array of words

Date: 2012-09-20 15:40:43

Tags: php

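I have a function that strips the HTML, puts the words into an array, and then calls array_count_values. I'm trying to report how many times each word occurs. The array output is very messy. I've tried to clean it up, but I'm getting nowhere. I'd like to remove the phone numbers, and for some reason phrases are being pushed together into a single key. The first array entry also appears to be null, but isset() or empty() doesn't seem to unset it.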

                $body = $this->get_response($domain);
                $body = preg_replace('/<body(.*?)>/i', '<body>', $body);
                $body = preg_replace('#</body>#i', '</body>', $body);

                $openTag = '<body>';
                $start = strpos($body, $openTag);
                $start += strlen($openTag);

                $closeTag = '</body>';
                $end = strpos($body, $closeTag);

                // Return if cannot cut-out the body
                if ($end <= $start || $start === false || $end === false) {
                    $this->setValue('');
                    return;
                }

                $body = substr($body, $start, $end - $start);
                $body = preg_replace(array(
                       '@<script[^>]*?>.*?</script>@si',    // Strip out javascript
                       '@<style[^>]*?>.*?</style>@siU',     // Strip style tags properly
                       '@<![\s\S]*?--[ \t\n\r]*>@',         // Strip multi-line comments including CDATA
                       '/style=([\"\']??)([^\">]*?)\\1/siU',// Strip inline style attribute
                       ), '', $body);

                $body = strip_tags($body);
                $body = array_filter(explode(' ', $body), create_function('$str', 'return strlen($str) > 2;'));
                $body = array_map('trim', $body);
                $words = $body;

                $i = 0;

                $words = array_count_values($words);

                foreach($words as $word){

                    if (empty($word)) unset($words[$i]);
                    $i++;

                }

                echo "<pre>";
                    print_r($words);
                    echo "</pre>";

Output

Array
(
    [] => 28
    [333.444.5555] => 1
    [facebook] => 2
    [twitter] => 2
    [linkedin] => 2
    [youtube

                googleplus] => 1
    [About

    History
    Our] => 1
    [Mission
    Who] => 1
    [This
     That
     Other] => 1
    [Us


English

    FA
    Football] => 1
    [Media
    Pay] => 2
    [Per] => 4
    [Think
    Fast] => 2
    [Marketing
    Design] => 1
    [Consulting


Case] => 2

1 answer:

Answer 0 (score: 1)

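I'm afraid explode(' ', $body) is not enough, because the space is not the only whitespace character: tabs and newlines also separate words, which is why several words end up fused into one key. Use preg_split instead: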

$body = array_filter(preg_split('/\s+/', $body), 
            create_function('$str', 'return strlen($str) > 2;'));
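The same idea can be taken a bit further to cover the other two complaints in the question. The sketch below is only an illustration under stated assumptions: count_words() is a hypothetical helper, not part of the original code, and the phone pattern assumes numbers shaped like 333.444.5555. The empty key most likely appears because whitespace-only tokens longer than two characters pass the length filter and are only trimmed to '' afterwards; passing PREG_SPLIT_NO_EMPTY avoids producing them in the first place. Note also that the foreach loop in the question cannot remove that entry, since $word there is the count (28, never empty) and $i is a numeric index while the keys are strings. A closure is used in place of create_function, which was removed in PHP 8.0.

```php
<?php
// Hypothetical helper (not from the original post): split on any whitespace,
// drop short tokens and phone-number-like tokens, then count occurrences.
function count_words(string $text, int $minLength = 3): array
{
    // Split on runs of any whitespace (spaces, tabs, newlines).
    // PREG_SPLIT_NO_EMPTY discards empty pieces, so no '' key can appear.
    $tokens = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    $words = array_filter($tokens, function ($token) use ($minLength) {
        // Assumed phone format: digit groups separated by '.' or '-',
        // e.g. 333.444.5555. Adjust the pattern for other formats.
        if (preg_match('/^\d{3}[.-]\d{3}[.-]\d{4}$/', $token)) {
            return false;
        }
        // Same threshold as the original filter (strlen > 2).
        return strlen($token) >= $minLength;
    });

    return array_count_values($words);
}

// Sample input mixing spaces, tabs and newlines, plus a phone number.
$counts = count_words("About\n\n    History\tOur 333.444.5555 facebook  facebook");
print_r($counts);
```

With this sample input, "About", "History" and "Our" are each counted once and "facebook" twice; the phone number is dropped. Punctuation attached to words is not handled here and would need an extra cleanup step.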