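I have a function that strips the HTML, puts the words into an array, and then runs array_count_values on it. I am trying to report the number of occurrences of each word. The array output is very messy. I have tried to clean it up, but I am getting nowhere. I want to remove phone numbers, and for some reason phrases are being pushed together into a single key. The first key also appears to be empty, but checking with isset() or empty() does not seem to unset it.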
$body = $this->get_response($domain);
$body = preg_replace('/<body(.*?)>/i', '<body>', $body);
$body = preg_replace('#</body>#i', '</body>', $body);
$openTag = '<body>';
$closeTag = '</body>';
$start = strpos($body, $openTag);
$end = strpos($body, $closeTag);
// Return if the body cannot be cut out. Test for false before doing
// arithmetic on $start, otherwise false is silently treated as 0.
if ($start === false || $end === false || $end <= $start + strlen($openTag)) {
    $this->setValue('');
    return;
}
$start += strlen($openTag);
$body = substr($body, $start, $end - $start);
$body = preg_replace(array(
    '@<script[^>]*?>.*?</script>@si', // Strip out javascript
    '@<style[^>]*?>.*?</style>@siU', // Strip style tags properly
    '@<![\s\S]*?--[ \t\n\r]*>@', // Strip multi-line comments including CDATA
    '/style=([\"\']??)([^\">]*?)\\1/siU', // Strip inline style attribute
), '', $body);
$body = strip_tags($body);
$body = array_filter(explode(' ', $body), function ($str) { return strlen($str) > 2; });
$body = array_map('trim', $body);
$words = $body;
$i = 0;
$words = array_count_values($words);
foreach ($words as $word) {
    if (empty($word)) unset($words[$i]);
    $i++;
}
echo "<pre>";
print_r($words);
echo "</pre>";
Output:
Array
(
[] => 28
[333.444.5555] => 1
[facebook] => 2
[twitter] => 2
[linkedin] => 2
[youtube
googleplus] => 1
[About
History
Our] => 1
[Mission
Who] => 1
[This
That
Other] => 1
[Us
English
FA
Football] => 1
[Media
Pay] => 2
[Per] => 4
[Think
Fast] => 2
[Marketing
Design] => 1
[Consulting
Case] => 2
Answer 0: (score: 1)
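I'm afraid explode(' ', $body) is not enough, because a space is not the only whitespace character. Words separated by newlines or tabs stay glued together in one array element. Use preg_split instead.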
// An anonymous function is used here because create_function was removed in PHP 8.0.
$body = array_filter(preg_split('/\s+/', $body),
    function ($str) { return strlen($str) > 2; });
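The other two symptoms in the question can probably be handled in the same pass. The cleanup loop there tests the count (the array value) rather than the word, and it unsets a numeric index `$i` although array_count_values keys the result by word, so the empty key is never touched. Below is a minimal sketch of one way to combine the steps. The function name `count_words` and the phone-number pattern are assumptions made for illustration, not part of the original code, and the pattern only covers numbers shaped like 333.444.5555.

```php
<?php
// Hypothetical helper: counts words in text that has already had its tags stripped.
function count_words(string $text): array
{
    // Split on any run of whitespace (spaces, tabs, newlines).
    // PREG_SPLIT_NO_EMPTY drops empty strings, so no '' key can appear later.
    $tokens = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    $tokens = array_filter($tokens, function ($token) {
        // Assumed phone format: 3 digits, 3 digits, 4 digits,
        // optionally separated by '.' or '-'.
        if (preg_match('/^\d{3}[.\-]?\d{3}[.\-]?\d{4}$/', $token)) {
            return false;
        }
        // Same length rule as the original code: keep words longer than 2 characters.
        return strlen($token) > 2;
    });

    // Keys are the words, values are their occurrence counts.
    return array_count_values($tokens);
}

$sample = "333.444.5555 facebook twitter\nyoutube\tgoogleplus  facebook Us";
print_r(count_words($sample));
```

For the sample string, the phone number and the two-letter word are dropped, "facebook" is counted twice, and "youtube" and "googleplus" end up as separate keys even though only a tab separates them. Filtering before counting avoids having to unset keys afterwards.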