Question

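I'm trying to count the occurrences of a string in an HTML document returned by a curl request. Normally I would use substr_count for this, but I want to match only user-visible text (the text displayed when a browser loads the page), not every match in the source. For example, given the following paragraph: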

<p class="example">example</p>

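When searching for the string "example", I want it counted once here, because the class name should be left out of the count. I'm already using DOMXPath to parse other parts of the HTML document, so I also considered using it for this: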

$xpath->query("//text()[contains(., 'example')]");

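which I've seen others use to find text in a document, but this also seems to count results inside tags. Is there a way to count only user-visible text? Note that "user-visible" just means the text is not part of metadata, attributes, and so on. If an element is styled to be invisible but would otherwise produce visible text, that text should still be counted. For example: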

<p class="example" style="visibility:hidden">example</p>

should still be counted once, as before.

Edit

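strip_tags handles the cases I showed. Is there a way to handle occurrences found in scripts and the like? The following should not contribute to the count: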

<script type="text/javascript">var example = 1 ....other stuff....</script>

Answer 1

A simple approach is to strip the tags.

$str = '<p class="example">example</p>
<p class="example" style="visibility:hidden">example</p>
<script type="text/javascript">var example = 1 
....other stuff....
</script>';

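// Split on any newline style so "\r\n" and "\n" input both work.
$arr = preg_split('/\R/', $str);
$lineCount = count($arr); // cache the length: count($arr) shrinks as lines are unset

for ($i = 0; $i < $lineCount; $i++) {
    if (strpos($arr[$i], "hidden") !== false) {
        // Drop any line containing "hidden". The question asks for hidden
        // text to still be counted; delete this branch to get that behaviour.
        unset($arr[$i]);
    } elseif (strpos($arr[$i], "<script") !== false) {
        // Drop every line of the script block up to the closing tag.
        while ($i < $lineCount && strpos($arr[$i], "</script") === false) {
            unset($arr[$i]);
            $i++;
        }
        unset($arr[$i]); // and remove the last line with "</script"
    }
}
$str = implode(PHP_EOL, $arr);

echo substr_count(strip_tags($str), "example");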

https://3v4l.org/d4JN5
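
Since the question already uses DOMXPath, the same count could also be sketched without line-based filtering, by selecting only text nodes that are not inside script or style elements. This is a minimal sketch rather than a definitive implementation: the function name countVisibleOccurrences is hypothetical, the exclusion of style is an assumption beyond what the question states, and text in elements such as title would still be counted. It also assumes the input is ASCII or declares its charset, since loadHTML may otherwise misread the encoding.

```php
<?php
// Hypothetical helper (not from the original answer): counts occurrences of
// $needle in text nodes, skipping the content of <script> and <style>.
function countVisibleOccurrences(string $html, string $needle): int
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally so imperfect markup stays quiet.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Text nodes with neither a <script> nor a <style> ancestor.
    // Attribute values are never text nodes, so class names are skipped.
    $nodes = $xpath->query(
        '//text()[not(ancestor::script) and not(ancestor::style)]'
    );

    $count = 0;
    foreach ($nodes as $node) {
        // Count every occurrence inside the node, not just one per node.
        $count += substr_count($node->nodeValue, $needle);
    }
    return $count;
}

$html = '<p class="example">example</p>
<p class="example" style="visibility:hidden">example</p>
<script type="text/javascript">var example = 1;</script>';

echo countVisibleOccurrences($html, 'example'), PHP_EOL;
```

With the sample markup from the question, the two paragraph texts are counted (including the one styled as hidden), while the class attributes and the script variable are not.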

Count occurrences of visible text in an HTML document
