Question

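I am using SIMPLE_HTML_DOM and testing the parser on the HTML DOM returned from this URL: HERE

The H1 elements are not found... I tried returning all the divs, and that works.

I am using a simple query to diagnose the problem: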

    foreach($html->find('H1') as $value) { echo "<br />F: ".htmlspecialchars($value); }

While looking at the source code, I noticed:

The h1 tag is written in uppercase (H1), but SIMPLE_HTML... is handling that:

    //PaperG - If lowercase is set, do a case insensitive test of the value of the selector.
    if ($lowercase) {
        $check = $this->match($exp, strtolower($val), strtolower($nodeKeyValue));
    } else {
        $check = $this->match($exp, $val, $nodeKeyValue);
    }
    if (is_object($debugObject)) {$debugObject->debugLog(2, "after match: " . ($check ? "true" : "false"));}

Can anyone help me understand what is going on here?

Answer 1

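Found it...

But I can't explain it!

I tested other code that includes an H1 (uppercase) and it works.

While experimenting with the SIMPLE_HTML_DOM code, I commented out the "remove_noise" calls for script tags, and now it works perfectly. I think this is because the site has invalid HTML, so the noise remover removes too much and does not stop after the closing script tag: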

    // $this->remove_noise("'<\s*script[^>]*[^/]>(.*?)<\s*/\s*script\s*>'is");
    // $this->remove_noise("'<\s*script\s*>(.*?)<\s*/\s*script\s*>'is");
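
The over-removal described above can be reproduced without the library. The following is a minimal sketch, assuming a hypothetical page in which one script tag is never closed. It applies the first of the two patterns with plain preg_replace, which only approximates whatever remove_noise does internally; the sample markup and variable names are made up for illustration.

```php
<?php
// Hypothetical markup: the first <script> tag is never closed (invalid HTML).
$html = <<<HTML
<div>before</div>
<script type="text/javascript" src="a.js">
<H1>Title</H1>
<script>var x = 1;</script>
<p>after</p>
HTML;

// Same expression as the first remove_noise call, with "~" as the delimiter.
$pattern = '~<\s*script[^>]*[^/]>(.*?)<\s*/\s*script\s*>~is';

// Assumption: stripping the match stands in for the library's noise removal.
$cleaned = preg_replace($pattern, '', $html);

// The lazy (.*?) runs from the unclosed tag to the first </script> it finds,
// so the <H1> sitting between them is removed together with the scripts.
var_dump(strpos($cleaned, '<H1>') === false);
echo $cleaned;
```

If the first script tag had its own closing tag, the match would end there and the H1 would survive, which fits the guess that invalid HTML on the page is what triggers the problem.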

Thanks, everyone, for your help.

Answer 2

Try this:

        $oHtml = str_get_html($html);
        foreach($oHtml->find('h1') as $element)
        {
            echo $element->innertext;
        }

You can also use a regular-expression function that returns an array of the innertext of all h1 tags:

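    function getH1($yourhtml)
    {
        // Capture the contents of every <h1>...</h1> pair, case-insensitively.
        // The "s" flag lets "." span newlines; ".*?" keeps each match minimal.
        preg_match_all('/<h1\b[^>]*>(.*?)<\/h1>/is', $yourhtml, $patterns);
        // Return the list of inner texts followed by how many were found.
        return array($patterns[1], count($patterns[1]));
    }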
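
As a rough illustration of the regular-expression approach on uppercase tags, here is a self-contained sketch. The helper name extractH1Texts and the sample markup are hypothetical and not part of simple_html_dom; the pattern relies on the "i" modifier so that H1 and h1 are treated alike, and it is only suitable for simple, well-formed headings.

```php
<?php
// Hypothetical helper for illustration; not part of simple_html_dom.
function extractH1Texts(string $html): array
{
    // "i" makes the tag name case-insensitive, "s" lets "." match newlines,
    // and the lazy ".*?" stops at the first closing tag after each opening tag.
    preg_match_all('~<h1\b[^>]*>(.*?)</h1>~is', $html, $matches);

    // Drop any nested markup and surrounding whitespace from each capture.
    return array_map('trim', array_map('strip_tags', $matches[1]));
}

$sample = "<H1>First</H1>\n<h1 class=\"x\"> <span>Second</span> </h1>";
$texts = extractH1Texts($sample);
print_r($texts);
```

Both headings are returned regardless of the case used in the markup. A regular expression like this is a diagnostic aid rather than a replacement for a real parser, since it will misbehave on nested or malformed headings.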

simple_html_dom not returning <h1> elements?
