Selecting HTML content with PHP

Time: 2016-03-19 19:47:47

Tags: php

I want to get the paragraph under this tag:

Here's an image

I tried:

<?php

    $doc = new DOMDocument();
    $doc->loadHTMLFile("https://sabq.org/xMQjz2");

    $elements = $doc->getElementsByTagName('p');

    if (!is_null($elements)) {

        foreach ($elements as $element) {

            $nodes = $element->childNodes;
            foreach ($nodes as $node) {
                  echo $node->textContent. "\n";
            }
        } 
   }

?>

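I get the paragraph I want along with paragraphs I don't need, and they are duplicated.

Edit: I changed the URL; hopefully it works now.

1 answer:

Answer 0: (score: 0)

The link you provided throws an error when accessed, so I found a function that fetches the page content with curl instead of the DOMDocument class you are using.

I used preg_match with a regex to extract the specific element you are looking for.

Here is the code: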

    <?php

    //fetch the page
    $content = get_fcontent("https://sabq.org/%D8%B4%D8%A7%D9%87%D8%AF-%D8%A3%D9%84%D9%81-%D8%B5%D9%81%D8%AD%D8%A9-%D8%AA%D8%B1%D9%88%D9%8A-%D9%82%D8%B5%D8%B5-%D8%A7%D9%84%D8%AD%D8%B1%D9%85%D9%8A%D9%86-%D9%85%D9%86%D8%B0-%D8%A7%D9%86%D8%B7%D9%84%D8%A7%D9%82-%D8%A7%D9%84%D8%B9%D9%87%D8%AF-%D8%A7%D9%84%D8%B3%D8%B9%D9%88%D8%AF%D9%8A");

    //extract specific html tag and its innerHTML
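    //the "s" modifier lets "." match newlines, so a paragraph spanning several lines is still found
    if (preg_match('/<p .*? ng\-bind\-html\=\"getContent\(material\.content\)\" .*?>.*?<\/p>/s', $content[0], $matches)) {
        //display the wanted element
        echo $matches[0];
    }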

    //fetch the contents with curl, because opening the URL directly threw an error: failed to open stream
    function get_fcontent( $url,  $javascript_loop = 0, $timeout = 5 ) {
        $url = str_replace( "&amp;", "&", urldecode(trim($url)) );

        $cookie = tempnam ("/tmp", "CURLCOOKIE");
        $ch = curl_init();
        curl_setopt( $ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1" );
        curl_setopt( $ch, CURLOPT_URL, $url );
        curl_setopt( $ch, CURLOPT_COOKIEJAR, $cookie );
        curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true );
        curl_setopt( $ch, CURLOPT_ENCODING, "" );
        curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
        curl_setopt( $ch, CURLOPT_AUTOREFERER, true );
        curl_setopt( $ch, CURLOPT_SSL_VERIFYPEER, false );    # disables certificate verification; insecure, for testing only
        curl_setopt( $ch, CURLOPT_CONNECTTIMEOUT, $timeout );
        curl_setopt( $ch, CURLOPT_TIMEOUT, $timeout );
        curl_setopt( $ch, CURLOPT_MAXREDIRS, 10 );
        $content = curl_exec( $ch );
        $response = curl_getinfo( $ch );
        curl_close ( $ch );

        if ($response['http_code'] == 301 || $response['http_code'] == 302) {
            ini_set("user_agent", "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1");

            if ( $headers = get_headers($response['url']) ) {
                foreach( $headers as $value ) {
                    //follow the Location header by calling this function again
                    if ( substr( strtolower($value), 0, 9 ) == "location:" )
                        return get_fcontent( trim( substr( $value, 9, strlen($value) ) ) );
                }
            }
        }

        //follow a JavaScript redirect, at most 5 times
        if (    ( preg_match("/>[[:space:]]+window\.location\.replace\('(.*)'\)/i", $content, $value) || preg_match("/>[[:space:]]+window\.location\=\"(.*)\"/i", $content, $value) ) && $javascript_loop < 5) {
            return get_fcontent( $value[1], $javascript_loop+1 );
        } else {
            return array( $content, $response );
        }
    }
    ?>

To test, I created a local file named test.html:

<!DOCTYPE html>
<html>
<head>
    <title></title>
</head>
<body>
<p>This should not be showing.</p>
<p  ng-bind-html="getContent(material.content)" id="dev-content" class="details-text">This is a test.</p>
</body>
</html>

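For testing purposes, I used the local URL http://localhost/example/test.html instead of the link you provided.

From the local file I created for testing, I got the following result: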

<p  ng-bind-html="getContent(material.content)" id="dev-content" class="details-text">This is a test.</p>
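
The same element can probably also be picked out without a regex, by parsing the markup with DOMDocument and filtering on the attribute with DOMXPath. What follows is only a minimal sketch under a few assumptions: the page has already been fetched into a string, the function name find_bound_paragraphs is hypothetical, and the inline markup is a copy of the test.html content shown earlier.

```php
<?php
// Hypothetical helper (not part of the original answer): returns the trimmed
// text of every <p> whose ng-bind-html attribute equals $binding.
function find_bound_paragraphs(string $html, string $binding): array {
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $texts = [];
    // Select every <p> that has the attribute, then compare its value in PHP.
    foreach ($xpath->query('//p[@ng-bind-html]') as $p) {
        if ($p->getAttribute('ng-bind-html') === $binding) {
            $texts[] = trim($p->textContent);
        }
    }
    return $texts;
}

// Copy of the test.html markup, collapsed onto a few lines.
$html = '<!DOCTYPE html><html><head><title></title></head><body>'
      . '<p>This should not be showing.</p>'
      . '<p  ng-bind-html="getContent(material.content)" id="dev-content" class="details-text">This is a test.</p>'
      . '</body></html>';

print_r(find_bound_paragraphs($html, 'getContent(material.content)'));
```

Reading textContent once per matching paragraph, rather than once per child node, also avoids printing unrelated paragraphs.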

This is the result I got from the original URL:

<p  ng-bind-html="getContent(material.content)" id="dev-content" class="details-text"></p>
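
The paragraph from the original URL comes back empty. A likely explanation is that ng-bind-html is an AngularJS directive, so the text is injected by JavaScript in the browser and is not present in the HTML that the server returns. A small check like the one below can flag that situation. It is a sketch only: the helper name is_empty_element is hypothetical, and the two strings are copies of the results shown above.

```php
<?php
// Hypothetical helper: true when the matched element contains no text of its
// own, which suggests the content is filled in later by client-side script.
function is_empty_element(string $elementHtml): bool {
    return trim(strip_tags($elementHtml)) === '';
}

// Result from the local test file.
$local  = '<p  ng-bind-html="getContent(material.content)" id="dev-content" class="details-text">This is a test.</p>';
// Result from the original URL.
$remote = '<p  ng-bind-html="getContent(material.content)" id="dev-content" class="details-text"></p>';

var_dump(is_empty_element($local));   // bool(false)
var_dump(is_empty_element($remote));  // bool(true)
```

If the check returns true for the live page, neither curl nor DOMDocument will see the article text in that element, and the data would have to come from wherever the page's script loads it.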

I hope this helps!