PHP: scrape the src of every iframe from a website's sitemap

Date: 2016-08-11 09:42:37

Tags: php arrays simple-html-dom

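I am trying to crawl a website in order to build a list of the sources of all the iframes on it. To do this, I am using the simple_html_dom PHP library.

I parse each link and search the linked page for iframes. Where an iframe exists, I want the code to give me its src and the page it was found on.

What I am doing is: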

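  1. Grab all the links on the target page.
  2. Collect them into one big array, to avoid crashing the server.
  3. All the links are relative, so I prepend the main site URL to each one.
  4. I loop over all the URLs and search each page for iframes.
  5. What happens is that it works for the first 20 rows, and then I get this error: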

    Warning: file_get_contents(): php_network_getaddresses: getaddrinfo failed: No such host is known. in C:\xampp\htdocs\scrap\simple_html_dom.php on line 75
    
    Warning: file_get_contents(http://www.achva.ac.ilhttp://www.achva.ac.il/לימודי-תעודה-והשתלמויות): failed to open stream: php_network_getaddresses: getaddrinfo failed: No such host is known. in C:\xampp\htdocs\scrap\simple_html_dom.php on line 75
    
    Fatal error: Call to a member function find() on a non-object in C:\xampp\htdocs\scrap\scrap.php on line 61
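
A note on the second warning: the URL in it contains the host twice (`http://www.achva.ac.ilhttp://www.achva.ac.il/...`), which suggests that at least some hrefs in the sitemap are already absolute, so prepending the main URL to every link produces an address that cannot be resolved. Below is a minimal sketch of one way to handle both cases. `resolveHref` is a hypothetical helper name (it is not part of simple_html_dom), and the sketch assumes that every non-absolute href is root-relative; fragment links and `mailto:` links are not handled.

```php
<?php
// Hypothetical helper (not in the original script): join a base URL and an
// href without doubling the host.
function resolveHref($base, $href)
{
    $href = trim($href);

    // Already absolute (http:// or https://): return it unchanged.
    if (preg_match('#^https?://#i', $href)) {
        return $href;
    }

    // Protocol-relative (//host/path): reuse the scheme of the base URL.
    if (strpos($href, '//') === 0) {
        return parse_url($base, PHP_URL_SCHEME) . ':' . $href;
    }

    // Assumed root-relative: join with exactly one slash.
    return rtrim($base, '/') . '/' . ltrim($href, '/');
}

// Both calls print http://www.achva.ac.il/sitemap
echo resolveHref('http://www.achva.ac.il', '/sitemap'), "\n";
echo resolveHref('http://www.achva.ac.il', 'http://www.achva.ac.il/sitemap'), "\n";
```

In the link-collecting loop, the result of `resolveHref('http://www.achva.ac.il', $link->getAttribute('href'))` could be pushed onto the array instead of the plain concatenation.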
    

For some reason I keep getting the error above.

Here is my code:

    <!DOCTYPE html>
    <html>
    <head>
        <title></title>
    
        <style type="text/css">
            th{
                font-weight: 800;
                border: 1px solid lightblue;
            }
            td{
                border: 1px solid lightblue;
            }
        </style>
    </head>
    <body>
    <?php 
    
    $html = file_get_contents('http://www.achva.ac.il/sitemap');
    //Create a new DOM document
    $dom = new DOMDocument;
    
    //Parse the HTML. The @ is used to suppress any parsing errors
    //that will be thrown if the $html string isn't valid XHTML.
    @$dom->loadHTML($html);
    
    //Get all links. You could also use any other tag name here,
    //like 'img' or 'table', to extract other tags.
    $links = $dom->getElementsByTagName('a');
    
    //Iterate over the extracted links and display their URLs
    $arr = [];
    foreach ($links as $link){
        array_push($arr, 'http://www.achva.ac.il'.$link->getAttribute('href'));
    }
    
    
    $result = count($arr);
    echo $result;
    ?>
    
    <?php  
    
    
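    // Returns true when the URL answers with "HTTP/1.1 200 OK".
    function urlOk($url) {
        $headers = @get_headers($url);
        // get_headers() returns false on failure, so check before indexing.
        return is_array($headers) && $headers[0] == 'HTTP/1.1 200 OK';
    }
    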
    include('simple_html_dom.php');
    
    echo '<table><tr><th>id</th><th>Video src</th><th>Site page</th></tr>';
    $i = 19;
    $page_number = 1;
    
    foreach($arr as $urlx){
    
        echo $urlx;
    
            $scrap_url = file_get_html($urlx);
    
            if (preg_match('#^https?://(?:[^.]+\.)*achva\.ac\.il/#i', $urlx))
            {
    
    
            $div = $scrap_url->find('iframe');
            if($div){
    
                foreach ($div as $key) {
    
                    echo '<tr><td>' . $i . '</td>';
    
                    $src = $key->attr['src'];
                    echo '<td>' . $src . '</td>';
                    echo '<td>' .$urlx . '</td></tr>';
                    $page_number++;
    
                }
    
            }else{
                echo '<tr><td>' . $i . '</td><td>no iframe in this page</td><td>' . $urlx . '</td></tr>';
    
            }
    
            $i++;
        }
    }
    ?>
    </table>
    </body>
    </html>
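
The fatal error says that `find()` was called on something that is not an object, which is consistent with the download in the preceding warnings having failed, so that `$scrap_url` held no parsed document. The sketch below shows one way to guard against that. It assumes the built-in `DOMDocument` class (already used in the first part of the script) is acceptable in place of simple_html_dom for this step; `iframeSources` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: collect the src attribute of every <iframe> in an
// HTML string. A failed download (false or '') yields an empty list instead
// of a fatal error.
function iframeSources($html)
{
    if (!is_string($html) || $html === '') {
        return [];
    }

    $dom = new DOMDocument;
    // The @ silences warnings about invalid markup, as in the original script.
    @$dom->loadHTML($html);

    $sources = [];
    foreach ($dom->getElementsByTagName('iframe') as $iframe) {
        $src = $iframe->getAttribute('src');
        if ($src !== '') {
            $sources[] = $src;
        }
    }
    return $sources;
}

// Inline sample markup, so the sketch runs without network access.
$sample = '<html><body><iframe src="https://example.com/video"></iframe><p>text</p></body></html>';
print_r(iframeSources($sample));

// A failed download (false) yields an empty list.
var_dump(iframeSources(false));
```

Inside the main loop, the page could be fetched with `@file_get_contents($urlx)` and the result passed straight to `iframeSources()`; a page that fails to load then simply shows up as having no iframes rather than stopping the script.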
    

0 answers:

No answers yet.