Question

include('simple_html_dom.php');

function curl_set($url){
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $result = curl_exec($ch);
    return $result;
}

$curl_scraped_page = curl_set('http://www.belmontwine.com/site-map.html');
$html = new simple_html_dom();
$html->load($curl_scraped_page, true, false);

$i = 0;
$ab = array();
$files = array();
foreach($html->find('td[class=site-map]') as $td) {
    foreach($td->find('li a') as $a) {
        if($i<=2){
            $ab = 'http://www.belmontwine.com'.$a->href;
            $html = file_get_html($ab);
            foreach($html->find('td[class=pageheader]') as $file) {
                $files[] = $file->innertext;
            }
        }
        else{
            //exit();
        }
        $i++;
    }
    $html->clear();
}

print_r($files);

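Above is my code. I need help scraping a website using PHP.

The $ab variable contains the URLs scraped from the website. I want to scrape data from those URLs. I don't know what is wrong with the script. The desired output is the content at the URLs passed in $ab, but it does not return anything; it just loops continuously.

I need help.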

Answer 1

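You have a runaway program because once you get into the if($i<=2) part, you never increment the $i variable. Right now your $i++ is in the wrong place. I don't know why you want to limit the finds to 3 or fewer, but you need to remember to reset the $i variable to 0, which you are not doing at all.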
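To make the counter handling concrete, here is a minimal sketch of capping a nested loop at three items. It uses plain arrays as stand-ins for the cells and links that simple_html_dom would return, so the names $cells, $limit, $count, and $picked and the sample paths are assumptions for illustration, not part of the original script.

```php
<?php
// Hypothetical stand-in data: each inner array plays the role of the
// links found inside one td[class=site-map] cell.
$cells = [
    ['/a.html', '/b.html'],
    ['/c.html', '/d.html', '/e.html'],
];

$limit  = 3;   // process at most this many links in total
$count  = 0;   // counts links across both loops
$picked = [];

foreach ($cells as $links) {
    foreach ($links as $href) {
        if ($count >= $limit) {
            break 2; // leave both loops once the limit is reached
        }
        $picked[] = 'http://www.belmontwine.com' . $href;
        $count++;    // incremented right next to the work it counts
    }
}

print_r($picked);
```

Using break 2 stops the iteration outright once the cap is hit, instead of walking the remaining links and skipping them one by one.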

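Edit

I have not used the 'simple_html_dom.php' class, so I don't know it well. I also don't know what you are trying to do with each link found, and I can't do the work for you. I came up with this sample PHP script, which grabs all the links on the site-map page. It creates an array made up of each link's title and href path. The last foreach loop only prints the array for now, but you can use that loop to process each path found.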

include('simple_html_dom.php');

$files = array();
$html = file_get_html('http://www.belmontwine.com/site-map.html');

foreach($html->find('td[class=site-map]') as $td) {
    foreach($td->find('li a') as $a) {
        if($a->plaintext != '') {
            $files["$a->plaintext"] = "http://www.belmontwine.com/$a->href";
        }
    }
}

// To print $files array or to process each link found
foreach($files as $title => $path) {
    echo('Title: ' . $title . ' - Path: ' . $path . '<br>' . PHP_EOL);
}
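As a rough idea of what "processing each path" might look like, the sketch below pulls the text of every td with class pageheader out of a page, which is what the question's code was after. It deliberately swaps in PHP's built-in DOMDocument and DOMXPath rather than simple_html_dom, so that it runs on its own. The helper name extractPageHeaders() and the inline sample markup are assumptions, and it returns plain text rather than the inner HTML that innertext gives.

```php
<?php
// Sketch only: uses PHP's built-in DOM extension, not simple_html_dom.
// extractPageHeaders() is a hypothetical helper name.
function extractPageHeaders(string $markup): array
{
    // Real-world HTML is often malformed; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($markup);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $headers = [];
    // Matches cells whose class attribute is exactly "pageheader".
    foreach ($xpath->query('//td[@class="pageheader"]') as $cell) {
        $headers[] = trim($cell->textContent);
    }
    return $headers;
}

// Inline sample markup stands in for a page fetched from one $path.
$sample = '<html><body><table>'
        . '<tr><td class="pageheader"> Red Wines </td></tr>'
        . '<tr><td class="other">ignored</td></tr>'
        . '</table></body></html>';

print_r(extractPageHeaders($sample));
```

Inside the final foreach loop, the markup for each $path would come from whatever fetch the script already uses, for example the curl_set() function from the question.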

Also, not all of the links are HTML files; at least one is a PDF, so be sure to test for that in your code.
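One simple way to test for that is to look at the file extension in each URL before fetching it. The sketch below is a heuristic under stated assumptions: isHtmlLink() is a hypothetical helper, the list of accepted extensions is a guess, and the PDF path in the sample data is made up. Checking the Content-Type header of the response would be more reliable where the extension is missing or misleading.

```php
<?php
// isHtmlLink() is a hypothetical helper; the extension list is an assumption.
function isHtmlLink(string $url): bool
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path) || $path === '') {
        return true; // no path at all: treat it as the site's home page
    }
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    // No extension usually means a directory or a rewritten URL.
    return $ext === '' || in_array($ext, ['html', 'htm', 'php'], true);
}

// Same shape as the $files array above: title => path.
$links = [
    'Site Map' => 'http://www.belmontwine.com/site-map.html',
    'Catalog'  => 'http://www.belmontwine.com/catalog.pdf', // made-up example path
];

// array_filter keeps the original keys, so titles stay attached to paths.
$htmlOnly = array_filter($links, 'isHtmlLink');
print_r(array_keys($htmlOnly));
```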
