Question

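I'm trying to create a PHP page that reads the source of a web page, finds all the links, and then, for each individual link (if it points to an HTML page), automatically downloads the file to my computer (preferably without asking where to save it...).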

Here is my code:

<?php

$srcUrl= 'http://www.justdogbreeds.com/all-dog-breeds.html';

$html = file_get_contents($srcUrl);

$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the links on the page
$xpath = new DOMXPath($dom);

//finding the a tag
$hrefs = $xpath->evaluate("/html/body//a");

$testo = '<table width="100%" border="1" cellspacing="2" cellpadding="2" summary="layout">
  <caption>
    List of links
  </caption>
  <tr>
    <th scope="col">&nbsp;</th>
        <th scope="col">&nbsp;</th>
  </tr>';

//Loop to display all the links and download
for ($i = 0; $i < $hrefs->length; $i++) {

       $href = $hrefs->item($i);
       $url = $href->getAttribute('href');

 //if real link
       if($url!='#')  

       {

 //Code to get the file...
 $data = file_get_contents($url);

 //save as?
 $filename = $url;

 /*save the file...
 $fh = fopen($filename,"w");
 fwrite($fh,$data);
 fclose($fh);*/

        $hfile = fopen($data ,"r");
        if($hfile){
            while(!feof($hfile)){
                $html=fgets($hfile,1024);
            }
        }
 $fh = fopen($filename,"w");
 fwrite($fh,$html);
 fclose($fh);

//download automatically (better if without asking where... maybe in download folder)
header('Content-disposition: attachment; filename=' . $filename);
header("Content-Type: application/force-download");
header('Content-type: text/html');

 //display link to the file you just saved...
    $testo.='<tr>
    <td>'.$url.'</td>
    <td></td>
    </tr>';
       }

}

$testo.='</table>';

echo $testo;

?>

What am I doing wrong? Thanks.

Answer 1

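You're getting several things mixed up. Here is what your current code does:

1. Load the content of the original URL
2. Find the links
3. For each link:
3.1. Download the content ($data = file_get_contents($url);). This part is fine.
3.2. Open a new file for reading ($hfile = fopen($data ,"r");). I'm not sure why you need this. It effectively does nothing, because the name of the file you are trying to open is the content downloaded in 3.1. You don't need to read anything at all: you already have the content of the URL.
3.3. Write the content of the file you just read (the lines from $fh = fopen to fclose). There is a problem here: the name of the file you are trying to create is a URL (e.g. http://somedomain.sometld/somefile.html?t=1&r=2), and you can't create a file with that name. You need to generate a random file name.
3.4. Send headers telling the browser to download an HTML file named after the file you just saved. There are two problems here. First, the headers are sent once for every link found on the page, which you don't need; you only need to send them once. Second, the file name has the same problem as in 3.3.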
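As an alternative to a purely random name, the file name can be derived from the URL itself, so you can still tell which file came from which link. This is only a minimal sketch: urlToFilename is a hypothetical helper that is not part of the code in this thread, and the 80-character limit and the 8-character hash suffix are arbitrary choices.

```php
<?php
// Hypothetical helper (an assumption, not from the original code):
// turn an arbitrary URL into a name that is safe to create on disk.
function urlToFilename(string $url): string
{
    // Replace every run of characters that is not a letter, digit,
    // dot or hyphen with a single underscore
    $safe = preg_replace('/[^A-Za-z0-9.\-]+/', '_', $url);

    // Keep the readable part short and append a short hash of the full URL,
    // so that two different URLs are unlikely to end up with the same name
    return substr(trim($safe, '_'), 0, 80) . '-' . substr(md5($url), 0, 8) . '.html';
}

echo urlToFilename('http://somedomain.sometld/somefile.html?t=1&r=2'), "\n";
```

The result could then be passed to file_put_contents() in place of the URL.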

I made some changes to your code, and it should work:

<?php
$srcUrl= 'http://www.justdogbreeds.com/all-dog-breeds.html';

$html = file_get_contents($srcUrl);

$dom = new DOMDocument();
@$dom->loadHTML($html);

// grab all the links on the page
$xpath = new DOMXPath($dom);

//finding the a tag
$hrefs = $xpath->evaluate("/html/body//a");

$testo = '<table width="100%" border="1" cellspacing="2" cellpadding="2" summary="layout">
  <caption>
    List of links
  </caption>
  <tr>
    <th scope="col">&nbsp;</th>
        <th scope="col">&nbsp;</th>
  </tr>';

// send the download headers once, before any output
$filename = 'list-of-links.html';
header('Content-Disposition: attachment; filename="' . $filename . '"');
header('Content-Type: text/html');

//Loop to display all the links and download
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    //if real link
    if($url!='#') {
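        //Code to get the file...
        $data = file_get_contents($url);
        if ($data === false) {
            continue; // skip links that could not be downloaded
        }

        //save under a random name, because a URL is not a valid file name
        $filename = mt_rand(10000000, 90000000) . ".html";
        file_put_contents($filename, $data);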

        //display link to the file you just saved...
        $testo.='<tr>
        <td>'.$url.'</td>
        <td></td>
        </tr>';
    }
}
$testo.='</table>';
echo $testo;
?>

I'd suggest adding a sleep of a few seconds after each request, to make sure you don't put too much load on the server.
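A rough sketch of what that pause could look like. downloadAll is a hypothetical helper, not part of the code above, and the 2-second default is just an assumption; the $fetch parameter exists only so the loop can be tried without network access, and defaults to file_get_contents.

```php
<?php
// Hypothetical helper (an assumption, not from the original answer):
// download each URL into $dir, pausing between requests.
function downloadAll(array $urls, string $dir, int $delaySeconds = 2, ?callable $fetch = null): array
{
    $fetch = $fetch ?? 'file_get_contents';
    $saved = [];

    foreach (array_values($urls) as $i => $url) {
        if ($i > 0) {
            sleep($delaySeconds); // wait before every request except the first
        }

        $data = @$fetch($url);
        if ($data === false) {
            continue; // skip links that could not be downloaded
        }

        // random file name, as in the code above
        $filename = $dir . '/' . mt_rand(10000000, 90000000) . '.html';
        file_put_contents($filename, $data);
        $saved[$url] = $filename;
    }

    return $saved; // map of URL => saved file name
}
```

It could be called as downloadAll($links, __DIR__, 3), where $links is the array of href values collected from the page.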

Can't automatically download all links as separate HTML files using PHP
