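I'm trying to build a PHP page that reads a web page's source, finds all the links, and then, for each individual link (if it points to an HTML page), automatically downloads the file to my computer (preferably without asking where to save it...).
Here is my code: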
<?php
$srcUrl= 'http://www.justdogbreeds.com/all-dog-breeds.html';
$html = file_get_contents($srcUrl);
$dom = new DOMDocument();
@$dom->loadHTML($html);
// grab all the links on the page
$xpath = new DOMXPath($dom);
//finding the a tag
$hrefs = $xpath->evaluate("/html/body//a");
$testo = '<table width="100%" border="1" cellspacing="2" cellpadding="2" summary="layout">
<caption>
List of links
</caption>
<tr>
<th scope="col"> </th>
<th scope="col"> </th>
</tr>';
//Loop to display all the links and download
for ($i = 0; $i < $hrefs->length; $i++) {
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
//if real link
if($url!='#')
{
//Code to get the file...
$data = file_get_contents($url);
//save as?
$filename = $url;
/*save the file...
$fh = fopen($filename,"w");
fwrite($fh,$data);
fclose($fh);*/
$hfile = fopen($data ,"r");
if($hfile){
while(!feof($hfile)){
$html=fgets($hfile,1024);
}
}
$fh = fopen($filename,"w");
fwrite($fh,$html);
fclose($fh);
//download automatically (better if without asking where... maybe in download folder)
header('Content-disposition: attachment; filename=' . $filename);
header("Content-Type: application/force-download");
header('Content-type: text/html');
//display link to the file you just saved...
$testo.='<tr>
<td>'.$url.'</td>
<td></td>
</tr>';
}
}
$testo.='</table>';
echo $testo;
?>
What am I doing wrong?
Thanks
Answer 0 (score: 1)
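You're mixing up several things. Here is what your current code does for each link:
1. $data = file_get_contents($url); fetches the content of the link. This is fine.
2. $hfile = fopen($data ,"r"); I'm not sure why you need this. It effectively does nothing, because the name of the file you are trying to open is the content fetched in step 1, and you don't need to read anything anyway: you already have the content of the URL.
3. The lines from $fh = fopen to fclose save the file, but there is a problem here: the name of the file you are trying to create is a URL (e.g. http://somedomain.sometld/somefile.html?t=1&r=2), and you can't create a file with that name. You need to generate a random file name instead.
I made some changes to your code; it should work: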
<?php
$srcUrl= 'http://www.justdogbreeds.com/all-dog-breeds.html';
$html = file_get_contents($srcUrl);
$dom = new DOMDocument();
@$dom->loadHTML($html);
// grab all the links on the page
$xpath = new DOMXPath($dom);
//finding the a tag
$hrefs = $xpath->evaluate("/html/body//a");
$testo = '<table width="100%" border="1" cellspacing="2" cellpadding="2" summary="layout">
<caption>
List of links
</caption>
<tr>
<th scope="col"> </th>
<th scope="col"> </th>
</tr>';
$filename = 'list-of-links.html';
header('Content-disposition: attachment; filename=' . $filename);
header("Content-Type: application/force-download");
header('Content-type: text/html');
//Loop to display all the links and download
for ($i = 0; $i < $hrefs->length; $i++) {
$href = $hrefs->item($i);
$url = $href->getAttribute('href');
//if real link
if($url!='#') {
//Code to get the file...
$data = file_get_contents($url);
//save as?
$filename = mt_rand(10000000, 90000000) . ".html";
file_put_contents($filename, $data);
//display link to the file you just saved...
$testo.='<tr>
<td>'.$url.'</td>
<td></td>
</tr>';
}
}
$testo.='</table>';
echo $testo;
?>
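Random names like the ones above can collide, and they lose the connection to the page they came from. As an alternative, here is a minimal sketch that derives a file name from the URL itself. The helper name filenameFromUrl, the 8-character md5 suffix, and the "index" fallback are assumptions made for illustration; they are not part of the original code.

```php
<?php
// Hypothetical helper (not from the original answer): build a
// filesystem-safe, repeatable file name from a URL.
function filenameFromUrl(string $url): string
{
    // Take the last path segment without its extension, e.g. "somefile".
    $path = (string) parse_url($url, PHP_URL_PATH);
    $base = pathinfo($path, PATHINFO_FILENAME);

    // Replace every run of characters outside a conservative whitelist.
    $base = preg_replace('/[^A-Za-z0-9_-]+/', '_', $base);
    if ($base === '' || $base === null) {
        $base = 'index'; // assumed fallback for URLs with no usable path
    }

    // A short hash of the full URL keeps names distinct when two links
    // differ only in their query string.
    return $base . '-' . substr(md5($url), 0, 8) . '.html';
}

echo filenameFromUrl('http://somedomain.sometld/somefile.html?t=1&r=2'), "\n";
```

Inside the loop, this would replace the mt_rand() line: $filename = filenameFromUrl($url);. The same URL then always maps to the same file, so re-running the script overwrites earlier downloads instead of piling up new ones.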
I'd suggest adding a sleep of a few seconds after each request, so you don't put too much load on the server.
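The pause between requests could look roughly like the sketch below. The function name downloadAll, the 2-second default, and the injectable $fetch callback are assumptions for illustration; the callback is only there so the loop can be tried without touching the network.

```php
<?php
// Hypothetical helper: fetch each URL in turn, pausing between requests.
// $fetch defaults to file_get_contents(); pass your own callable to test
// the loop offline.
function downloadAll(array $urls, int $delaySeconds = 2, ?callable $fetch = null): array
{
    $fetch = $fetch ?? static function (string $url) {
        // @ suppresses the warning; a failed request yields false.
        return @file_get_contents($url);
    };

    $results = [];
    $urls = array_values($urls);
    $last = count($urls) - 1;

    foreach ($urls as $i => $url) {
        // Store the body, or false if the request failed.
        $results[$url] = $fetch($url);

        // Pause between requests, but not after the final one.
        if ($i < $last && $delaySeconds > 0) {
            sleep($delaySeconds);
        }
    }

    return $results;
}
```

Calling downloadAll($listOfUrls, 3) returns an array keyed by URL. Entries that are false mark links that could not be fetched, so they can be skipped before writing anything to disk.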