PHP Simple HTML DOM Parser: errors when fetching images

Time: 2018-01-28 09:21:18

Tags: php html css

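I am trying to download all images from a given URL with Simple HTML DOM Parser. The images are saved to my folder successfully (I created the folder "hinh" beforehand), but afterwards the script still reports many errors about file_get_contents() and file_put_contents().

The errors shown on my screen: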

Warning: file_get_contents(): Filename cannot be empty in D:\xampp\htdocs\cra\index.php on line 11

Warning: file_put_contents(hinh/): failed to open stream: No such file or directory in D:\xampp\htdocs\cra\index.php on line 11

I created the folder "hinh" beforehand, but the error still appears. Here is my code:

<?php
    include('simple_html_dom.php'); // PHP Simple HTML DOM Parser, used to read elements from the linked page
    // I want to be able to process several links, so I made a simple form to enter the link.
    // The code inside the braces only runs when the button was clicked.
    // To be safe, $_POST['url'] should be checked as well.
    if (isset($_POST['submit'])) {
        $url = file_get_html($_POST['url']); // Load the page at the URL typed into the text field
        $image = $url->find("img");          // Find all <img> tags on that page
        foreach ($image as $img) {           // Go through every <img> on the target page
            $s = $img->src;                  // Link of the image

            // The important step: the save function needs the file name of the image, not the whole link.
            $img_name = 'hinh/'.basename($s);
            file_put_contents($img_name, file_get_contents($s)); // Fetch the image and store it in the folder above
        }
    }
?>

This is index.php:

<form id="form1" name="form1" method="post" action="index.php">
  <table width="700" border="1" align="center" cellpadding="1" cellspacing="1">
    <tr>
      <td colspan="2"><label for="textfield"></label>
      <input style="width:100%;" type="text" name="url" id="textfield" />
      </td>

    </tr>
    <tr>
      <td colspan="2" align="center" valign="middle"><input type="submit" name="submit" id="button" value="Submit" /></td>
    </tr>
  </table>
</form>

1 answer:

Answer 0: (score: 0)

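You are treating every image URL as an absolute URL. Suppose the page you are scraping is at http://example.com/pages/index.html.

This works: <img src="http://example.com/images/1.jpg" /> resolves to http://example.com/images/1.jpg in the browser and to http://example.com/images/1.jpg in your code.

This does not work: <img src="/images/1.jpg" /> resolves to http://example.com/images/1.jpg in the browser, but to /images/1.jpg in your code.

You have to check whether the image src contains a relative or an absolute URL. Otherwise file_get_contents() will look for the image on your own file system, which could expose sensitive data (for example <img src="/etc/shadow" />).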
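One way to do that check is sketched below. The function name resolve_image_url() is hypothetical and not part of Simple HTML DOM; it assumes PHP 7.1 or later and only covers the common cases (absolute, protocol-relative, root-relative and document-relative src values). It does not collapse "../" segments, so treat it as a starting point rather than a complete URL resolver.

```php
<?php
// Hypothetical helper (not part of Simple HTML DOM): turn an <img> src into an
// absolute URL, using the URL of the scraped page as the base.
// Returns null when the src should be skipped.
function resolve_image_url(string $pageUrl, string $src): ?string
{
    $src = trim($src);
    if ($src === '') {
        return null; // empty src, nothing to download
    }
    // Already an absolute http(s) URL: use it unchanged.
    if (preg_match('#^https?://#i', $src)) {
        return $src;
    }
    $base = parse_url($pageUrl);
    if ($base === false || empty($base['scheme']) || empty($base['host'])) {
        return null; // the page URL cannot serve as a base
    }
    // Protocol-relative, e.g. //cdn.example.com/a.jpg
    if (strpos($src, '//') === 0) {
        return $base['scheme'] . ':' . $src;
    }
    // Any other scheme (data:, javascript:, file:, ...) is skipped.
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $src)) {
        return null;
    }
    $origin = $base['scheme'] . '://' . $base['host']
            . (isset($base['port']) ? ':' . $base['port'] : '');
    // Root-relative, e.g. /images/1.jpg
    if ($src[0] === '/') {
        return $origin . $src;
    }
    // Document-relative, e.g. thumb/2.jpg: resolve against the page's directory.
    $path = $base['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $src;
}
```

In the loop you would then call resolve_image_url($_POST['url'], $img->src) and only pass the result to file_get_contents() when it is not null, so a src such as /etc/shadow is never read from the local disk.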

Edit:

The example page contains some img tags with an empty src attribute; those images are loaded with JavaScript. You can skip them like this:

<?php
    include('simple_html_dom.php');
    if(isset($_POST['submit'])) {
        $url=file_get_html($_POST['url']);
        $image = $url->find("img");
        foreach($image as $img) {
            if(!empty($img->src)) {
                $s=$img->src;
                $img_name = 'hinh/'.basename($s);
                file_put_contents($img_name, file_get_contents($s));
            }
        }
    }   
?>

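Note that while testing I found that in some cases Simple HTML DOM could not load the remote file; it is intended for loading local files. So I used the following instead, which may be a more stable solution: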

$html  = file_get_contents($_POST['url']);
$url   = str_get_html($html);
$image = $url->find("img");
...
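
However the HTML is loaded, the "file_put_contents(hinh/)" warning from the question appears whenever basename() produces an empty file name. A minimal sketch of a guard for that case follows; local_image_path() is a hypothetical helper name, and it assumes PHP 7.1 or later. It also drops any query string, so a src like 1.jpg?v=2 is saved as 1.jpg.

```php
<?php
// Hypothetical helper: build the local target path for a downloaded image,
// or return null when the URL contains no usable file name.
function local_image_path(string $dir, string $imageUrl): ?string
{
    // Look only at the path component, so a query string does not end up in the file name.
    $path = parse_url($imageUrl, PHP_URL_PATH);
    if (!is_string($path)) {
        return null; // no path at all, or an unparsable URL
    }
    $name = basename($path);
    if ($name === '') {
        return null; // e.g. an empty src or a URL ending in "/"
    }
    return rtrim($dir, '/') . '/' . $name;
}
```

Inside the foreach loop, skip the image when local_image_path('hinh', $s) returns null, and also check that file_get_contents($s) did not return false before calling file_put_contents().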