Question

I am parsing image links from external web pages in a PHP script. This is my pattern:

$pattern = '/<img[^<>]+?src=["\']([^<>]+?)["\']/';

On some pages I found tags like this:

<img class="avatar-32" src="<%= avatar %>" />

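That is why the pattern uses [^<>], and I don't know how to catch other potentially malformed tags.

So I would like to know how to refine my pattern so that it accepts only valid img tags.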

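There are similar concerns:

Is there whitespace between src, = and the opening quote?
Between '<' and img?
Or even a line break?
What if a ' appears inside the src attribute?

How do browsers actually parse these links?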

Note: I did not add file extensions to the pattern, because a link can be any of:

http://www.example.com/img.jpg?1234
http://www.example.com/img.php
http://www.example.com/img/

I also have a relative-to-absolute link converter, so conversion is not an issue.

Answer 1

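You are better off using DOMDocument. It has many useful features for finding links, reading textContent, manipulating the DOM, and so on.

For example, to get the image URLs: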

$dom = new DOMDocument;
// $response is assumed to hold the HTML of a page you fetched (e.g. with cURL).
// The @ suppresses the warnings loadHTML() emits for malformed markup.
@$dom->loadHTML($response);

foreach ($dom->getElementsByTagName('img') as $node) {
    if ($node->hasAttribute('src')) {
        $url = $node->getAttribute('src');
        // You can also run a regex here to validate URLs
        // and skip values like "<%= avatar %>"
        echo $url, '<br>';
    }
}
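
As a rough sketch of the filtering mentioned in the comments above, the snippet below wraps the same DOMDocument loop in a function and skips values that look like unrendered template placeholders. The function name `extractImageUrls`, the sample markup, and the "contains < or >" check are illustrative assumptions, not part of the original answer. It uses `libxml_use_internal_errors()` rather than the `@` operator to silence parser warnings. The sample also includes the cases raised in the question (whitespace around `=`, a line break inside the tag, single quotes).

```php
<?php
// Hypothetical helper: collect the src values of <img> tags, skipping
// values that look like unrendered template placeholders.
function extractImageUrls(string $html): array
{
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = [];
    foreach ($dom->getElementsByTagName('img') as $node) {
        // getAttribute() returns '' when the attribute is missing.
        $src = trim($node->getAttribute('src'));
        // Assumed filter: drop empty values and anything containing < or >.
        if ($src === '' || strpbrk($src, '<>') !== false) {
            continue;
        }
        $urls[] = $src;
    }
    return $urls;
}

$html = <<<'HTML'
<html><body>
<img class="avatar-32" src="<%= avatar %>" />
<img src = "http://www.example.com/img.jpg?1234">
<img
   src='http://www.example.com/img.php'>
<img alt="no source">
</body></html>
HTML;

print_r(extractImageUrls($html));
```

The filter is deliberately crude; a stricter check (for example with `filter_var()` after converting relative links to absolute ones) may suit real pages better.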

These properties and methods are also very useful:

$node->nodeValue   // text content of the node
$node->childNodes  // children of the node, returned as a DOMNodeList
                   // (the same type getElementsByTagName('img') returns)
$node->nodeType    // some nodes in childNodes are text nodes; skip them with:
                   // if ($node->nodeType == XML_ELEMENT_NODE) { /* element node */ }

$nodes->length     // number of nodes in a DOMNodeList
$nodes->item(1)    // second node of a DOMNodeList (indexes start at 0)
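
To make the members listed above concrete, here is a small sketch that walks the children of one element and keeps only the element nodes. The sample markup and the variable names are assumptions chosen for illustration.

```php
<?php
// Illustrative markup: a <div> containing text, an <img> and a <span>.
$html = '<div id="box">text <img src="a.png"> <span>caption</span></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$div = $dom->getElementsByTagName('div')->item(0); // first <div> in the document
$children = $div->childNodes;                      // DOMNodeList, text nodes included

$elementNames = [];
for ($i = 0; $i < $children->length; $i++) {
    $child = $children->item($i);
    // Keep element nodes only; text nodes have a different nodeType.
    if ($child->nodeType === XML_ELEMENT_NODE) {
        $elementNames[] = $child->nodeName;
    }
}

$caption = $dom->getElementsByTagName('span')->item(0)->nodeValue;

echo implode(', ', $elementNames), "\n"; // names of the element children
echo $caption, "\n";                     // text content of the <span>
```

A DOMNodeList can also be iterated with `foreach`, as in the image loop earlier; the indexed form is shown here only to exercise `length` and `item()`.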
