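PHP: extract the first link from a page's source code

Time: 2017-04-22 23:53:34

Tags: php

I am trying to extract the first occurrence of a link that starts like this: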

https://encrypted-tbn3.gstatic.com/images?...

from a page's source code. The link starts and ends with a double quote ("). This is what I have so far:

$search_query = $array[0]['Name'];
$search_query = urlencode($search_query);
$context = stream_context_create(array('http' => array('header' => 'User-Agent: Mozilla compatible')));
$response = file_get_contents( "https://www.google.com/search?q=$search_query&tbm=isch", false, $context);
$html = str_get_html($response);
$url = explode('"',strstr($html, 'https://encrypted-tbn3.gstatic.com/images?'[0]))

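However, the output of $url is not the link I am trying to extract, but something very different. I have added a picture of the output.

Can anyone explain the output to me, and how to get the desired link? Thanks.

2 answers:

Answer 0: (score: 1)

It looks like you are using PHP Simple HTML DOM Parser. I usually use DOMDocument, which is one of PHP's built-in classes. Here is a working example of what you need: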

$search_query = $array[0]['Name'];
$search_query = urlencode($search_query);
$context = stream_context_create(array('http' => array('header' => 'User-Agent: Mozilla compatible')));
$response = file_get_contents( "https://www.google.com/search?q=$search_query&tbm=isch", false, $context);

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($response);

foreach ($dom->getElementsByTagName('img') as $item) {
    $img_src =  $item->getAttribute('src');
    if (strpos($img_src, 'https://encrypted') !== false) {
        print $img_src."\n";
    }
}

Output:

https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSumjp6e37O_86nc36mlktuWpbFuCI4nkkkocoBCYW3qCOicqdu_KEK-MY
https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcR_ttK8NlBgui_JndBj349UxZx0kHn0Z-Essswci-_5UQCmUOruY1PNl3M
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSydaTpSDw2mvU2JRBGEYUOstTUl4R1VhRevv1Sdinf0fxRvU26l3pTuqo
...
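
The loop above prints every matching image, while the question asks only for the first one. A minimal sketch of stopping at the first match follows. The helper name first_img_src_with_prefix and the sample markup are hypothetical, and the prefix is shortened to 'https://encrypted-tbn' on the assumption that the host number (tbn2, tbn3, ...) varies as it does in the output above.

```php
<?php
// Hypothetical helper (not part of the original answer): returns the first
// <img> src that starts with $prefix, or null when nothing matches.
function first_img_src_with_prefix(string $html, string $prefix): ?string
{
    libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $dom = new DOMDocument();
    if ($html === '' || !$dom->loadHTML($html)) {
        return null;
    }
    libxml_clear_errors();

    foreach ($dom->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        if (strpos($src, $prefix) === 0) {
            return $src; // stop at the first match
        }
    }
    return null;
}

// Sample markup standing in for the real response body.
$sample = '<html><body>'
    . '<img src="/logo.png">'
    . '<img src="https://encrypted-tbn2.gstatic.com/images?q=tbn:AAA">'
    . '<img src="https://encrypted-tbn3.gstatic.com/images?q=tbn:BBB">'
    . '</body></html>';

$first = first_img_src_with_prefix($sample, 'https://encrypted-tbn');
```

With the real page, $response from the answer would be passed in place of $sample.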

Answer 1: (score: 0)

$url_beginning = 'https://encrypted-tbn3.gstatic.com/images?';
if(preg_match('/\"(https\:\/\/encrypted\-tbn3\.gstatic\.com\/images\?.+?)\"/ui',$html, $matches))
    $url = $matches[1];
else
    $url = '';

Try using preg_match; it is better suited to this kind of parsing.

This example assumes that the URLs in your HTML are enclosed in double quotes.
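
As for the output of the original attempt: the [0] sits inside the call to strstr, directly after the string literal, so it selects the first character of the needle ("h") rather than the first piece returned by explode. The sketch below reproduces this on a hypothetical one-line page (the $html value is made up for illustration) and shows where the index would need to go if the strstr/explode approach were kept.

```php
<?php
// Hypothetical stand-in for the fetched page source.
$html = '<html><img src="https://encrypted-tbn3.gstatic.com/images?q=tbn:abc"></html>';
$needle = 'https://encrypted-tbn3.gstatic.com/images?';

// As in the question: $needle[0] is just "h", so strstr() returns everything
// from the first "h" in the page, and explode() returns the whole array of pieces.
$broken = explode('"', strstr($html, $needle[0]));

// With the index applied to the result of explode(), the first piece is the
// URL up to its closing quote. strstr() returns false when nothing is found.
$rest = strstr($html, $needle);
$url = ($rest === false) ? '' : explode('"', $rest)[0];
```

Here $broken[0] begins at the "h" of the opening html tag, which would explain why the real output looked nothing like the expected link.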

UPD: a slightly tweaked version that works with any URL beginning:

$url_beginning = 'https://encrypted-tbn3.gstatic.com/images?';
$url_beginning = preg_replace('/([^а-яА-Яa-zA-Z0-9_@%\s])/ui', '\\\\$1', $url_beginning);
if(preg_match('/\"('.$url_beginning.'.+?)\"/ui',$html, $matches))
    $url = $matches[1];
else
    $url = '';
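
As an alternative to escaping the prefix by hand, PHP's built-in preg_quote can do the escaping. The following is only a sketch of that variation, not the answer's own code: the function name first_quoted_url and the sample string are hypothetical, and [^"]* is used in place of .+? so the match cannot run past the closing quote.

```php
<?php
// Hypothetical helper: returns the first double-quoted URL in $html that
// starts with $prefix, or an empty string when there is no match.
function first_quoted_url(string $html, string $prefix): string
{
    // preg_quote() escapes regex metacharacters in the prefix; passing '/'
    // as the second argument also escapes the pattern delimiter.
    $pattern = '/"(' . preg_quote($prefix, '/') . '[^"]*)"/i';
    return preg_match($pattern, $html, $matches) === 1 ? $matches[1] : '';
}

// Sample markup standing in for the real page source.
$sample = '<a href="https://example.com/"></a>'
    . '<img src="https://encrypted-tbn3.gstatic.com/images?q=tbn:BBB">';

$url = first_quoted_url($sample, 'https://encrypted-tbn3.gstatic.com/images?');
```

Like the original, this still assumes the URL is wrapped in double quotes in the page source.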