How do I get the full URL?

Time: 2013-06-29 12:48:20

Tags: php

$html = file_get_contents("any site");
$dom = new domDocument;
@$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
$images = $dom->getElementsByTagName('img');

foreach ($images as $image) {
   echo $image->src;
}

This prints nothing.

$html = file_get_contents("any site");
$dom = new domDocument;
@$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
$images = $dom->getElementsByTagName('img');

foreach ($images as $image) {
   echo $image->getAttribute('src');
}

This returns a relative URL, for example "/images/example.jpg".

$html = file_get_contents("any site");
$dom = new domDocument;
@$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
$images = $dom->getElementsByTagName('img');

foreach ($images as $image) {
   echo $image.src;
}

This gives me:

Fatal error: Call to undefined function getElementsByTagName()

So how can I get the absolute path?

4 answers:

Answer 0 (score: 1)

You can use parse_url to find the base URL:

$url = 'http://www.example.com/path?opt=234';
$parts = parse_url($url);
if (isset($parts['scheme'])){
    $base_url = $parts['scheme'].'://';
} else {
    $base_url = 'http://';
    $parts = parse_url($base_url.$url);
}
$base_url .= $parts['host'];
if (isset($parts['path'])){
    $base_url .= $parts['path'];
}

Then combine it with your code, like this:

$html = file_get_contents("any site");
$dom = new domDocument;
@$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
$images = $dom->getElementsByTagName('img');

foreach ($images as $image) {
   echo $base_url.$image->getAttribute('src');
}
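
One caveat with plain concatenation: when the base URL carries a path and the src starts with a slash, the two paths end up glued together. The following is a minimal sketch of one way around that, assuming a hypothetical helper named origin_of() (it is not part of the answer above) that keeps only the scheme, host and optional port:

```php
<?php
// Hypothetical helper (not from the answer above): returns scheme://host[:port]
// for a page URL, defaulting to http when the scheme is missing.
function origin_of(string $url): string
{
    if (parse_url($url, PHP_URL_SCHEME) === null) {
        $url = 'http://' . $url;
    }
    $parts = parse_url($url);
    $origin = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $origin .= ':' . $parts['port'];
    }
    return $origin;
}

// A root-relative src (leading slash) is resolved against the origin only.
echo origin_of('http://www.example.com/path?opt=234') . '/images/example.jpg', "\n";
```

This only covers src values that begin with a slash; paths without a leading slash would still need the directory of the page.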

Answer 1 (score: 1)

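This code distinguishes src attributes that hold a relative URL from those that hold a full URL. It is more robust than simple string concatenation and also handles relative paths that do not start with a slash, for example images/image.jpg as opposed to /images/image.jpg.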

<?php
$site = 'http://example.com/some/deeply/buried/page.html';
$dir = dirname($site);

$html = file_get_contents($site);
$dom = new domDocument;
@$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
$images = $dom->getElementsByTagName('img');

foreach ($images as $image) {
    // get the img src attribute
    $img_path = $image->getAttribute('src');

    // parse the path into its constituent parts
    $url_info = parse_url($img_path);

    // if the host part (or indeed any part other than "path") is set,
    // then we're dealing with a fully qualified URL (or possibly an error)
    if (!isset($url_info['host'])) {
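        // otherwise, get the relative path
        $path = $url_info['path'];

        if (substr($path,0,1) === '/') {
            // root-relative path: resolve against the scheme and host only
            $img_path = parse_url($site, PHP_URL_SCHEME).'://'.parse_url($site, PHP_URL_HOST).$path;
        } else {
            // document-relative path: resolve against the page's directory
            $img_path = $dir.'/'.$path;
        }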
    }

   echo $img_path;  // this should be a full URL
}
?>
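
To make the host check easier to follow, here is a small sketch that runs the same parse_url() test on a few made-up src values and shows what dirname() yields for the page URL. The sample values and the $kinds array are assumptions for illustration only:

```php
<?php
// dirname() treats the URL as a path and drops the last segment.
$site = 'http://example.com/some/deeply/buried/page.html';
$dir = dirname($site);
echo $dir, "\n";

// Made-up src values: one full URL, one root-relative, one document-relative.
$samples = array(
    'http://example.com/images/a.jpg',
    '/images/a.jpg',
    'images/a.jpg',
);

// A value counts as "full" when parse_url() reports a host component.
$kinds = array();
foreach ($samples as $src) {
    $info = parse_url($src);
    $kinds[$src] = isset($info['host']) ? 'full' : 'relative';
    echo $src, ' => ', $kinds[$src], "\n";
}
```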

Answer 2 (score: 1)

This works for me; give it a try.

<?php
  echo path_to_absolute(
    "../images/example.jpg", /* image url */
    "http://php.net/manual/en/" /* current page url */,
    false /* whether the page url ends with a file name, like "http://server.com/file.html" */
  );

  function path_to_absolute( $src, $base = null, $has_filename = false ) {
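    if ( $has_filename && !in_array( substr( $src, 0, 1 ), array( "?", "#" ) ) ) {
      $base = dirname( $base )."/";
    }
    elseif ( !$has_filename ) {
      /* a query or fragment on a page with a file name is appended to $base as-is */
      $base = rtrim( $base, "/" )."/";
    }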

    if ( parse_url( $src, PHP_URL_HOST ) ) {
      /* It is a full url, so return it without modifying */
      return $src;
    }

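    if ( substr( $src, 0, 1 ) == "/" ) {
      /* $src begins with a slash: join it with the scheme, host and optional port of $base */
      $port = parse_url( $base, PHP_URL_PORT );
      return parse_url( $base, PHP_URL_SCHEME )."://".parse_url( $base, PHP_URL_HOST ).( $port ? ":".$port : "" ).$src;
    }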

    /* remove a leading './' from $src, we don't need it */
    $src  = ( substr( $src, 0, 2 ) === "./" ) ? substr( $src, 2, strlen( $src ) ) : $src;

    /* count how many directory levels we need to go up */
    $path = substr_count( $src, "../" );
    $src  = str_ireplace( "../", "", $src );

    for( $i = 1; $i <= $path; $i++ ) {
      if ( parse_url( dirname( $base ), PHP_URL_HOST ) ) {
        $base = dirname( $base ) . "/";
      }
    }

    return $base . $src;
  }
?>

Example usage:
Here we collect the links on php.net, since it has plenty of relative links.

<?php
  $url  = "http://www.php.net/manual/en/tokens.php";
  $html = file_get_contents( $url );
  $dom  = new DOMDocument;
  @$dom->loadHTML( $html );
  $dom->preserveWhiteSpace  = false;

  $links  = $dom->getElementsByTagName( 'a' );

  foreach( $links as $link ) {
    $original_url = $link->getAttribute( 'href' );
    $absolute_url = path_to_absolute( $original_url, $url, true );
    echo $original_url." - ".$absolute_url."\n";
  }

  /** prints...
   * / - http://www.php.net/
   * ...
   * control-structures.while.php     - http://www.php.net/manual/en/control-structures.while.php
   * control-structures.do.while.php  - http://www.php.net/manual/en/control-structures.do.while.php
   * ...
   * /sitemap.php - http://www.php.net/sitemap.php
   * /contact.php - http://www.php.net/contact.php
   * ...
   * http://developer.yahoo.com/ - http://developer.yahoo.com/
   * ...
   * ?setbeta=1&beta=1 - http://www.php.net/manual/en/tokens.php?setbeta=1&beta=1
   * ...
   * #85872 - http://www.php.net/manual/en/tokens.php#85872
   **/
?>

Answer 3 (score: 0)

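I think you should combine your second solution with the URL of 'any site', because the src attribute of an image may contain only a relative path. From a web developer's point of view, there is no need to include the absolute path.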
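
A rough sketch of that combination, under stated assumptions: $page_url and the inline $html are made up for illustration, and only root-relative paths (those starting with a slash) are resolved. Full URLs are kept as they are and other relative forms are left untouched:

```php
<?php
// Assumed inputs for illustration: the page URL and the inline HTML are made up.
$page_url = 'http://example.com/gallery/index.html';
$html = '<html><body><img src="/images/a.jpg"><img src="http://cdn.example.org/b.png"></body></html>';

// Collect parser warnings internally instead of silencing them with @.
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

// Scheme and host of the page, e.g. http://example.com
$origin = parse_url($page_url, PHP_URL_SCHEME) . '://' . parse_url($page_url, PHP_URL_HOST);

$absolute = array();
foreach ($dom->getElementsByTagName('img') as $image) {
    $src = $image->getAttribute('src');
    // Keep full URLs; prefix root-relative ones with the page's origin.
    if (parse_url($src, PHP_URL_HOST) === null && substr($src, 0, 1) === '/') {
        $src = $origin . $src;
    }
    $absolute[] = $src;
}

print_r($absolute);
```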