Question

My HTML page has the following structure:

<div id="1">
  <div id="2">
    <div id="3">
      <div id="4">
        <div id="5">   
          <div id="photo">    
            <a id="photo" href="link">
              <img width="200" src="http://site.com/photo.jpg"> 
            </a> 
          </div>
          <div id="info"></div>
        </div>
      </div> 
    </div> 
  </div> 
</div>

I need to get the img URL (http://site.com/...).

My code:

include('simple_html_dom.php');

// Create a DOM object from a URL
$html = file_get_html('http://site.com/123');


// Find all img tags with width=200
foreach($html->find('img[width="200"]') as $e)
    echo $e->src . '<br>';

But it does not work for this site.
Maybe there is another way to get the image URL?

Answer 1

In $html->find('img[width=200]'), the value should probably be 200 without the extra quotes.
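
If it is unclear whether the attribute filter itself is at fault, one way to cross-check it is with PHP's bundled DOM extension (DOMDocument and DOMXPath) instead of simple_html_dom. This is only a sketch under assumptions: the markup is a shortened copy of the structure in the question, and the variable names $markup and $sources are made up for illustration.

```php
<?php
// Shortened copy of the markup from the question (assumed sample data).
$markup = '
<div id="photo">
  <a href="link">
    <img width="200" src="http://site.com/photo.jpg">
  </a>
</div>';

$doc = new DOMDocument();
// Collect libxml parse warnings internally instead of printing them,
// since real-world HTML is rarely perfectly valid.
libxml_use_internal_errors(true);
$doc->loadHTML($markup);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Collect the src of every <img> whose width attribute equals "200".
$sources = array();
foreach ($xpath->query('//img[@width="200"]') as $img) {
    $sources[] = $img->getAttribute('src');
}

print_r($sources);
```

If this finds the image in a saved copy of the page but the simple_html_dom call does not, the selector is the likely culprit. If neither finds it, the fetched HTML probably does not contain the tag at all.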

Answer 2

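As expected, the site serves different content depending on the User-Agent. To get the HTML you expect, let the server know that you want the "for browsers" version. For example, you can remove this line: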

$html = file_get_html('http://vk.com/durov');

...and replace it with the following:

$context = stream_context_create(array('http' => array(
  'header' => 'User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.60 Safari/537.17'
)));
$html = str_get_html( file_get_contents('http://vk.com/durov', false, $context) );

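I should note that spoofing the User-Agent is generally frowned upon. You should first run the following to see whether the content served without it already meets your needs: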

<?php
  header('Content-type: text/plain');
  echo file_get_contents('http://siteurl.com');

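This displays the source code that the site wants bots to see. For the site in question, that is a lightweight version of the page, which from your point of view takes less time to process.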

Answer 3

You can use a regular expression to find it, for example:

<?php 
$string = '
<div id="1">
  <div id="2">
    <div id="3">
      <div id="4">
        <div id="5">   
          <div id="photo">    
            <a id="photo" href="link">
              <img width="200" src="http://site.com/photo.jpg"> 
            </a> 
          </div>
          <div id="info"></div>
        </div>
      </div> 
    </div> 
  </div> 
</div> ';

// Match from "http" up to (but not including) the next double quote.
$pattern = '/http[^"]+/';
preg_match($pattern, $string, $matches);
print_r($matches);

This prints:

Array
(
    [0] => http://site.com/photo.jpg
)
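
The pattern above returns the first string that starts with "http", so it would pick up a link's href if that happened to be an absolute URL appearing before the image. A slightly more targeted sketch, assuming the src value is always wrapped in double quotes as in the question, captures the src attribute of an img tag specifically. The absolute href in this sample is an assumption added to show the difference. It is not in the original markup.

```php
<?php
// Shortened sample markup; the absolute href is an assumed variation.
$string = '
<div id="photo">
  <a id="photo" href="http://site.com/link">
    <img width="200" src="http://site.com/photo.jpg">
  </a>
</div>';

// Match an <img> tag and capture the value of its src attribute.
// [^>]* keeps the match inside a single tag.
$pattern = '/<img\b[^>]*\bsrc="([^"]+)"/i';

$src = null;
if (preg_match($pattern, $string, $matches)) {
    $src = $matches[1]; // first capture group: the URL itself
    echo $src, "\n";
}
```

Regular expressions remain fragile against arbitrary HTML (single-quoted or unquoted attributes, for instance), so this is best treated as a quick fix for markup whose shape is known.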

How to get part of the HTML using PHP Simple HTML DOM
