Question

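I have a PHP application that grabs HTML from a third-party source, and that HTML may contain one or more IMG elements. I want to grab the first IMG instance in its entirety, but I'm not sure how to go about it.

Can anyone push me in the right direction?

Thanks.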

Answer 1

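You can use XPath to parse the HTML and pull out the data you want that way. It is a little more involved than checking string positions, but it has the advantage of being more robust if you later want something more specific (the src and alt of the first img tag, for example).

First load the HTML string into a DOMDocument, then pass that document to a DOMXPath.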

// Load html in to DOMDocument, set up XPath
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
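
Because the HTML comes from a third party it may well be malformed, and loadHTML() reports each parse problem as a PHP warning. The following is a minimal sketch of collecting those errors quietly with libxml_use_internal_errors(). The helper name loadThirdPartyHtml and the sample markup are made up for illustration and are not part of the original answer.

```php
<?php
// Hypothetical helper: load possibly malformed HTML without emitting warnings.
function loadThirdPartyHtml($html) {
    $doc = new DOMDocument();

    // Collect libxml parse errors internally instead of raising PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    // Restore whatever error handling mode was active before.
    libxml_use_internal_errors($previous);

    return $doc;
}

// Sample input with an unclosed <p> and an unknown tag (illustrative only).
$doc = loadThirdPartyHtml('<p>Intro <img src="a.png" alt="A"><foo>text');
$xpath = new DOMXPath($doc);
$count = $xpath->query('/descendant::img[1]')->length;
echo $count, "\n";
```

The parser still recovers the img element from the broken markup; only the warnings are suppressed.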

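We want the first img that occurs on the page, so use the selector /descendant::img[1]. N.B. this is not the same as //img[1], although the two will often give similar results: //img[1] selects every img that is the first img child of its parent, so it can match more than one node. There is a good explanation of the difference at https://stackoverflow.com/a/453902/2287.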

$matches = $xpath->evaluate("/descendant::img[1]");
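
To make the difference between the two selectors concrete, here is a small sketch that runs both against the same markup. The markup is invented for illustration; it has two div containers, each with its own img.

```php
<?php
// Illustrative markup: two containers, each with its own first img child.
$html = '<div><img src="a.png"></div><div><img src="b.png"></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// First img in document order: exactly one node.
$first = $xpath->query('/descendant::img[1]');

// Every img that is the first img child of its parent: one per div here.
$perParent = $xpath->query('//img[1]');

echo $first->length, "\n";                        // prints 1
echo $perParent->length, "\n";                    // prints 2
echo $first->item(0)->getAttribute('src'), "\n";  // prints a.png
```

When every img on a page shares one parent the two selectors return the same node, which is why the difference is easy to miss.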

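One drawback of XPath is that it has no easy way to say "give me back the full string that matched this img tag", so we can put together a simple function that iterates over the matched node's attributes and rebuilds the img tag.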

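$tag = "<img ";
$vals = array();
foreach ($node->attributes as $attr) {
    // Escape the value so a quote inside it cannot break the rebuilt tag
    $vals[] = $attr->name . '="' . htmlspecialchars($attr->value) . '"';
}
$tag .= implode(" ", $vals) . " />";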
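
As an alternative to rebuilding the tag by hand, DOMDocument::saveHTML() accepts a node argument (PHP 5.3.6 and later) and returns the markup for just that node. The sketch below uses sample markup invented for illustration. Note that the result is re-serialized by libxml, so it may differ from the original string in details such as quoting or the trailing slash.

```php
<?php
// Illustrative markup with two images; only the first one is wanted.
$html = '<html><body><p>Text <img src="/images/my-image.png" alt="My image"> more'
    . ' <img src="second.jpg" alt="Second"></p></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// item(0) returns null when the document contains no img at all.
$node = $xpath->query('/descendant::img[1]')->item(0);

$tag = null;
if ($node !== null) {
    // Serialize only the matched node rather than the whole document.
    $tag = $doc->saveHTML($node);
    echo $tag, "\n";
}
```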

Putting it all together we get something like:

<?php
// Example html
$html = '<html><body>'
    . ' <img src="/images/my-image.png" alt="My image" width="100" height="100" />'
    . 'Some text here <img src="do-not-want-second.jpg" alt="No thanks" />';

// Load html in to DOMDocument, set up XPath
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Get the first img in the doc
// N.B. Not the same as "//img[1]" - see https://stackoverflow.com/a/453902/2287
$matches = $xpath->evaluate("/descendant::img[1]");
foreach ($matches as $match) {
    echo buildImgTag($match);
}

/**
 * Build an img tag given its matched node
 *
 * @param DOMElement $node Img node
 *
 * @return string Rebuilt img tag
 */
function buildImgTag($node) {
    $tag = "<img ";
    $vals = array();
    foreach ($node->attributes as $attr) {
        // Escape the value so a quote inside it cannot break the rebuilt tag
        $vals[] = $attr->name . '="' . htmlspecialchars($attr->value) . '"';
    }
    $tag .= implode(" ", $vals) . " />";

    return $tag;
}


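All in all this is slightly more complicated than running strpos or a regular expression over the HTML, but it gives you more flexibility if you decide to do anything further with the img tag, such as pulling out one specific attribute.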
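
As an illustration of that flexibility, the following sketch reads individual attributes from the first img instead of rebuilding the whole tag. The sample markup is invented for the example; getAttribute() and the XPath string() function are standard, but treat this as one possible way to do it rather than the answer's own code.

```php
<?php
// Illustrative markup with two images.
$html = '<body><img src="/images/my-image.png" alt="My image" width="100">'
    . '<img src="second.jpg" alt="Second"></body>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$src = null;
$alt = null;

// Option 1: fetch the node, then read attributes from it.
$img = $xpath->query('/descendant::img[1]')->item(0);
if ($img instanceof DOMElement) {
    $src = $img->getAttribute('src');
    $alt = $img->getAttribute('alt');
}

// Option 2: let XPath return the attribute value directly as a string.
// An empty string comes back when there is no matching attribute.
$width = $xpath->evaluate('string(/descendant::img[1]/@width)');

echo $src, "\n";
echo $alt, "\n";
echo $width, "\n";
```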

Answer 2

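The example below will work if you assume the HTML is valid, but we can't assume that! If you are 100% sure it is valid HTML then go ahead and use it; otherwise I suggest using the better way shown further down.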

$html = '<br />First<img src="path/abc.jpg" />Next<img src="path/cde.jpg" />';

$start = stripos($html, '<img');
$extracted = substr($html, $start);
$end = stripos($extracted, '>');

echo substr($html, $start, $end+1);

This code will give you: <img src="path/abc.jpg" />

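Find the first occurrence of <img using the case-insensitive function stripos.
Cut the data off so that it starts at that first occurrence.
Find the first occurrence of > in the remainder, again using stripos.
Extract everything between the start point and the end point using substr.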
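
The steps above can be wrapped in a small function that also handles input with no img tag, where stripos returns false. This is a sketch rather than the answer's own code: the name firstImgTag is hypothetical, and it passes an offset to strpos instead of cutting the string first, which finds the same closing bracket. Like the original, it stops too early if a ">" appears inside an attribute value.

```php
<?php
// Hypothetical helper: return the first <img ...> tag as a string,
// or null when the input has no img tag or the tag is never closed.
function firstImgTag($html) {
    // Case-insensitive search for the start of the first img tag.
    $start = stripos($html, '<img');
    if ($start === false) {
        return null;
    }

    // Look for the closing ">" beginning at the tag itself.
    $end = strpos($html, '>', $start);
    if ($end === false) {
        return null;
    }

    // Length is inclusive of the closing ">".
    return substr($html, $start, $end - $start + 1);
}

$html = '<br />First<img src="path/abc.jpg" />Next<img src="path/cde.jpg" />';
echo firstImgTag($html), "\n";  // prints <img src="path/abc.jpg" />

var_dump(firstImgTag('<p>No images here</p>'));  // NULL
```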

The better way:

PHP Simple HTML DOM Parser Manual

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all images 
foreach($html->find('img') as $element) {
       echo $element->src . '<br>';
}

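An HTML DOM parser written in PHP 5+ that lets you manipulate HTML in a very easy way!
Requires PHP 5+.
Supports invalid HTML.
Find tags on an HTML page with selectors, just like jQuery.
Extract contents from HTML in a single line.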

Answer 3

jQuery can do this for you.

$('img')[0]

If the image sits in a smaller sub-section of the HTML on the page, adjust the selector accordingly.

Extracting the first IMG element from a block of HTML
