我想创建一个页面,其中所有驻留在我网站上的图像都列有标题和替代表示。
我已经给我写了一个程序来查找和加载所有HTML文件,但现在我不知道如何从这个HTML中提取src
,title
和alt
:
<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" />
我想这应该用一些正则表达式完成,但由于标签的顺序可能会有所不同,而且我需要所有这些,我真的不知道如何以优雅的方式解析它(我可以做到通过char方式的硬炭,但这很痛苦。)
答案 0 :(得分:241)
$url="http://example.com";
$html = file_get_contents($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$tags = $doc->getElementsByTagName('img');
foreach ($tags as $tag) {
echo $tag->getAttribute('src');
}
答案 1 :(得分:191)
使用regexp来解决这类问题的问题是a bad idea,并且很可能会导致无法维护和不可靠的代码。更好地使用HTML parser。
在这种情况下,最好将流程分为两部分:
我将假设您的doc不是xHTML严格的,因此您不能使用XML解析器。例如。使用此网页源代码:
/* preg_match_all match the regexp in all the $html string and output everything as
an array in $result. "i" option is used to make it case insensitive */
preg_match_all('/<img[^>]+>/i',$html, $result);
print_r($result);
Array
(
[0] => Array
(
[0] => <img src="/Content/Img/stackoverflow-logo-250.png" width="250" height="70" alt="logo link to homepage" />
[1] => <img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />
[2] => <img class="vote-down" src="/content/img/vote-arrow-down.png" alt="vote down" title="This was not helpful (click again to undo)" />
[3] => <img src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG" height=32 width=32 alt="gravatar image" />
[4] => <img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />
[...]
)
)
然后我们用循环获取所有img标记属性:
$img = array();
foreach( $result as $img_tag)
{
preg_match_all('/(alt|title|src)=("[^"]*")/i',$img_tag, $img[$img_tag]);
}
print_r($img);
Array
(
[<img src="/Content/Img/stackoverflow-logo-250.png" width="250" height="70" alt="logo link to homepage" />] => Array
(
[0] => Array
(
[0] => src="/Content/Img/stackoverflow-logo-250.png"
[1] => alt="logo link to homepage"
)
[1] => Array
(
[0] => src
[1] => alt
)
[2] => Array
(
[0] => "/Content/Img/stackoverflow-logo-250.png"
[1] => "logo link to homepage"
)
)
[<img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />] => Array
(
[0] => Array
(
[0] => src="/content/img/vote-arrow-up.png"
[1] => alt="vote up"
[2] => title="This was helpful (click again to undo)"
)
[1] => Array
(
[0] => src
[1] => alt
[2] => title
)
[2] => Array
(
[0] => "/content/img/vote-arrow-up.png"
[1] => "vote up"
[2] => "This was helpful (click again to undo)"
)
)
[<img class="vote-down" src="/content/img/vote-arrow-down.png" alt="vote down" title="This was not helpful (click again to undo)" />] => Array
(
[0] => Array
(
[0] => src="/content/img/vote-arrow-down.png"
[1] => alt="vote down"
[2] => title="This was not helpful (click again to undo)"
)
[1] => Array
(
[0] => src
[1] => alt
[2] => title
)
[2] => Array
(
[0] => "/content/img/vote-arrow-down.png"
[1] => "vote down"
[2] => "This was not helpful (click again to undo)"
)
)
[<img src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG" height=32 width=32 alt="gravatar image" />] => Array
(
[0] => Array
(
[0] => src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG"
[1] => alt="gravatar image"
)
[1] => Array
(
[0] => src
[1] => alt
)
[2] => Array
(
[0] => "http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG"
[1] => "gravatar image"
)
)
[..]
)
)
Regexp是CPU密集型的,因此您可能希望缓存此页面。如果您没有缓存系统,可以使用ob_start并从文本文件加载/保存来调整自己的缓存系统。
首先,我们使用preg_ match_ all,这个函数可以获得与模式匹配的每个字符串,并将其输出到第三个参数中。
正则表达式:
<img[^>]+>
我们将其应用于所有html网页。它可以被读作以“<img
”开头的每个字符串,包含非“&gt;” char并以&gt; 结束。
(alt|title|src)=("[^"]*")
我们在每个img标签上连续应用它。它可以被读作以“alt”,“title”或“src”开头的每个字符串,然后是“=”,然后是“”,一堆不是'“的字符串,以'''。在()之间隔离子字符串。
最后,每次你想要处理regexp时,都可以使用好的工具来快速测试它们。请检查此online regexp tester。
编辑:回答第一条评论。
确实,我没有想到(希望很少)人使用单引号。
好吧,如果你只使用',只需替换所有“by”。
如果你混合两者。首先你应该打自己:-),然后尝试使用(“|')代替或”和[^ø]来代替[^“]。
答案 2 :(得分:64)
只是举一个使用PHP的XML功能来完成任务的小例子:
$doc=new DOMDocument();
$doc->loadHTML("<html><body>Test<br><img src=\"myimage.jpg\" title=\"title\" alt=\"alt\"></body></html>");
$xml=simplexml_import_dom($doc); // just to make xpath more simple
$images=$xml->xpath('//img');
foreach ($images as $img) {
echo $img['src'] . ' ' . $img['alt'] . ' ' . $img['title'];
}
我确实使用了DOMDocument::loadHTML()
方法,因为此方法可以处理HTML语法,并且不会强制输入文档为XHTML。严格地说,转换为SimpleXMLElement
不是必需的 - 它只是使用xpath并且xpath结果更简单。
答案 3 :(得分:8)
如果它是XHTML,那么你的例子是,你只需要simpleXML。
<?php
$input = '<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny"/>';
$sx = simplexml_load_string($input);
var_dump($sx);
?>
输出:
object(SimpleXMLElement)#1 (1) {
["@attributes"]=>
array(3) {
["src"]=>
string(22) "/image/fluffybunny.jpg"
["title"]=>
string(16) "Harvey the bunny"
["alt"]=>
string(26) "a cute little fluffy bunny"
}
}
答案 4 :(得分:5)
必须像这样编辑脚本
foreach( $result[0] as $img_tag)
因为preg_match_all返回数组数组
答案 5 :(得分:5)
您可以使用simplehtmldom。 simplehtmldom支持大多数jQuery选择器。下面给出一个例子
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';
// Find all links
foreach($html->find('a') as $element)
echo $element->href . '<br>';
答案 6 :(得分:3)
我用preg_match来做。
在我的情况下,我有一个字符串,其中包含我从Wordpress获得的一个<img>
标记(并且没有其他标记),我试图获取src
属性,以便我可以运行它timthumb。
// get the featured image
$image = get_the_post_thumbnail($photos[$i]->ID);
// get the src for that image
$pattern = '/src="([^"]*)"/';
preg_match($pattern, $image, $matches);
$src = $matches[1];
unset($matches);
在获取标题或替代品的模式中,您只需使用$pattern = '/title="([^"]*)"/';
来获取标题,或使用$pattern = '/title="([^"]*)"/';
来获取替代品。可悲的是,我的正则表达式并不足以通过一次传递来获取所有三个(alt / title / src)。
答案 7 :(得分:1)
这是一个PHP函数我为了类似的目的而从所有上述信息中蹒跚而行,即动态调整图像标签的宽度和长度属性...有点笨重,或许,但似乎可靠地工作:
function ReSizeImagesInHTML($HTMLContent,$MaximumWidth,$MaximumHeight) {
// find image tags
preg_match_all('/<img[^>]+>/i',$HTMLContent, $rawimagearray,PREG_SET_ORDER);
// put image tags in a simpler array
$imagearray = array();
for ($i = 0; $i < count($rawimagearray); $i++) {
array_push($imagearray, $rawimagearray[$i][0]);
}
// put image attributes in another array
$imageinfo = array();
foreach($imagearray as $img_tag) {
preg_match_all('/(src|width|height)=("[^"]*")/i',$img_tag, $imageinfo[$img_tag]);
}
// combine everything into one array
$AllImageInfo = array();
foreach($imagearray as $img_tag) {
$ImageSource = str_replace('"', '', $imageinfo[$img_tag][2][0]);
$OrignialWidth = str_replace('"', '', $imageinfo[$img_tag][2][1]);
$OrignialHeight = str_replace('"', '', $imageinfo[$img_tag][2][2]);
$NewWidth = $OrignialWidth;
$NewHeight = $OrignialHeight;
$AdjustDimensions = "F";
if($OrignialWidth > $MaximumWidth) {
$diff = $OrignialWidth-$MaximumHeight;
$percnt_reduced = (($diff/$OrignialWidth)*100);
$NewHeight = floor($OrignialHeight-(($percnt_reduced*$OrignialHeight)/100));
$NewWidth = floor($OrignialWidth-$diff);
$AdjustDimensions = "T";
}
if($OrignialHeight > $MaximumHeight) {
$diff = $OrignialHeight-$MaximumWidth;
$percnt_reduced = (($diff/$OrignialHeight)*100);
$NewWidth = floor($OrignialWidth-(($percnt_reduced*$OrignialWidth)/100));
$NewHeight= floor($OrignialHeight-$diff);
$AdjustDimensions = "T";
}
$thisImageInfo = array('OriginalImageTag' => $img_tag , 'ImageSource' => $ImageSource , 'OrignialWidth' => $OrignialWidth , 'OrignialHeight' => $OrignialHeight , 'NewWidth' => $NewWidth , 'NewHeight' => $NewHeight, 'AdjustDimensions' => $AdjustDimensions);
array_push($AllImageInfo, $thisImageInfo);
}
// build array of before and after tags
$ImageBeforeAndAfter = array();
for ($i = 0; $i < count($AllImageInfo); $i++) {
if($AllImageInfo[$i]['AdjustDimensions'] == "T") {
$NewImageTag = str_ireplace('width="' . $AllImageInfo[$i]['OrignialWidth'] . '"', 'width="' . $AllImageInfo[$i]['NewWidth'] . '"', $AllImageInfo[$i]['OriginalImageTag']);
$NewImageTag = str_ireplace('height="' . $AllImageInfo[$i]['OrignialHeight'] . '"', 'height="' . $AllImageInfo[$i]['NewHeight'] . '"', $NewImageTag);
$thisImageBeforeAndAfter = array('OriginalImageTag' => $AllImageInfo[$i]['OriginalImageTag'] , 'NewImageTag' => $NewImageTag);
array_push($ImageBeforeAndAfter, $thisImageBeforeAndAfter);
}
}
// execute search and replace
for ($i = 0; $i < count($ImageBeforeAndAfter); $i++) {
$HTMLContent = str_ireplace($ImageBeforeAndAfter[$i]['OriginalImageTag'],$ImageBeforeAndAfter[$i]['NewImageTag'], $HTMLContent);
}
return $HTMLContent;
}
答案 8 :(得分:0)
以下是PHP中的解决方案:
只需下载QueryPath,然后执行以下操作:
$doc= qp($myHtmlDoc);
foreach($doc->xpath('//img') as $img) {
$src= $img->attr('src');
$title= $img->attr('title');
$alt= $img->attr('alt');
}
就是这样,你已经完成了!
答案 9 :(得分:0)
我已经阅读了本页上的许多评论,这些评论抱怨说使用dom解析器是不必要的开销。嗯,这可能比仅进行正则表达式调用更为昂贵,但是OP表示无法控制img标签中属性的顺序。这个事实导致不必要的正则表达式模式卷积。除此之外,使用dom解析器还具有可读性,可维护性和dom意识(正则表达式不是dom意识)的其他好处。
我喜欢正则表达式,并且回答了很多正则表达式问题,但是当处理有效的HTML时,很少有理由要在解析器上进行正则表达式。
在下面的演示中,了解如何简单,干净地使用DOMDocument以引号(和根本不引号)的混合顺序处理img标签属性。还要注意,没有目标属性的标记根本不会破坏-提供一个空字符串作为值。
代码:(Demo)
$test = <<<HTML
<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" />
<img src='/image/pricklycactus.jpg' title='Roger the cactus' alt='a big green prickly cactus' />
<p>This is irrelevant text.</p>
<img alt="an annoying white cockatoo" title="Polly the cockatoo" src="/image/noisycockatoo.jpg">
<img title=something src=somethingelse>
HTML;
libxml_use_internal_errors(true); // silences/forgives complaints from the parser (remove to see what is generated)
$dom = new DOMDocument();
$dom->loadHTML($test);
foreach ($dom->getElementsByTagName('img') as $i => $img) {
echo "IMG#{$i}:\n";
echo "\tsrc = " , $img->getAttribute('src') , "\n";
echo "\ttitle = " , $img->getAttribute('title') , "\n";
echo "\talt = " , $img->getAttribute('alt') , "\n";
echo "---\n";
}
输出:
IMG#0:
src = /image/fluffybunny.jpg
title = Harvey the bunny
alt = a cute little fluffy bunny
---
IMG#1:
src = /image/pricklycactus.jpg
title = Roger the cactus
alt = a big green prickly cactus
---
IMG#2:
src = /image/noisycockatoo.jpg
title = Polly the cockatoo
alt = an annoying white cockatoo
---
IMG#3:
src = somethingelse
title = something
alt =
---
在专业代码中使用此技术将使您的脚本干净整洁,可以避免打h,希望您在其他地方工作的同事也更少。