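I am trying to parse a website and get the names or URLs of its images.
Example URL: http://www.theworkingmanstore.com/georgia-gr14-infants-romeo.aspx
There are six or more images inside a single <td>, and I only want the src of the first img in that <td>.
I am sure it can be done with a DOM parser, but I have no experience with one.
Any help would be greatly appreciated.
Thanks.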
$html = file_get_contents($url);
$reg = '/img src=["\']?([^"\' ]*)["\' ]/';
preg_match_all($reg, $html, $m);
// Strip the attribute name and the host from each full match, then trim double quotes.
$arr = array_map(function($v){
    return trim(str_replace(array('img src=', 'http://www.theworkingmanstore.com'), '', $v), '"');
}, $m[0]);
print_r($arr);
Output from the regular expression:
Array
(
[0] => /images/logo2.png
[1] => /images/mod_head_category_lt.gif
[2] => '/images/products/display/GR14_EXTRALARGE.jpg'
[3] => '/images/products/thumb/GR14_EXTRALARGE.jpg'
[4] => '/images/products/thumb/GR14_8_EXTRALARGE.jpg'
[5] => '/images/products/thumb/GR14_5_EXTRALARGE.jpg'
[6] => '/images/products/thumb/GR14_3_EXTRALARGE.jpg'
[7] => '/images/products/thumb/GR14_42_EXTRALARGE.jpg'
[8] => '/images/products/thumb/GR14_2_EXTRALARGE.jpg'
[9] => /images/freeshipping.jpg
[10] => /images/facebook_32.png
[11] => images/twitter_32.png
[12] => images/googleplus_32.png
[13] => images/pinterest_32.png
[14] => /images/payments.gif
[15] => /images/brands/the-working-man.jpg
)
I tried the DOM parser suggestion:
$html = file_get_contents($url) ;
$dom = new DOMDocument();
$dom->loadHtml($html);
$xpath = new DOMXPath($dom);
echo $xpath->evaluate(
'string(//td/a[@id = "Zoomer"]/descendant::img[1]/@src)'
);
Error output: Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: Tag nav invalid in Entity
Answer 0 (score: 4)
In DOM, everything is a node, including img elements and src attributes. XPath lets you fetch lists of nodes from the DOM.
$dom = new DOMDocument();
$dom->loadHtml($html);
$xpath = new DOMXPath($dom);
foreach ($xpath->evaluate('//img/@src') as $src) {
echo $src->value, "\n";
}
Output:
http://www.theworkingmanstore.com/images/products/display/GR14_EXTRALARGE.jpg
http://www.theworkingmanstore.com/images/products/detail/GR14_EXTRALARGE.jpg
/images/products/thumb/GR14_EXTRALARGE.jpg
/images/products/thumb/GR14_8_EXTRALARGE.jpg
/images/products/thumb/GR14_5_EXTRALARGE.jpg
/images/products/thumb/GR14_3_EXTRALARGE.jpg
/images/products/thumb/GR14_42_EXTRALARGE.jpg
/images/products/thumb/GR14_2_EXTRALARGE.jpg
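The "Tag nav invalid in Entity" warning reported in the question typically comes from libxml, whose HTML parser does not recognize newer elements such as nav; parsing usually still succeeds. A minimal sketch, assuming a made-up sample page (the markup below is hypothetical, not the real site), that collects those parse errors internally with libxml_use_internal_errors() instead of letting them surface as PHP warnings:

```php
<?php
// Hypothetical sample markup; the <nav> element stands in for the HTML5 tags on the real page.
$html = <<<'HTML'
<html><body>
<nav><img src="/images/logo2.png"></nav>
<table><tr><td>
<a id="Zoomer" href="#"><img src="/images/products/display/GR14_EXTRALARGE.jpg"></a>
<img src="/images/products/thumb/GR14_EXTRALARGE.jpg">
</td></tr></table>
</body></html>
HTML;

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Discard the collected errors and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Gather the src attribute node of every img element in the document.
$xpath = new DOMXPath($dom);
$sources = [];
foreach ($xpath->evaluate('//img/@src') as $src) {
    $sources[] = $src->value;
}
print_r($sources);
```

If the parse errors are of interest, libxml_get_errors() can be called before libxml_clear_errors() to inspect them.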
XPath allows quite complex conditions. The following example outputs the src attribute of the first img inside a td.
$dom = new DOMDocument();
$dom->loadHtml($html);
$xpath = new DOMXPath($dom);
foreach ($xpath->evaluate('//td/descendant::img[1]/@src') as $src) {
echo $src->value, "\n";
}
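Output:
http://www.theworkingmanstore.com/images/products/display/GR14_EXTRALARGE.jpg
The HTML in the question contains only a single td and, more importantly, the img sits inside an a element that has an id attribute. An id has to be unique within a document, so the expression can address that element directly. This makes it possible to cast the node list to a string inside XPath and get the result back as a string.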
$dom = new DOMDocument();
$dom->loadHtml($html);
$xpath = new DOMXPath($dom);
echo $xpath->evaluate(
'string(//td/a[@id = "Zoomer"]/descendant::img[1]/@src)'
);
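One property of the string() cast worth knowing: when the location path matches nothing, XPath returns an empty string rather than raising an error. A small self-contained sketch, assuming an invented fragment shaped like the structure described above (the "NoSuchId" value is hypothetical and exists only to show the empty case):

```php
<?php
// Hypothetical fragment modelled on the structure described above, not the real page.
$html = '<table><tr><td>'
      . '<a id="Zoomer" href="#"><img src="/images/products/display/GR14_EXTRALARGE.jpg"></a>'
      . '<img src="/images/products/thumb/GR14_EXTRALARGE.jpg">'
      . '</td></tr></table>';

// Keep libxml parse messages out of the output.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// string() casts the first matching attribute node to its value.
$first = $xpath->evaluate('string(//td/a[@id = "Zoomer"]/descendant::img[1]/@src)');

// No element has this id, so the cast yields an empty string.
$missing = $xpath->evaluate('string(//td/a[@id = "NoSuchId"]/descendant::img[1]/@src)');

echo $first === '' ? "no match\n" : $first . "\n";
echo $missing === '' ? "no match\n" : $missing . "\n";
```

Checking the result against the empty string is therefore a simple way to detect that the page layout changed and the expression no longer matches.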
Answer 1 (score: 0)
You can try this regular expression.
$html = 'Your HTML';
$reg = '/img src=["\']?([^"\' ]*)["\' ]/';
preg_match_all($reg, $html, $m);
$arr = array_map(function($v){
return trim(str_replace(array('img src=', 'http://www.theworkingmanstore.com'), '', $v), '"');
}, $m[0]);
print '<pre>';
print_r($arr);
print '</pre>';
Output:
Array
(
[0] => /images/products/display/GR14_EXTRALARGE.jpg
[1] => /images/products/detail/GR14_EXTRALARGE.jpg
[2] => /images/products/thumb/GR14_EXTRALARGE.jpg
[3] => /images/products/thumb/GR14_8_EXTRALARGE.jpg
[4] => /images/products/thumb/GR14_5_EXTRALARGE.jpg
[5] => /images/products/thumb/GR14_3_EXTRALARGE.jpg
[6] => /images/products/thumb/GR14_42_EXTRALARGE.jpg
[7] => /images/products/thumb/GR14_2_EXTRALARGE.jpg
)
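The pattern above already has a capture group, so the URLs are also available in $m[1] without the surrounding img src= text or quotes. A hedged variation, assuming invented sample markup (the two img tags below are hypothetical), that reads $m[1] and strips the host only when a URL starts with it. Because trim() in the original removes only double quotes, this variation may also explain the stray single quotes in the output shown in the question:

```php
<?php
// Hypothetical sample markup, not the real page: one double-quoted and one single-quoted src.
$html = '<img src="http://www.theworkingmanstore.com/images/products/display/GR14_EXTRALARGE.jpg">'
      . "<img src='/images/products/thumb/GR14_EXTRALARGE.jpg'>";

$reg = '/img src=["\']?([^"\' ]*)["\' ]/';
preg_match_all($reg, $html, $m);

// $m[1] holds only the captured URLs, so no quote trimming is needed.
$host = 'http://www.theworkingmanstore.com';
$paths = array_map(function ($url) use ($host) {
    // Remove the host prefix when present; leave other URLs untouched.
    return strpos($url, $host) === 0 ? substr($url, strlen($host)) : $url;
}, $m[1]);

print_r($paths);
```

The usual caveat applies: a regular expression like this only matches img tags where src directly follows the tag name, which is one reason the DOM approach is more robust.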