I have an HTML file; this is just part of it...
<div id="result" >
<div class="res_item" id="1" h="63c2c439b62a096eb3387f88465d36d0">
<div class="res_main">
<h2 class="res_main_top">
<img
src="/ff/gigablast.com.png"
alt="favicon for gigablast.com"
width=16
height=16
/>
<a
href="http://www.gigablast.com/"
rel="nofollow"
>
Gigablast
</a>
<div class="res_main">
<h2 class="res_main_top">
<img
src="/ff/ask.com.png"
alt="favicon for ask.com"
width=16
height=16
/>
<a
href="http://ask.com/" rel="nofollow"
>
Ask.com - What's Your Question?
</a>....
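I want to extract only the URLs (for example, http://www.gigablast.com and http://ask.com/; there are at least 10 URLs in the file above) using PHP DOMDocument. I know this much, but I don't know how to proceed: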
$doc = new DomDocument;
$doc->loadHTMLFile('urllist.html');
$data = $doc->getElementById('result');
Then what? These are inner tags, so I can't use $data->getElementsByTagName() here!
Answer 0 (score: 0)
You can call getElementsByTagName on a DOMElement object as well:
$doc = new DomDocument;
$doc->loadHTMLFile('urllist.html');
$result = $doc->getElementById('result');
$anchors = $result->getElementsByTagName('a');
$urls = array();
foreach ($anchors as $a) {
$urls[] = $a->getAttribute('href');
}
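If you also want the image sources, they are easy to add: iterate over the img elements and read their src attribute in the same way.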
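A minimal sketch of that extension, assuming markup shaped like the fragment in the question. The inline `$html` string is a hypothetical stand-in for urllist.html (with the unclosed tags closed so that it is self-contained), and the `libxml_use_internal_errors` call is an addition of this sketch, there only to keep parser warnings quiet:

```php
<?php
// Hypothetical inline sample standing in for urllist.html.
$html = <<<'HTML'
<div id="result">
  <div class="res_main">
    <h2 class="res_main_top">
      <img src="/ff/gigablast.com.png" alt="favicon for gigablast.com" width=16 height=16 />
      <a href="http://www.gigablast.com/" rel="nofollow">Gigablast</a>
    </h2>
  </div>
  <div class="res_main">
    <h2 class="res_main_top">
      <img src="/ff/ask.com.png" alt="favicon for ask.com" width=16 height=16 />
      <a href="http://ask.com/" rel="nofollow">Ask.com - What's Your Question?</a>
    </h2>
  </div>
</div>
HTML;

// Assumption: collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument;
$doc->loadHTML($html);
$result = $doc->getElementById('result');

$urls = array();
$images = array();
if ($result !== null) {
    // Link targets inside <div id="result">.
    foreach ($result->getElementsByTagName('a') as $a) {
        $urls[] = $a->getAttribute('href');
    }
    // Image sources inside the same container.
    foreach ($result->getElementsByTagName('img') as $img) {
        $images[] = $img->getAttribute('src');
    }
}
```

With a real file, `loadHTML($html)` would be replaced by `loadHTMLFile('urllist.html')` as in the answer above. The null check guards against a document that has no element with id "result".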
Answer 1 (score: 0)
If you just want to extract the href attribute of every a tag in the document (and the <div id="result"> is irrelevant), you can use the following:
$doc = new DomDocument;
$doc->loadHTMLFile('urllist.html');
$anchors = $doc->getElementsByTagName('a');
$urls = array();
foreach($anchors as $anchor) {
$urls[] = $anchor->getAttribute('href');
}
// $urls is your collection of urls in the original document.
Answer 2 (score: 0)
Use XPath to narrow the field down to the a elements inside <div class="res_main"> elements:
$doc = new DomDocument();
$doc->loadHTMLFile('urllist.html');
$xpath = new DomXpath($doc);
$query = '//div[@class="res_main"]//a';
$nodes = $xpath->query($query);
$urls = array();
foreach ($nodes as $node) {
$href = $node->getAttribute('href');
if (!empty($href)) {
$urls[] = $href;
}
}
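This avoids picking up every <a> element in the document, because it lets you select only the ones you want (you presumably don't care about navigation links and the like)...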
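As a variation on the same idea, the XPath expression can select the href attribute nodes directly, which makes the empty-href check unnecessary. The sketch below is only an illustration: the inline `$html` sample, the navigation link, and the extra class name are hypothetical, and the class-token test is an assumption for the case where a div carries more than one class.

```php
<?php
// Hypothetical sample: one navigation link outside the results, two result links.
$html = <<<'HTML'
<div id="nav"><a href="/about">About</a></div>
<div id="result">
  <div class="res_main">
    <a href="http://www.gigablast.com/" rel="nofollow">Gigablast</a>
  </div>
  <div class="res_main extra">
    <a href="http://ask.com/" rel="nofollow">Ask.com</a>
    <a name="no-href">anchor without href</a>
  </div>
</div>
HTML;

// Assumption: collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Match divs whose class list contains the token "res_main" (also when other
// classes are present), then select the href attribute of each descendant <a>.
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " res_main ")]//a/@href';

$urls = array();
foreach ($xpath->query($query) as $attr) {
    // Each result is an attribute node; nodeValue holds the URL string.
    $urls[] = $attr->nodeValue;
}
```

Anchors without an href produce no attribute node, so they drop out of the result without an explicit check, and the link under the hypothetical `nav` div is never matched.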