Question

I have an HTML file; this is just part of it...

<div id="result" >
    <div class="res_item" id="1" h="63c2c439b62a096eb3387f88465d36d0">
        <div class="res_main">
            <h2 class="res_main_top">
                <img 
                    src="/ff/gigablast.com.png" 
                    alt="favicon for gigablast.com" 
                    width=16 
                    height=16
                    />&nbsp;
                <a 
                    href="http://www.gigablast.com/" 
                    rel="nofollow"
                    >
                    Gigablast
                </a>
                <div class="res_main">
                    <h2 class="res_main_top">
                        <img 
                            src="/ff/ask.com.png" 
                            alt="favicon for ask.com" 
                            width=16 
                            height=16
                            />&nbsp;
                        <a 
                            href="http://ask.com/" rel="nofollow"
                            >
                            Ask.com - What&#039;s Your Question?
                        </a>....

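I want to extract only the URLs from the HTML above (for example, http://www.gigablast.com and http://ask.com/; there are at least 10 URLs in it) using PHP DOMDocument. I know this much, but I don't know how to proceed: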

$doc = new DomDocument;
$doc->loadHTMLFile('urllist.html');

$data = $doc->getElementById('result');

Then what? The links are in nested tags, so I can't use $data->getElementsByTagName() here!

Answer 1

You can call getElementsByTagName on a DOMElement object. It returns all matching descendants, not just direct children:

$doc = new DomDocument;
$doc->loadHTMLFile('urllist.html');

$result = $doc->getElementById('result');
$anchors = $result->getElementsByTagName('a');

$urls = array();
foreach ($anchors as $a) {
    $urls[] = $a->getAttribute('href');
}

If you want the image sources as well, that is just as easy to add.
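Collecting the image sources could look like the sketch below. It assumes a small inline sample in place of urllist.html (the `$html` string is a cleaned-up, hypothetical version of the markup in the question) and uses loadHTML instead of loadHTMLFile so that it runs on its own.

```php
<?php
// Hypothetical inline sample standing in for urllist.html.
$html = <<<'HTML'
<div id="result">
  <div class="res_main">
    <h2 class="res_main_top">
      <img src="/ff/gigablast.com.png" alt="favicon for gigablast.com" width="16" height="16">
      <a href="http://www.gigablast.com/" rel="nofollow">Gigablast</a>
    </h2>
  </div>
  <div class="res_main">
    <h2 class="res_main_top">
      <img src="/ff/ask.com.png" alt="favicon for ask.com" width="16" height="16">
      <a href="http://ask.com/" rel="nofollow">Ask.com</a>
    </h2>
  </div>
</div>
HTML;

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$result = $doc->getElementById('result');

// Same pattern as for the anchors, but reading the src attribute of each img.
$images = array();
foreach ($result->getElementsByTagName('img') as $img) {
    $images[] = $img->getAttribute('src');
}

print_r($images);
```

The loop has the same shape as the one for the anchors; only the tag name and the attribute name change.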

Answer 2

If you just want to extract the href attribute of every a tag in the document (and the <div id="result"> is irrelevant), you can use the following:

$doc = new DomDocument;
$doc->loadHTMLFile('urllist.html');
$anchors = $doc->getElementsByTagName('a');
$urls = array();
foreach($anchors as $anchor) {
    $urls[] = $anchor->getAttribute('href');
}

// $urls is your collection of urls in the original document.
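A document-wide scan can pick up anchors without an href, as well as repeated links. The sketch below shows one way to skip the former and drop the latter. The inline `$html` sample is hypothetical, and whether duplicates should be removed is an assumption about what the asker needs.

```php
<?php
// Hypothetical sample: one anchor has no href, one URL appears twice.
$html = '<p><a href="http://www.gigablast.com/">Gigablast</a> '
      . '<a href="http://ask.com/">Ask.com</a> '
      . '<a name="top">anchor without href</a> '
      . '<a href="http://ask.com/">Ask.com again</a></p>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$urls = array();
foreach ($doc->getElementsByTagName('a') as $anchor) {
    // getAttribute() returns an empty string when the attribute is missing,
    // so check for it explicitly and skip such anchors.
    if ($anchor->hasAttribute('href')) {
        $urls[] = $anchor->getAttribute('href');
    }
}

// Remove duplicates and reindex the array from 0.
$urls = array_values(array_unique($urls));

print_r($urls);
```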

Answer 3

Use XPath to narrow the field down to the a elements inside <div class="res_main"> elements:

$doc = new DomDocument();
$doc->loadHTMLFile('urllist.html');
$xpath = new DomXpath($doc);

$query = '//div[@class="res_main"]//a';
$nodes = $xpath->query($query);

$urls = array();

foreach ($nodes as $node) {
    $href = $node->getAttribute('href');
    if (!empty($href)) {
        $urls[] = $href;
    }
}

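This avoids picking up every <a> element in the document, because it lets you keep only the elements you want (you presumably don't care about navigation links and the like).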
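Two variations on the XPath approach, as a rough sketch: the query can select the href attribute nodes directly, and it can match res_main as one class among several instead of requiring the class attribute to equal it exactly. The inline `$html` sample, the `nav` block, and the extra `highlighted` class are hypothetical and only there to exercise the query.

```php
<?php
// Hypothetical sample: a navigation link outside the results, and one result
// whose class attribute carries a second class name.
$html = <<<'HTML'
<div id="nav"><a href="/about">About</a></div>
<div id="result">
  <div class="res_main"><a href="http://www.gigablast.com/">Gigablast</a></div>
  <div class="res_main highlighted"><a href="http://ask.com/">Ask.com</a></div>
</div>
HTML;

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Match divs whose class list contains the token "res_main", then select
// the href attribute nodes of all anchors below them.
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " res_main ")]//a/@href';

$urls = array();
foreach ($xpath->query($query) as $attr) {
    // Each result is a DOMAttr; its value property holds the URL string.
    $urls[] = $attr->value;
}

print_r($urls);
```

Because the query only returns href attribute nodes, anchors without an href never appear in the result, and the link under `nav` is excluded by the div predicate.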
