PHP DOM: extract links from a specific table

Date: 2017-09-10 17:19:39

Tags: php

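I want to extract all links and their text from my old website. That part works, but in some places the links sit in ol > li markup and in others in ul > li markup inside a table. I have roughly 400 different pages, and although I can extract all the links, I have to switch the tag name between ol and ul each time. The simplest, most time-saving approach for me would be to target the specific <table> that contains the links. However, when I select by <table>, the code also picks up links from other tables, which I don't want.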

Target structure, containing both ol > li and ul > li markup

<table style="width:850px;" cellspacing="0" cellpadding="1" border="3">
    <tbody>
        <tr>
        <td style="text-align: center; background-color: rgb(51, 51, 204);">
            <h1>My Links</h1>
        </td>
        </tr>
        <tr>
            <td>
                <ol>
                    <li><a href="http://websitelink.com/page1.php">Page 1</a></li>
                    <li><a href="http://websitelink.com/page2.php">Page 2</a></li>
                    <li><a href="http://websitelink.com/page3.php">Page 3</a></li>
                    <li><a href="http://websitelink.com/page4.php">Page 4</a></li>
                </ol>
                ...
                <ul>
                    <li><a href="http://websitelink.com/a.php">Page A</a></li>
                    <li><a href="http://websitelink.com/b.php">Page B</a></li>
                    <li><a href="http://websitelink.com/c.php">Page C</a></li>
                    <li><a href="http://websitelink.com/d.php">Page D</a></li>
                </ul>
            </td>
        </tr>
    </tbody>
</table>

My current PHP code

$html = file_get_contents('http://mywebsitelink.com/pagename.html');
$dom = new DOMDocument;
@$dom->loadHTML($html);
// I have to switch between 'ul' and 'ol' here; I would rather select by table
$oltags = $dom->getElementsByTagName('ol');

foreach ($oltags as $list) {
    $links = $list->getElementsByTagName('a');
    foreach ($links as $link) {
        $text = $link->nodeValue;
        $href = $link->getAttribute('href');
        if (!empty($text) && !empty($href)) {
            echo "Link Title:     " . $text . "       Location:     " . $href . "<br />";
        }
    }
}

2 answers:

Answer 0: (score: 1)

You can try this. Here we use DOMDocument and DOMXPath to query the anchors inside li elements.

XPath query: //table/tbody/tr/td/ol/li/a|//table/tbody/tr/td/ul/li/a. The | (union) operator makes the query match either //table/tbody/tr/td/ol/li/a or //table/tbody/tr/td/ul/li/a.


// $string is assumed to hold the HTML of the page, e.g. from file_get_contents()
$links = array();
$domDocument = new DOMDocument();
$domDocument->loadHTML($string);

$domXPath = new DOMXPath($domDocument);
// Union of both list types: anchors under ol > li or ul > li
$results = $domXPath->query("//table/tbody/tr/td/ol/li/a|//table/tbody/tr/td/ul/li/a");
foreach ($results as $result) {
    $links[] = $result->getAttribute("href"); // collect the href attribute
}
print_r($links);
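
The query above matches list links in every table on the page, while the question asks for one specific table only. A minimal sketch of one way to narrow it, assuming the wanted table can be recognised by its "My Links" heading (that marker is an assumption taken from the sample markup, and the second table here is made up for illustration):

```php
<?php
// Sample markup: the first table is the wanted one, the second is unrelated.
$html = <<<HTML
<table><tbody>
  <tr><td><h1>My Links</h1></td></tr>
  <tr><td>
    <ol><li><a href="http://websitelink.com/page1.php">Page 1</a></li></ol>
    <ul><li><a href="http://websitelink.com/a.php">Page A</a></li></ul>
  </td></tr>
</tbody></table>
<table><tbody>
  <tr><td><ul><li><a href="http://websitelink.com/other.php">Other</a></li></ul></td></tr>
</tbody></table>
HTML;

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Keep only tables containing an <h1> that reads "My Links" (assumed marker),
// then take anchors inside <li> items of either list type.
$query = '//table[.//h1[normalize-space()="My Links"]]//*[self::ol or self::ul]/li/a';

$links = [];
foreach ($xpath->query($query) as $a) {
    $links[] = [
        'text' => trim($a->nodeValue),
        'href' => $a->getAttribute('href'),
    ];
}
print_r($links);
```

If the real pages identify the table differently (an id, a class, or its position), the predicate in square brackets would need to change accordingly.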

Answer 1: (score: 0)

$html = file_get_contents('http://mywebsitelink.com/pagename.html');
$dom = new DOMDocument;
@$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

// The query already returns the <a> elements themselves, so they are read
// directly; calling getElementsByTagName('a') on each <a> would find nothing.
$links = $xpath->query('//table/tbody/tr/td/ol/li/a|//table/tbody/tr/td/ul/li/a');

foreach ($links as $onelink) {
    $text = $onelink->nodeValue;
    $href = $onelink->getAttribute('href');
    if (!empty($text) && !empty($href)) {
        echo "Link Title:     " . $text . "       Location:     " . $href . "<br />";
    }
}
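
Since the question mentions roughly 400 pages, the extraction could be wrapped in a function and applied to each page in turn. The following is only a sketch: the function name extractListLinks, the return shape, and the inline sample pages are illustrative assumptions, and in practice each HTML string would come from file_get_contents().

```php
<?php
/**
 * Return text/href pairs for anchors inside <ol> or <ul> items of any table.
 * The name and the return shape are illustrative, not taken from the answers.
 */
function extractListLinks(string $html): array
{
    // Collect parser warnings internally instead of suppressing them with @.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $links = [];
    foreach ($xpath->query('//table//*[self::ol or self::ul]/li/a') as $a) {
        $text = trim($a->nodeValue);
        $href = $a->getAttribute('href');
        if ($text !== '' && $href !== '') {
            $links[] = ['text' => $text, 'href' => $href];
        }
    }
    return $links;
}

// Inline stand-ins for two pages, one using <ol> and one using <ul>.
$pages = [
    'ol-page' => '<table><tr><td><ol><li><a href="http://websitelink.com/page1.php">Page 1</a></li></ol></td></tr></table>',
    'ul-page' => '<table><tr><td><ul><li><a href="http://websitelink.com/a.php">Page A</a></li></ul></td></tr></table>',
];

$all = [];
foreach ($pages as $name => $pageHtml) {
    $all[$name] = extractListLinks($pageHtml);
}
print_r($all);
```

The descendant step (//) is used here instead of the full table/tbody/tr/td path so that the query does not depend on whether the source markup contains a tbody element.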
[...]