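In my code I want to extract all the links, together with their text, from my old website. I have managed to do that, but in some places I used ol>li markup and in others ul>li markup inside the table, and I have about 400 different pages. I can extract all the links, but every time I have to change ol to ul.
The simplest and most time-saving way to extract the links and their text from all pages would be to target the specific <table> that contains the links. But when I target <table>, it also extracts all the links from the other tables, which I don't want.
Target structure, containing ol>li or ul>li markup: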
<table style="width:850px;" cellspacing="0" cellpadding="1" border="3">
<tbody>
<tr>
<td style="text-align: center; background-color: rgb(51, 51, 204);">
<h1>My Links</h1>
</td>
</tr>
<tr>
<td>
<ol>
<li><a href="http://websitelink.com/page1.php">Page 1</a></li>
<li><a href="http://websitelink.com/page2.php">Page 2</a></li>
<li><a href="http://websitelink.com/page3.php">Page 3</a></li>
<li><a href="http://websitelink.com/page4.php">Page 4</a></li>
</ol>
...
<ul>
<li><a href="http://websitelink.com/a.php">Page A</a></li>
<li><a href="http://websitelink.com/b.php">Page B</a></li>
<li><a href="http://websitelink.com/c.php">Page C</a></li>
<li><a href="http://websitelink.com/d.php">Page D</a></li>
</ul>
</td>
</tr>
</tbody>
</table>
My current PHP code:
$html = file_get_contents('http://mywebsitelink.com/pagename.html');
$dom = new DOMDocument;
@$dom->loadHTML($html);
$oltags = $dom->getElementsByTagName('ol'); // I have to change between ul and ol instead of this I can define table
foreach ($oltags as $list){
$links = $list->getElementsByTagName('a');
foreach ($links as $href){
$text = $href->nodeValue;
$href = $href->getAttribute('href');
if(!empty($text) && !empty($href)) {
echo "Link Title: " . $text . " Location: " . $href . "<br />";
}
}
}
Answer 0 (score: 1)
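You can try this. Here we use DOMDocument and DOMXPath to query the anchors that sit inside li elements.
XPath query: //table/tbody/tr/td/ol/li/a|//table/tbody/tr/td/ul/li/a
Here the | operator matches either //table/tbody/tr/td/ol/li/a or //table/tbody/tr/td/ul/li/a, so one query covers both list types.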
$links=array();
$domDocument = new DOMDocument();
$domDocument->loadHTML($string); // $string holds the page's HTML, e.g. from file_get_contents()
$domXPath = new DOMXPath($domDocument);
$results = $domXPath->query("//table/tbody/tr/td/ol/li/a|//table/tbody/tr/td/ul/li/a"); //querying domdocument
foreach($results as $result)
{
$links[]=$result->getAttribute("href");//gathering href attribute
}
print_r($links);
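Since the question asks for the link text as well as the location, here is a minimal sketch of the same union query that collects both. The sample markup, the function name extractListLinks and the array keys are only illustrative assumptions, not part of the original code.

```php
<?php
// Hypothetical sample markup standing in for one of the pages.
$html = <<<HTML
<table><tbody><tr><td>
<ol><li><a href="http://websitelink.com/page1.php">Page 1</a></li></ol>
<ul><li><a href="http://websitelink.com/a.php">Page A</a></li></ul>
</td></tr></tbody></table>
HTML;

// Illustrative helper (name is an assumption): returns text/href pairs
// for every anchor matched by the ol|ul union query.
function extractListLinks(string $html): array
{
    $dom = new DOMDocument();
    // Collect libxml parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $query = '//table/tbody/tr/td/ol/li/a|//table/tbody/tr/td/ul/li/a';

    $links = [];
    foreach ($xpath->query($query) as $a) {
        $text = trim($a->textContent);
        $href = $a->getAttribute('href');
        // Skip anchors that lack either a label or a target.
        if ($text !== '' && $href !== '') {
            $links[] = ['text' => $text, 'href' => $href];
        }
    }
    return $links;
}

print_r(extractListLinks($html));
```

Note that the path only matches when the page source really contains tbody, as the target structure in the question does.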
Answer 1 (score: 0)
$html = file_get_contents('http://mywebsitelink.com/pagename.html');
$dom = new DOMDocument;
@$dom->loadHTML($html);
$xpath = new DOMXpath($dom);
// The query already returns the <a> elements themselves, so no inner getElementsByTagName('a') lookup is needed
$thetags = $xpath->query('//table/tbody/tr/td/ol/li/a|//table/tbody/tr/td/ul/li/a');
foreach($thetags as $onelink)
{
$text = $onelink->nodeValue;
$href = $onelink->getAttribute('href');
if(!empty($text) && !empty($href)) {
echo "Link Title: " . $text . " Location: " . $href . "<br />";
}
}
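If links from other tables still slip in, one possible refinement is to add a predicate that selects only the table you want. The sketch below assumes the target table can be recognised by its "My Links" heading, as in the structure from the question. The second table and its other.php link are hypothetical, added only to show that they are skipped.

```php
<?php
// Hypothetical page with two tables; only the one headed "My Links" is wanted.
$html = <<<HTML
<table><tbody>
<tr><td><h1>My Links</h1></td></tr>
<tr><td>
<ol><li><a href="http://websitelink.com/page1.php">Page 1</a></li></ol>
<ul><li><a href="http://websitelink.com/a.php">Page A</a></li></ul>
</td></tr>
</tbody></table>
<table><tbody><tr><td>
<ul><li><a href="http://websitelink.com/other.php">Other</a></li></ul>
</td></tr></tbody></table>
HTML;

$dom = new DOMDocument();
@$dom->loadHTML($html); // suppress parser warnings, as in the code above

$xpath = new DOMXPath($dom);
// The predicate keeps only tables containing an <h1> whose text is "My Links";
// (self::ol or self::ul) accepts either list type in a single path.
$query = '//table[.//h1[normalize-space()="My Links"]]'
       . '//*[self::ol or self::ul]/li/a';

$found = [];
foreach ($xpath->query($query) as $a) {
    // Map each href to its trimmed link text.
    $found[$a->getAttribute('href')] = trim($a->textContent);
}
print_r($found);
```

Matching on the heading is only one way to single out the table. If your pages mark it differently (for example with a distinctive attribute), the predicate would need to change accordingly.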
[...]