If I have a webpage like this:
<body>
<header>
<a href='http://domain1.com'>link 1 text</a>
</header>
<a href='http://domain2.com'>link 2 text</a>
<footer>
<a href='http://domain3.com'>link 3 text</a>
</footer>
</body>
How do I pull the <a> tags out of the <body> but exclude the links from <header> and <footer>?
In the real web page, there will be a lot of <a> tags in the <header>, so I'd rather not have to cycle through ALL of them.
I want to pull out the URLs and anchor text from each of the <a> tags that are NOT inside the <header> or <footer> tags.
EDIT: this is how I find links in the header:
$header = $html->find('header', 0);
foreach ($header->find('a') as $a) {
    // do something
}
I would like to do something like this (note the use of "!"):
$foo = $html->find('!header,!footer');
foreach ($foo->find('a') as $a) {
    // do something
}
Answer 0 (score: 1)
Remove the header and footer from the DOM you are working with before you look for the links.
<?php
include("simple_html_dom.php");
$source = <<<EOD
<body>
<header>
<a href='http://domain1.com'>link 1 text</a>
</header>
<a href='http://domain2.com'>link 2 text</a>
<a href='http://domain4.com'>link 4 text</a>
<footer>
<a href='http://domain3.com'>link 3 text</a>
</footer>
</body>
EOD;
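$html = str_get_html($source);
// Blank out the <header> and <footer> elements, including everything inside them.
foreach ($html->find('header, footer') as $unwanted) {
    $unwanted->outertext = "";
}
// Re-parse the modified markup so the blanked elements are gone from the DOM.
$html->load($html->save());
// Only the links outside <header> and <footer> remain.
$links = $html->find("a");
foreach ($links as $link) {
    // Print the URL and the anchor text of each remaining link.
    print $link->href . " " . $link->plaintext . "\n";
}
?>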
Answer 1 (score: 0)
Without destroying the body? You can do it like this:
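// Collect the links that live inside <header> or <footer>.
$bad_as = $html->find('header a, footer a');
foreach ($html->find('a') as $a) {
    // Strict comparison checks object identity instead of comparing node properties.
    if (in_array($a, $bad_as, true)) continue;
    // do something
}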
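The same filter-while-iterating idea can also be written as an ancestor check, which avoids building the list of unwanted links first. The following is only a sketch under stated assumptions: it uses PHP's bundled DOM extension (DOMDocument) rather than simple_html_dom so that it runs without the library, and the helper name isInside is hypothetical.

```php
<?php
// Hypothetical helper: true when $node has an ancestor element named in $tags.
function isInside(DOMNode $node, array $tags): bool
{
    for ($p = $node->parentNode; $p !== null; $p = $p->parentNode) {
        if (in_array($p->nodeName, $tags, true)) {
            return true;
        }
    }
    return false;
}

$source = "<body><header><a href='http://domain1.com'>link 1 text</a></header>"
        . "<a href='http://domain2.com'>link 2 text</a>"
        . "<footer><a href='http://domain3.com'>link 3 text</a></footer></body>";

// libxml2 reports HTML5 tags such as <header> as unknown, so collect the warnings quietly.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($source);
libxml_clear_errors();

// Keep only the links that are not nested in <header> or <footer>.
$kept = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    if (isInside($a, ['header', 'footer'])) {
        continue;
    }
    $kept[$a->getAttribute('href')] = $a->textContent;
}
print_r($kept);
```

This still visits every <a> once, but each check only walks up that link's own ancestors, and the document itself is left unmodified.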
Answer 2 (score: -1)
Simple as it sounds, you can't do this with simple-html-dom.
$html->find('body > a');
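This CSS selector selects all <a> elements whose parent is the <body> element.
You need to iterate over the child nodes of body and then get the <a> elements.
I suggest looking at How do you parse and process HTML/XML in PHP?
In my case, I use Symfony/DomCrawler and Symfony/CssSelector to do this.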
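As one illustration of what another parser makes possible, here is a minimal sketch that uses PHP's bundled DOM extension (DOMDocument and DOMXPath) rather than Symfony or simple_html_dom. The XPath expression and the variable names are assumptions made for this example; it selects every <a> under <body> that has no <header> or <footer> ancestor.

```php
<?php
$source = <<<EOD
<body>
<header>
<a href='http://domain1.com'>link 1 text</a>
</header>
<a href='http://domain2.com'>link 2 text</a>
<footer>
<a href='http://domain3.com'>link 3 text</a>
</footer>
</body>
EOD;

// libxml2 reports HTML5 tags such as <header> as unknown, so collect the warnings quietly.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($source);
libxml_clear_errors();

// Match links anywhere under <body>, except those nested in <header> or <footer>.
$xpath = new DOMXPath($doc);
$query = '//body//a[not(ancestor::header) and not(ancestor::footer)]';

$links = [];
foreach ($xpath->query($query) as $a) {
    $links[] = [
        'url'  => $a->getAttribute('href'),
        'text' => trim($a->textContent),
    ];
}
print_r($links);
```

Because the exclusion is part of the query, the loop only ever sees the wanted links, and nothing has to be removed from the document beforehand.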