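Following the good advice on Stack Overflow, I'm not using regular expressions to parse HTML. Instead I'm using QueryPath, but I'm getting unexpected results that I can't make sense of.
Take this HTML fragment as an example.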
<div class="allLinks">
<h2>Headline 1</h2>
<ul>
<li class="clearfix">
<ul class="cat"><li>Category 1</li></ul>
<a href="/some-link/">Link Title 1</a>
<br /><span><img src="//someimage.gif" alt="Link Info" />
<a href="http://a-link" rel="nofollow">Link Source</a></span>
<br /><span><img src="//someimage.gif" alt="Link Info" />
<a href="http://a-link" rel="nofollow">Link Source</a></span>
<br /><span><img src="//someimage.gif" alt="Link Info" />
<a href="http://a-link" rel="nofollow">Link Source</a></span>
</li>
</ul>
<h2>Headline 2</h2>
<ul>
<li class="clearfix">
<ul class="cat"><li>Category 1</li><li>Category 2</li></ul>
<a href="/some-link/">Link Title 2</a>
<br /><span><img src="//someimage.gif" alt="Link Info" />
<a href="http://a-link" rel="nofollow">Link Source</a></span>
<br /><span><img src="//someimage.gif" alt="Link Info" />
<a href="http://a-link" rel="nofollow">Link Source</a></span>
<br /><span><img src="//someimage.gif" alt="Link Info" />
<a href="http://a-link" rel="nofollow">Link Source</a></span>
</li>
</ul>
<h2>Headline 3</h2>
<ul>
<li class="clearfix">
<ul class="cat"><li>Category 2</li></ul>
<a href="/some-link/">Link Title 3</a>
<br /><span><img src="//someimage.gif" alt="Link Info" />
<a href="http://a-link" rel="nofollow">Link Source</a></span>
<br /><span><img src="//someimage.gif" alt="Link Info" />
<a href="http://a-link" rel="nofollow">Link Source</a></span>
<br /><span><img src="//someimage.gif" alt="Link Info" />
<a href="http://a-link" rel="nofollow">Link Source</a></span>
</li>
</ul>
</div>
What I want to extract from this is the link title (which would be each
Here is how I'm approaching it:
$qp = htmlqp($url, 'div.allLinks'); // Load the fragment from the HTML
foreach ($qp->find('ul') as $items) { // Loop through the <ul> elements
    $title = $items->find('li>a')->text(); // Find <a> elements that are directly under the <li>
    foreach ($items->find('span') as $links) { // Loop through all the <span> elements
        $link_info = $links->find('img')->attr('alt'); // Get the alt text value
        $link = $links->find('a')->attr('href'); // Get the link
        $source = $links->find('a')->text(); // Get the anchor text
    }
}
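However, as soon as an item has more than one category in its unordered list, this seems to get confused and returns the title as "Link Title 2Link Title 3" (concatenating them for some reason).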
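To make the expected result concrete, here is a minimal sketch of the same extraction using PHP's built-in DOMDocument and DOMXPath instead of QueryPath. It is only a point of comparison, not a fix for the QueryPath code: the fragment is trimmed to two items with one link source each, the variable names are made up for illustration, and it assumes that each `<li class="clearfix">` is one item. It also counts the `<ul>` elements, because a plain `ul` selector matches the nested `ul.cat` lists as well as the outer ones, which may or may not be related to the behaviour described above.

```php
<?php
// Trimmed stand-in for the fragment above: two items, one link source each.
$html = <<<'HTML'
<div class="allLinks">
<h2>Headline 1</h2>
<ul>
<li class="clearfix">
<ul class="cat"><li>Category 1</li></ul>
<a href="/some-link/">Link Title 1</a>
<br /><span><img src="//someimage.gif" alt="Link Info" />
<a href="http://a-link" rel="nofollow">Link Source</a></span>
</li>
</ul>
<h2>Headline 2</h2>
<ul>
<li class="clearfix">
<ul class="cat"><li>Category 1</li><li>Category 2</li></ul>
<a href="/some-link/">Link Title 2</a>
<br /><span><img src="//someimage.gif" alt="Link Info" />
<a href="http://a-link" rel="nofollow">Link Source</a></span>
</li>
</ul>
</div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Every <ul> under the wrapper, nested category lists included.
$allUls = $xpath->query('//div[@class="allLinks"]//ul');

// Assumption: one item per <li class="clearfix"> in a top-level list.
$items = $xpath->query('//div[@class="allLinks"]/ul/li[@class="clearfix"]');

$results = [];
foreach ($items as $item) {
    // "a" is relative to the item, so only an <a> that is a direct child matches.
    $title = trim($xpath->evaluate('string(a)', $item));

    $links = [];
    foreach ($xpath->query('span', $item) as $span) {
        $links[] = [
            'info'   => $xpath->evaluate('string(img/@alt)', $span),
            'href'   => $xpath->evaluate('string(a/@href)', $span),
            'source' => trim($xpath->evaluate('string(a)', $span)),
        ];
    }

    $results[] = ['title' => $title, 'links' => $links];
}

print_r($results);
```

Each entry in `$results` holds one title plus its link sources, which is the shape I am trying to get out of QueryPath. Comparing `$allUls->length` with `$items->length` shows how many more elements the bare `ul` loop visits than there are items.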