Question

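I want to use selectors to extract content from specific parts of a website, and I'm using Simple HTML DOM to do it. For some reason, though, more data is returned than what is inside the selector I specified. I checked the Simple HTML DOM FAQ but didn't see anything that helps, and I couldn't find anything on Stack Overflow either.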

I'm trying to get the content/href of every h2 class="hed" tag contained in the ul class="river" on this page: http://www.theatlantic.com/most-popular/

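In my output I'm receiving a lot of data from other tags, such as p class="dek has-dek", which are not inside the h2 tags and should not be included. This is really strange, because I thought the code would only grab the content within those tags.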

How can I limit the output to only the data contained in the h2 tags?

Here is the code I'm using:

<div class='rcorners1'>
<?php
include_once('simple_html_dom.php');

$target_url = "http://www.theatlantic.com/most-popular/";

$html = new simple_html_dom();

$html->load_file($target_url);

$posts = $html->find('ul[class=river]');
$limit = 10;
$limit = count($posts) < $limit ? count($posts) : $limit;
for ($i = 0; $i < $limit; $i++) {
  $post = $posts[$i];
  $post->find('h2[class=hed]',0)->outertext = "";
  echo strip_tags($post, '<p><a>');
}
?>
</div>

Output can be seen here. Instead of just the handful of article links, I'm getting author information and other details about the articles.

Answer 1

You are not outputting the h2 content; you are echoing the content of the ul:

echo strip_tags($post, '<p><a>');
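To see why the summaries end up in the output, it may help to run strip_tags on its own. The sketch below uses made-up markup (the $item string is an assumption, not copied from the real page) and shows that the text of every element survives, while only the tags outside the allowed list are removed:

```php
<?php
// Hypothetical markup standing in for one entry of the list; not taken from the real page.
$item = '<li><a href="/a/1"><h2 class="hed">First headline</h2></a>'
      . '<p class="dek has-dek">Summary text</p></li>';

// strip_tags() drops every tag that is not in the allowed list,
// but it keeps the text of ALL elements, including the <p> summary.
$stripped = strip_tags($item, '<p><a>');

echo $stripped, "\n";
// prints: <a href="/a/1">First headline</a><p class="dek has-dek">Summary text</p>
```

So strip_tags is not a filter for "only the h2"; the selection has to happen before the string is passed to it.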

Note that the statement before the echo does not modify $post:

$post->find('h2[class=hed]',0)->outertext = "";

Change the code to:

$hed = $post->find('h2[class=hed]',0);
echo strip_tags($hed, '<p><a>');

However, this only handles the first h2 that is found, so you need another loop. Here is a rewrite of the code that follows load_file:

$posts = $html->find('ul[class=river]');
foreach($posts as $postNum => $post) {
    if ($postNum >= 10) break; // limit reached
    $heds = $post->find('h2[class=hed]');
    foreach($heds as $hed) {
        echo strip_tags($hed, '<p><a>');
    }
}
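The difference between echoing the container and echoing the headings can be reproduced without the library. The following is only a sketch: it swaps Simple HTML DOM for PHP's built-in DOMDocument and DOMXPath, and the $sample markup is a made-up stand-in for the real page.

```php
<?php
// Hypothetical sample markup; the structure of the real page is an assumption.
$sample = <<<HTML
<ul class="river">
  <li>
    <a href="/articles/1"><h2 class="hed">First headline</h2></a>
    <p class="dek has-dek">First summary</p>
  </li>
  <li>
    <a href="/articles/2"><h2 class="hed">Second headline</h2></a>
    <p class="dek has-dek">Second summary</p>
  </li>
</ul>
HTML;

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($sample);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Text of the whole container: headlines and summaries mixed together.
$river = $xpath->query('//ul[@class="river"]')->item(0);
$containerText = preg_replace('/\s+/', ' ', trim($river->textContent));

// Text of the h2.hed elements only, capped at 10 as in the loop above.
$headlines = [];
foreach ($xpath->query('.//h2[@class="hed"]', $river) as $i => $h2) {
    if ($i >= 10) break;
    $headlines[] = trim($h2->textContent);
}

echo $containerText, "\n";          // includes the summaries
echo implode("\n", $headlines), "\n"; // headlines only
```

The inner query plays the same role as the nested find('h2[class=hed]') call: only what it returns is printed, so the summaries never reach the output.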

If you still need to clear the outertext, you can do so with $hed:

$hed->outertext = "";

Answer 2

You really only need one loop. Consider this:

foreach($html->find('ul.river > h2.hed') as $postNum => $h2) {
  if ($postNum >= 10) break;
  echo strip_tags($h2, '<p><a>') . "\n"; // the text
  echo $h2->parent->href . "\n"; // the href
}
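For comparison, here is a rough equivalent of the single-loop idea using PHP's built-in DOMXPath instead of Simple HTML DOM. It is a sketch under stated assumptions: the $sample markup is invented, and it assumes each h2 is wrapped directly in the a element that carries the href, which may not match the live page.

```php
<?php
// Hypothetical sample markup; assumes the <a> directly wraps the <h2>.
$sample = '<ul class="river"><li>'
        . '<a href="/articles/1"><h2 class="hed">First headline</h2></a>'
        . '<p class="dek has-dek">First summary</p>'
        . '</li></ul>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($sample);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$links = [];

// One loop: every h2.hed anywhere below ul.river, at most 10 of them.
foreach ($xpath->query('//ul[@class="river"]//h2[@class="hed"]') as $i => $h2) {
    if ($i >= 10) break;
    $parent = $h2->parentNode;
    // Read the href from the parent only if it is an element that has one.
    $href = ($parent instanceof DOMElement && $parent->hasAttribute('href'))
        ? $parent->getAttribute('href')
        : null;
    $links[] = ['text' => trim($h2->textContent), 'href' => $href];
}

foreach ($links as $link) {
    echo $link['text'], "\n";        // the text
    echo $link['href'] ?? '', "\n";  // the href (empty if none was found)
}
```

Collecting the pairs into $links first, rather than echoing inside the query loop, keeps extraction and presentation separate; that split is a design choice of this sketch, not something the answer prescribes.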

Simple HTML DOM crawler returns more content than in attributes
