Question

I'm trying to scrape a website using the DOMXPath query method. I have successfully scraped the 20 profile URLs, one for each News Anchor, from this page.

$url = "http://www.sandiego6.com/about-us/meet-our-team";
$xPath = "//p[@class='bio']/a/@href";

$html = new DOMDocument();
@$html->loadHtmlFile($url);
$xpath = new DOMXPath( $html );
$nodelist = $xpath->query($xPath);

$profileurl = array();
foreach ($nodelist as $n){
    $value = $n->nodeValue;
    $profileurl[] = $value;

    }

I'm using the resulting array as the list of URLs to scrape data from each News Anchor's bio page.

$imgurl = array();
    for($z=0;$z<$elementCount;$z++){
        $html = new DOMDocument();
        @$html->loadHtmlFile($profileurl[$z]);
        $xpath = new DOMXPath($html);
        $nodelist = $xpath->query("//img[@class='photo fn']/@src");

        foreach($nodelist as $n){
            $value = $n->nodeValue;
            $imgurl[] = $value;
        }
    }

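Each News Anchor profile page has 6 xPaths I need to scrape (the $imgurl array is one of them). I then send the scraped data to MySQL.

So far, everything works great - except when I try to get the Twitter URL from each profile, because this element isn't found on every News Anchor profile page. As a result, MySQL receives 5 columns with 20 full rows and 1 column (twitterurl) with only 18 rows of data. Those 18 rows don't line up with the other data, because an xPath that doesn't exist seems to be skipped.

How do I account for missing xPaths? Looking for an answer, I found someone's statement that said, "The nodeValue can never be null because without a value, the node wouldn't exist." With that in mind, if there is no nodeValue, how can I programmatically recognize when these xPaths don't exist and fill that iteration with some other default value before it loops on to the next iteration?

Here's the query for the Twitter URLs: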

$twitterurl = array();
    for($z=0;$z<$elementCount;$z++){
        $html = new DOMDocument();
        @$html->loadHtmlFile($profileurl[$z]);
        $xpath = new DOMXPath($html);
        $nodelist = $xpath->query("//*[@id='bio']/div[2]/p[3]/a/@href");

        foreach($nodelist as $n){
            $value = $n->nodeValue;
            $twitterurl[] = $value;
        }
    }

Answer 1

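Since the twitter node appears either zero times or once, change the foreach to

$twitterurl[] = $nodelist->length ? $nodelist->item(0)->nodeValue : NULL;

That will keep the contents in sync. You will, however, have to make arrangements to handle NULL values in the query you use to insert them into the database.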
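To see why the length check keeps the arrays aligned, here is a minimal offline sketch. The two HTML snippets and the `tw` class are made up for illustration (they are not the real page markup); only the ternary on `$nodelist->length` is the technique from this answer.

```php
<?php
// Two made-up profile pages: only the first one contains a Twitter link.
$pages = [
    '<div id="bio"><img class="photo fn" src="a.jpg"><p class="tw"><a href="http://twitter.com/a">t</a></p></div>',
    '<div id="bio"><img class="photo fn" src="b.jpg"></div>',
];

libxml_use_internal_errors(true); // keep parser notices out of the output

$imgurl     = [];
$twitterurl = [];

foreach ($pages as $page) {
    $html = new DOMDocument();
    $html->loadHTML($page);
    $xpath = new DOMXPath($html);

    $img = $xpath->query("//img[@class='photo fn']/@src");
    $tw  = $xpath->query("//p[@class='tw']/a/@href");

    // Exactly one entry per page and per field, NULL when the node is missing.
    $imgurl[]     = $img->length ? $img->item(0)->nodeValue : null;
    $twitterurl[] = $tw->length ? $tw->item(0)->nodeValue : null;
}

var_dump($imgurl, $twitterurl);
```

Both arrays end up with one entry per page, so index 1 in `$twitterurl` still belongs to the same profile as index 1 in `$imgurl`.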

Answer 2

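I think you have multiple issues in the way you scrape the data. I'll try to outline them in my answer, in the hope that this also clarifies your central question:

"I found someone's statement that said, 'The nodeValue can never be null because without a value, the node wouldn't exist.' With that in mind, if there is no nodeValue, how can I programmatically recognize when these xPaths don't exist and fill that iteration with some other default value before it loops on to the next iteration?"

First of all, collecting the URL of each profile (detail) page is a good idea. You can benefit from it even more by putting it into the overall context of your scraping job: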

* profile pages
     `- profile page
          +- name
          +- role
          +- img
          +- email
          +- facebook
          `- twitter

This is the structure of the data you want to obtain. You have already managed to obtain all profile page URLs:

$url   = "http://www.sandiego6.com/about-us/meet-our-team";
$xPath = "//p[@class='bio']/a/@href";

$html = new DOMDocument();
@$html->loadHtmlFile($url);
$xpath    = new DOMXPath($html);
$nodelist = $xpath->query($xPath);

$profileurl = array();
foreach ($nodelist as $n) {
    $value        = $n->nodeValue;
    $profileurl[] = $value;
}

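Since you know that the next step is to load and query the 20+ profile pages, one of the very first things you could do is extract the part of your code that creates a DOMXPath from a URL into a function of its own. This also makes better error handling easy: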

/**
 * @param string $url
 *
 * @throws RuntimeException
 * @return DOMXPath
 */
function xpath_from_url($url)
{
    $html   = new DOMDocument();
    $saved  = libxml_use_internal_errors(true);
    $result = $html->loadHtmlFile($url);
    libxml_use_internal_errors($saved);
    if (!$result) {
        throw new RuntimeException(sprintf('Failed to load HTML from "%s"', $url));
    }
    $xpath = new DOMXPath($html);
    return $xpath;
}

Merely extracting (moving) that code into the xpath_from_url function already changes the main processing into a more compact form:

$xpath    = xpath_from_url($url);
$nodelist = $xpath->query($xPath);

$profileurl = array();
foreach ($nodelist as $n) {
    $value        = $n->nodeValue;
    $profileurl[] = $value;
}

But it also allows another change to the code: you can now process the URLs directly within the structure of your main extraction routine:

$url = "http://www.sandiego6.com/about-us/meet-our-team";

$xpath       = xpath_from_url($url);
$profileUrls = $xpath->query("//p[@class='bio']/a/@href");
foreach ($profileUrls as $profileUrl) {
    $profile = xpath_from_url($profileUrl->nodeValue);
    // ... extract the six (incl. optional) values from a profile
}

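As you can see, this code skips creating the array of profile URLs, because the first xpath operation already gives you a collection of all profile URLs.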
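Because xpath_from_url throws a RuntimeException when a page cannot be loaded, the loop can decide per profile what to do with a failure instead of silently losing a row. The following is only a sketch under assumptions: a temporary local file stands in for a profile page, a non-existent path stands in for a broken URL, and the $names and $failed arrays are illustrative names, not part of the original code.

```php
<?php
function xpath_from_url($url)
{
    $html   = new DOMDocument();
    $saved  = libxml_use_internal_errors(true);
    $result = $html->loadHTMLFile($url);
    libxml_use_internal_errors($saved);
    if (!$result) {
        throw new RuntimeException(sprintf('Failed to load HTML from "%s"', $url));
    }
    return new DOMXPath($html);
}

// Assumption: a temporary local file plays the role of a reachable profile page.
$good = tempnam(sys_get_temp_dir(), 'profile');
file_put_contents($good, '<html><body><h1 class="entry-title">Example Anchor</h1></body></html>');

// The second entry does not exist and plays the role of a broken profile URL.
$urls = [$good, $good . '.missing'];

$names  = [];
$failed = [];
foreach ($urls as $url) {
    try {
        $profile = xpath_from_url($url);
        $names[] = $profile->evaluate('normalize-space(//h1[@class="entry-title"])');
    } catch (RuntimeException $e) {
        // Record the failure and carry on with the next profile.
        $failed[] = $url;
    }
}

unlink($good);
var_dump($names, $failed);
```

Whether a failed page should be skipped, retried, or stored with default values is a design choice; the point is only that the exception makes the failure visible at the place where the row is built.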

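What is still missing is the part that extracts the up to six fields from the detail page. With this new way of iterating over the profile URLs, that is pretty easy to manage - just create one xpath expression per field and fetch the data. If you use DOMXPath::evaluate instead of DOMXPath::query, you can get string values directly. The string value of a non-existing node is an empty string. This does not really test whether the node exists; if you need NULL instead of "" (empty string), it has to be done differently (I can show that too, but it is not the point right now). In the following example the anchor's name and role are extracted: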

foreach ($profileUrls as $i => $profileUrl) {
    $profile = xpath_from_url($profileUrl->nodeValue);
    printf(
        "#%02d: %s (%s)\n", $i + 1,
        $profile->evaluate('normalize-space(//h1[@class="entry-title"])'),
        $profile->evaluate('normalize-space(//h2[@class="fn"])')
    );
    // ... extract the other four (incl. optional) values from a profile
}

I chose to output the values directly (rather than adding them to an array or a similar structure), so that it is easy to follow what happens:

#01: Marc Bailey (Morning Anchor)
#02: Heather Myers (Morning Anchor)
#03: Jim Patton (10pm Anchor)
#04: Neda Iranpour (10 p.m. Anchor / Reporter)
...

Fetching the details about email, Facebook and Twitter works the same way:

foreach ($profileUrls as $i => $profileUrl) {
    $profile = xpath_from_url($profileUrl->nodeValue);
    printf(
        "#%02d: %s (%s)\n", $i + 1,
        $profile->evaluate('normalize-space(//h1[@class="entry-title"])'),
        $profile->evaluate('normalize-space(//h2[@class="fn"])')
    );
    printf(
        " email...: %s\n",
        $profile->evaluate('substring-after(//*[@class="bio-email"]/a/@href, ":")')
    );
    printf(
        " facebook: %s\n",
        $profile->evaluate('string(//*[@class="bio-facebook url"]/a/@href)')
    );
    printf(
        " twitter.: %s\n",
        $profile->evaluate('string(//*[@class="bio-twitter url"]/a/@href)')
    );
}

This code now outputs the data as needed (I have left out the images because they do not display well in text mode):

#01: Marc Bailey (Morning Anchor)
 email...: m.bailey@sandiego6.com
 facebook: https://www.facebook.com/marc.baileySD6
 twitter.: http://www.twitter.com/MarcBaileySD6
#02: Heather Myers (Morning Anchor)
 email...: heather.myers@sandiego6.com
 facebook: https://www.facebook.com/heather.myersSD6
 twitter.: http://www.twitter.com/HeatherMyersSD6
#03: Jim Patton (10pm Anchor)
 email...: jim.patton@sandiego6.com
 facebook: https://www.facebook.com/Jim.PattonSD6
 twitter.: http://www.twitter.com/JimPattonSD6
#04: Neda Iranpour (10 p.m. Anchor / Reporter)
 email...: Neda.Iranpour@sandiego6.com
 facebook: https://www.facebook.com/lightenupwithneda
 twitter.: http://www.twitter.com/@LightenUpWNeda
...
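For a profile without a Twitter link, the evaluate call with string() yields an empty string rather than skipping anything. If NULL is preferred over "", one possible approach is to query the node list first and check its length. The sketch below runs offline against a made-up snippet that reuses the class names from the expressions above; string_or_null is a hypothetical helper name, not part of DOMXPath.

```php
<?php
libxml_use_internal_errors(true); // keep parser notices out of the output

// Made-up profile markup: a Facebook link is present, a Twitter link is not.
$html = new DOMDocument();
$html->loadHTML('<div class="bio-facebook url"><a href="https://www.facebook.com/example">fb</a></div>');
$profile = new DOMXPath($html);

// Hypothetical helper: value of the first matching node, or NULL if none matches.
function string_or_null(DOMXPath $xpath, $expression)
{
    $nodes = $xpath->query($expression);
    return $nodes->length ? $nodes->item(0)->nodeValue : null;
}

// string() on an empty node-set evaluates to the empty string.
$facebook = $profile->evaluate('string(//*[@class="bio-facebook url"]/a/@href)');
$twitter  = $profile->evaluate('string(//*[@class="bio-twitter url"]/a/@href)');

// The helper distinguishes "missing" from "present but empty".
$twitterOrNull = string_or_null($profile, '//*[@class="bio-twitter url"]/a/@href');

var_dump($facebook, $twitter, $twitterOrNull);
```

Either way each profile produces exactly one value per field, which is what keeps the columns aligned.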

So these few lines of code with a single foreach loop already represent the structure outlined at the beginning fairly well:

* profile pages
     `- profile page
          +- name
          +- role
          +- img
          +- email
          +- facebook
          `- twitter

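So all you have to do is make your code follow the overall structure in which the data is available. Then, once you see that all the data can be obtained as you wish, you do the store operation in the database: one insert per profile, that is, one row per profile. You don't have to keep the whole data set around; you just insert the data for each row (perhaps after checking whether it already exists).

Hope that helps.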
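As a rough sketch of "one insert per profile", the following uses PDO with a prepared statement. Everything specific here is an assumption for illustration: an in-memory SQLite database stands in for MySQL (the pdo_sqlite extension must be available), the profiles table and its columns are invented, and the two rows are sample data rather than scraped values.

```php
<?php
// Assumption: in-memory SQLite instead of the real MySQL connection.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE profiles (name TEXT, role TEXT, twitter TEXT)');

$insert = $pdo->prepare(
    'INSERT INTO profiles (name, role, twitter) VALUES (:name, :role, :twitter)'
);

// Sample rows as the scraping loop might produce them; '' means "no Twitter link".
$profiles = [
    [':name' => 'Example Anchor', ':role' => 'Morning Anchor', ':twitter' => 'http://www.twitter.com/example'],
    [':name' => 'Another Anchor', ':role' => 'Reporter', ':twitter' => ''],
];

foreach ($profiles as $row) {
    // Store a missing value as NULL instead of an empty string.
    if ($row[':twitter'] === '') {
        $row[':twitter'] = null;
    }
    $insert->execute($row); // one insert per profile, one row per profile
}

$rows = $pdo->query('SELECT name, twitter FROM profiles ORDER BY rowid')
            ->fetchAll(PDO::FETCH_ASSOC);
var_dump($rows);
```

In the real script the execute call would sit inside the foreach over $profileUrls, so each row is written as soon as its profile page has been read.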

Appendix: Code in full

<?php
/**
 * Scraping detail pages based on index page
 */

/**
 * @param string $url
 *
 * @throws RuntimeException
 * @return DOMXPath
 */
function xpath_from_url($url)
{
    $html   = new DOMDocument();
    $saved  = libxml_use_internal_errors(true);
    $result = $html->loadHtmlFile($url);
    libxml_use_internal_errors($saved);
    if (!$result) {
        throw new RuntimeException(sprintf('Failed to load HTML from "%s"', $url));
    }
    $xpath = new DOMXPath($html);
    return $xpath;
}

$url = "http://www.sandiego6.com/about-us/meet-our-team";

$xpath       = xpath_from_url($url);
$profileUrls = $xpath->query("//p[@class='bio']/a/@href");
foreach ($profileUrls as $i => $profileUrl) {
    $profile = xpath_from_url($profileUrl->nodeValue);
    printf(
        "#%02d: %s (%s)\n", $i + 1,
        $profile->evaluate('normalize-space(//h1[@class="entry-title"])'),
        $profile->evaluate('normalize-space(//h2[@class="fn"])')
    );
    printf(" email...: %s\n", $profile->evaluate('substring-after(//*[@class="bio-email"]/a/@href, ":")'));
    printf(" facebook: %s\n", $profile->evaluate('string(//*[@class="bio-facebook url"]/a/@href)'));
    printf(" twitter.: %s\n", $profile->evaluate('string(//*[@class="bio-twitter url"]/a/@href)'));
}

How do I account for a missing xPath and keep my data uniform when scraping a website using the DOMXPath query method?
