Scraping data from multiple links

Time: 2014-08-29 01:15:00

Tags: php html xpath web-scraping domdocument

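I have PHP code that retrieves data from the element with the class name "level0 nav-1 active parent". Is there a way to supply an array of links and loop over it with foreach, using a slightly different class name for each link, instead of repeating the same code for 10 similar links?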

For example:
First link (https://www.postme.com.my/men-1.html) - uses the class "level0 nav-1 active parent"
Second link (https://www.postme.com.my/women.html) - uses the class "level0 nav-2 active parent"
Third link (https://www.postme.com.my/children.html) - uses the class "level0 nav-3 active parent"

Notice the incrementing nav-# number.

Here is the PHP code:

<?php
header('Content-Type: text/html; charset=utf-8');
$grep = new DOMDocument();
@$grep->loadHTMLFile("https://www.postme.com.my/men-1.html");

$finder = new DOMXPath($grep);
$classCat = "level0 nav-1 active parent";

$nodesCat = $finder->query("//*[contains(@class, '$classCat')]");

foreach ($nodesCat as $node) {
    $span = $node->childNodes;
    $replace = str_replace("Items 1-12 of", "", $span->item(1)->nodeValue);

    echo $replace . " : ";
}

// Check another link using the class name "level0 nav-2 active parent"
// (same code repeated)

@$grep->loadHTMLFile("https://www.postme.com.my/women.html");

$finder = new DOMXPath($grep);
$classCat = "level0 nav-2 active parent";

$nodesCat = $finder->query("//*[contains(@class, '$classCat')]");

foreach ($nodesCat as $node) {
    $span = $node->childNodes;
    $replace = $span->item(1)->nodeValue;

    echo $replace . " : ";
}
// Check another link with the class name "level0 nav-3 active parent".
// Notice the incrementing nav-#?
// I don't want to make the code long just because each link uses a slightly different class name to refer to the data.
?>

Thanks

1 answer:

Answer 0 (score: 1)

What I would do is get the parent of those links, the <li> elements under <ul id="nav">, and extract the values from there. For example:

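$dom = new DOMDocument();
// @ suppresses the warnings libxml raises on imperfect real-world HTML
@$dom->loadHTMLFile('https://www.postme.com.my/men-1.html');
$xpath = new DOMXPath($dom);

// Every top-level menu entry, whatever its nav-# class happens to be
$categories = $xpath->query('//ul[@id="nav"]/li');

foreach ($categories as $category) {
    // Query relative to the current <li> and print the text of its first <a><span>
    echo $xpath->query('./a/span', $category)->item(0)->nodeValue . '<br/>';
}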
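
If the per-link class names from the question are still needed, the repetition could be folded into a loop along these lines. This is only a minimal sketch: `extractByClass`, `scrapeCategories`, and the `$fetch` parameter are hypothetical names introduced here, not part of the original code, and the sketch assumes each array key is the nav-# number for its URL. It also collects the full text of each matching element instead of `childNodes->item(1)`, so the selection may need adjusting to the real markup.

```php
<?php
// Hypothetical helper: text of every element whose class attribute
// contains $class as a substring (same XPath test as in the question).
function extractByClass(string $html, string $class): array
{
    $dom = new DOMDocument();
    // Collect libxml warnings internally instead of silencing them with @
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $values = [];
    foreach ($xpath->query("//*[contains(@class, '$class')]") as $node) {
        $values[] = trim($node->textContent);
    }
    return $values;
}

// Hypothetical driver: $pages maps the nav-# number to the URL to fetch.
// $fetch is injectable so the loop can be exercised without network access.
function scrapeCategories(array $pages, ?callable $fetch = null): array
{
    $fetch = $fetch ?? static fn(string $url) => @file_get_contents($url);

    $results = [];
    foreach ($pages as $navIndex => $url) {
        $html = $fetch($url);
        if ($html === false || $html === '') {
            continue; // skip pages that could not be downloaded
        }
        // Only the number changes between links
        $class = "level0 nav-$navIndex active parent";
        $results[$url] = extractByClass($html, $class);
    }
    return $results;
}
```

Calling `scrapeCategories([1 => 'https://www.postme.com.my/men-1.html', 2 => 'https://www.postme.com.my/women.html', 3 => 'https://www.postme.com.my/children.html'])` would then return an array keyed by URL, so adding a tenth link means adding one array entry rather than another copy of the code.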