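Extracting a value from a web page

Date: 2009-10-20 14:36:04

Tags: php screen-scraping

Hi, I have the home page of a website that I am reading with cURL, and I need to get the number of pages the site has.

The information is in a div: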

<div class="pager">
<span class="page-numbers current">1</span>
<a href="/users?page=2" title="go to page 2"><span class="page-numbers">2</span></a>
<a href="/users?page=3" title="go to page 3"><span class="page-numbers">3</span></a>
<a href="/users?page=4" title="go to page 4"><span class="page-numbers">4</span></a>
<a href="/users?page=5" title="go to page 5"><span class="page-numbers">5</span></a>
<span class="page-numbers dots">&hellip;</span>

<a href="/users?page=15" title="go to page 15"><span class="page-numbers">15</span></a>
<a href="/users?page=2" title="go to page 2"><span class="page-numbers next"> next</span></a>
</div>

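The value I need is 15. It could be any number depending on the site, but it is always in the same position.

How can I easily read this value and assign it to a variable in PHP?

Thanks,

Jonathan

6 answers:

Answer 0 (score: 2)

You can use PHP's DOM module. Read the page with DOMDocument::loadHTMLFile(), then create a DOMXPath object and query the document for all span elements that have the attribute class="page-numbers".

(Edit: oops, that's not what you want; see the second code snippet.)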

$html = '<html><head><title>:::</title></head><body>
<div class="pager">
<span class="page-numbers current">1</span>
<a href="/users?page=2" title="go to page 2"><span class="page-numbers">2</span></a>
<a href="/users?page=3" title="go to page 3"><span class="page-numbers">3</span></a>
<a href="/users?page=4" title="go to page 4"><span class="page-numbers">4</span></a>
<a href="/users?page=5" title="go to page 5"><span class="page-numbers">5</span></a>
<span class="page-numbers dots">&hellip;</span>

<a href="/users?page=15" title="go to page 15"><span class="page-numbers">15</span></a>
<a href="/users?page=2" title="go to page 2"><span class="page-numbers next"> next</span></a>
</div>
</body></html>';

$doc = new DOMDocument;
// since the content "is already here" we use loadhtml(content)
// instead of loadhtmlfile(url) 
$doc->loadhtml($html);
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query('//span[@class="page-numbers"]');
echo 'there are ', $nodelist->length, ' span elements having class="page-numbers"';
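One thing worth noting about the query above: `@class="page-numbers"` compares the whole attribute string, so spans such as `class="page-numbers current"` are not matched. The following is a minimal sketch of the difference, assuming a shortened copy of the pager markup. The `$tokenMatch` expression is a common XPath 1.0 idiom for matching a single class name and is not part of the original answer; the variable names are made up for illustration.

```php
<?php
// Shortened copy of the pager markup from the question (assumed sample input).
$html = '<html><body><div class="pager">
<span class="page-numbers current">1</span>
<a href="/users?page=2"><span class="page-numbers">2</span></a>
<span class="page-numbers dots">&hellip;</span>
<a href="/users?page=15"><span class="page-numbers">15</span></a>
<a href="/users?page=2"><span class="page-numbers next"> next</span></a>
</div></body></html>';

$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Exact comparison: the attribute must be exactly "page-numbers".
$exactMatch = '//span[@class="page-numbers"]';
// Token comparison: "page-numbers" may be one of several class names.
$tokenMatch = '//span[contains(concat(" ", normalize-space(@class), " "), " page-numbers ")]';

$exactCount = $xpath->query($exactMatch)->length;
$tokenCount = $xpath->query($tokenMatch)->length;

echo "exact: $exactCount, token: $tokenCount\n";
```

With this sample the exact query finds only the spans for pages 2 and 15, while the token query finds all five spans.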

Edit: does this

<a href="/users?page=15" title="go to page 15"><span class="page-numbers">15</span></a>

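(the second-to-last a element) always point to the last page, i.e. does this link contain the value you are looking for?
If so, you can use an XPath expression to select the second-to-last a element and, from there, its child span element.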

//div[@class="pager"] <- select each <div> where the attribute class equals "pager"
//div[@class="pager"]/a <- select each <a> that is a direct child of the pager div
//div[@class="pager"]/a[position()=last()-1] <- select the <a> that is second but last
//div[@class="pager"]/a[position()=last()-1]/span <- select the direct child <span> of that second but last <a> element in the pager <div>

(You may want to pick up a good XPath tutorial ;-))

$doc->loadhtml($html);
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query('//div[@class="pager"]/a[position()=last()-1]/span');
if ( 0 < $nodelist->length ) {
  echo $nodelist->item(0)->nodeValue;
}
else {
  echo 'not found';
}
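Since the question asks for the value in a PHP variable, the snippet above can be wrapped in a function that returns the number instead of echoing it. This is a sketch under assumptions: `getLastPageNumber` is a hypothetical helper name, the sample markup is a shortened copy of the pager from the question, and `libxml_use_internal_errors()` is used only to keep parse warnings from real-world HTML out of the output.

```php
<?php
/**
 * Hypothetical helper: returns the last page number found in the pager,
 * or null when the pager link is not present.
 */
function getLastPageNumber(string $html): ?int
{
    $doc = new DOMDocument;
    // Collect libxml warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodelist = $xpath->query('//div[@class="pager"]/a[position()=last()-1]/span');
    if ($nodelist === false || $nodelist->length === 0) {
        return null;
    }
    // nodeValue is a string; trim it and convert to an integer.
    return (int) trim($nodelist->item(0)->nodeValue);
}

// Assumed sample input, shortened from the question.
$sample = '<html><body><div class="pager">
<span class="page-numbers current">1</span>
<a href="/users?page=2"><span class="page-numbers">2</span></a>
<a href="/users?page=15"><span class="page-numbers">15</span></a>
<a href="/users?page=2"><span class="page-numbers next"> next</span></a>
</div></body></html>';

$lastPage = getLastPageNumber($sample);
$noPager = getLastPageNumber('<html><body><p>no pager here</p></body></html>');
var_dump($lastPage, $noPager);
```

Returning null for the missing-pager case lets the caller distinguish "single page, no pager" from a real page number.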

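Answer 1 (score: 0)

There is no direct function or simple way to do this. You need to build an HTML parser or use an existing one.

Answer 2 (score: 0)

You can parse it with a regular expression. First find all occurrences of <span class="page-numbers">, then pick the last one: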

// div html code should be in $div_html
preg_match_all('#<span class="page-numbers">(\d+)#', $div_html, $page_numbers);
print_r(end($page_numbers[1])); // prints 15
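The regular expression approach can be tried end to end as below. This is a sketch, assuming a shortened copy of the pager markup in `$div_html`; taking `max()` instead of the last match is a small variation, not the original answer's code, so the result does not depend on the order of the links. Note that the pattern only matches spans whose class attribute is exactly "page-numbers", and it will break if the site changes its markup.

```php
<?php
// Assumed sample input, shortened from the question.
$div_html = '<div class="pager">
<span class="page-numbers current">1</span>
<a href="/users?page=2" title="go to page 2"><span class="page-numbers">2</span></a>
<span class="page-numbers dots">&hellip;</span>
<a href="/users?page=15" title="go to page 15"><span class="page-numbers">15</span></a>
<a href="/users?page=2" title="go to page 2"><span class="page-numbers next"> next</span></a>
</div>';

$lastPage = null; // stays null when nothing matches
if (preg_match_all('#<span class="page-numbers">(\d+)#', $div_html, $page_numbers) > 0) {
    // The captured digits are strings; convert them and keep the largest.
    $lastPage = max(array_map('intval', $page_numbers[1]));
}

var_dump($lastPage);
```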

Answer 3 (score: 0)

This is something you probably want to use XPath for. It requires loading the page as a DOM document object:

$domDoc = new DOMDocument();
$domDoc->loadHTMLFile("http://path/to/yourfile.html");
$xp = new DOMXPath($domDoc);
$nodes = $xp->query("//xpath/to/relevant/node");
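// query() returns a DOMNodeList; take the text of the first match, if any
$value = $nodes->length > 0 ? $nodes->item(0)->nodeValue : null;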

I haven't written proper XPath in a while, so you should do some reading to figure out that part, but it shouldn't be too hard.

Answer 4 (score: 0)

Maybe

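// $dom is assumed to be a DOMDocument that already holds the page
$nodes = $dom->getElementsByTagName("span");
$maxPageNum = 0;
foreach ($nodes as $node)
{
    // The span text is a string; convert it so the comparison is numeric
    $pageNum = (int) trim($node->nodeValue);
    if ($node->getAttribute("class") == "page-numbers" && $pageNum > $maxPageNum)
    {
        $maxPageNum = $pageNum;
    }
}

I don't know PHP, so accessing a DOM node's class and inner text may not be exactly as written here, but there must be some way to get that information, and this approach should work.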

Answer 5 (score: 0)

Just wanted to say a big thank you to Volkerk for the help; it worked really well. I had to make a few small changes and ended up with this:

function getusers($userurl)
{
    $perPage = 35; // users shown on each full page

    $sSourceData = file_get_contents($userurl);
    $doc = new DOMDocument();
    @$doc->loadHTML($sSourceData); // @ hides warnings about malformed HTML

    $xpath = new DOMXPath($doc);
    $nodelist = $xpath->query('//div[@class="pager"]/a[position()=last()-1]/span');
    if ( 0 < $nodelist->length ) {
        // The pager exists: the second-to-last link holds the last page number
        $lastpage = (int) $nodelist->item(0)->nodeValue;
        // Every page before the last one is full
        $users = ($lastpage - 1) * $perPage;

        // Load the last page and count the users actually listed on it
        $sSourceData = file_get_contents($userurl.'?page='.$lastpage);
        $doc = new DOMDocument();
        @$doc->loadHTML($sSourceData);
        $xpath = new DOMXPath($doc);
        $nodelist = $xpath->query('//div[@class="user-details"]');
        $users = $users + $nodelist->length;
        echo 'there are ', $users, ' users';
    }
    else {
        // No pager: all users fit on a single page
        $nodelist = $xpath->query('//div[@class="user-details"]');
        echo 'there are ', $nodelist->length, ' users';
    }
}