Extracting specific data from a web page with PHP and DOM

Date: 2012-07-24 00:37:00

Tags: php dom web-scraping

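Using PHP and DOM, how can I get PLACE, ADDRESS, LOCALITY, REGION, POSTAL CODE and COUNTRY from the following code (part of a web page)?

So far I have written part of the code, which fetches other content. This is what I have: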

$dochtml = new DOMDocument();
$dochtml->loadHTMLFile('');
$xpath = new DOMXPath($dochtml);

$descr = $xpath->query('//div[@class="description"]')->item(0);
print_r($descr->nodeValue);

$abbr = $dochtml->getElementsByTagName("abbr")->item(0);
$title = $abbr->getAttribute("title");
echo $title;

Here is the relevant part of the page:

<div class="vcard location p">
    <div class="fn org">
        <a href="link here">PLACE</a>
    </div>
    <div class="adr">
        <div class="street-address">ADDRESS<br></div>
        <div>
            <span class="locality">LOCALITY</span>,
            <span class="region">REGION</span>
            <span class="postal-code">POSTAL CODE</span>,
            <span class="country-name">COUNTRY</span>
        </div>
    </div>
</div>

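Update

I have a small problem with the following: the page contains many <abbr> tags, but the two I want, dtstart and dtend, sit inside #eventDetailInfo. Unfortunately, not every page contains the abbr tag with class=dtend, so the code picks up the first one from "Related events" instead. So my question is: how do I restrict the lookup to this specific ID only?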

3 Answers:

Answer 0 (score: 3)

After reading the DOMXPath documentation, I suggest the solution outlined below.

Getting elements by class

$nodes = $xpath->query('//div[contains(@class, "street-address")]');
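
One caveat with contains(@class, ...) is that it matches substrings, so "fn" would also match a class such as "fnord". A minimal sketch of a stricter match on whole class tokens follows; the sample markup and the classPredicate() helper are hypothetical and only illustrate the idea.

```php
<?php
// Hypothetical markup: "fnord" contains the substring "fn" but is a different class.
$html = '<div class="fnord"><a href="#">WRONG</a></div>'
      . '<div class="fn org"><a href="#">PLACE</a></div>';

$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

// Hypothetical helper: builds an XPath predicate that matches a whole class token.
// Padding the class list with spaces lets " fn " match only the complete word.
function classPredicate(string $class): string
{
    return sprintf('contains(concat(" ", normalize-space(@class), " "), " %s ")', $class);
}

$loose  = $xpath->query('//div[contains(@class, "fn")]');       // substring match
$strict = $xpath->query('//div[' . classPredicate('fn') . ']'); // whole-token match

$looseCount  = $loose->length;
$strictCount = $strict->length;
$strictText  = $strict->item(0)->nodeValue;
```

With this sample, the loose query matches both div elements while the strict one matches only class="fn org".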

Getting elements by ID

$node = $xpath->query('//div[@id="someid"]');
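
To restrict a lookup to one container, DOMXPath::query() also accepts a context node as its second argument, and a query starting with "." is then evaluated relative to that node. The following is a sketch under assumed markup (the "related" block is invented for illustration), not necessarily how the asker's page is structured.

```php
<?php
// Hypothetical page: a "related events" block precedes the event's own details.
$html = '<div id="related"><abbr class="dtstart" title="2012-01-01T10:00:00">Other</abbr></div>'
      . '<div id="eventDetailInfo">'
      . '<div><abbr class="dtstart" title="2012-07-16T21:00:00">Monday, July 16th, 2012</abbr></div>'
      . '</div>';

$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

// Locate the container once, then run relative queries (leading ".") inside it.
$container = $xpath->query('//div[@id="eventDetailInfo"]')->item(0);

$start = null;
$end = null;
if ($container !== null) {
    $startNode = $xpath->query('.//abbr[contains(@class, "dtstart")]', $container)->item(0);
    $endNode   = $xpath->query('.//abbr[contains(@class, "dtend")]', $container)->item(0);
    $start = $startNode ? $startNode->getAttribute('title') : null;
    $end   = $endNode ? $endNode->getAttribute('title') : null; // stays null: this sample has no dtend
}
```

Because the queries are scoped to the container, a missing dtend yields null instead of falling through to an abbr elsewhere on the page.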

Solution

To extract your values, you can use something like this:

<?php
$html = '<div class="vcard location p">
    <div class="fn org">
        <a href="link here">PLACE</a>
    </div>
    <div class="adr">
        <div class="street-address">ADDRESS<br></div>
        <div>
            <span class="locality">LOCALITY</span>,
            <span class="region">REGION</span>
            <span class="postal-code">POSTAL CODE</span>,
            <span class="country-name">COUNTRY</span>
        </div>
    </div>
    <div id="eventDetailInfo">
        <div class="p">
         <div><abbr class="dtstart" title="2012-07-16T21:00:00">Monday, July 16th, 2012</abbr></div>    
         <div><abbr class="dtend" title="2012-08-16T21:00:00">Monday, August 16th, 2012</abbr></div>    
        </div>
    </div>
</div>';

$document = new DOMDocument();
$document->loadHTML($html);
$xPath = new DOMXpath($document);

function extractNodeValue($query, $xPath, $attribute = null) {
    $node = $xPath->query("//{$query}")->item(0);
    if (!$node) {
        return null;
    }
    return $attribute ? $node->getAttribute($attribute) : $node->nodeValue;
}

$place = extractNodeValue('div[contains(@class, "fn")]/a', $xPath);
$address = extractNodeValue('div[contains(@class, "street-address")]',$xPath);
$locality = extractNodeValue('span[contains(@class, "locality")]',$xPath);
$region = extractNodeValue('span[contains(@class, "region")]', $xPath);
$postalCode = extractNodeValue('span[contains(@class, "postal-code")]', $xPath);
$countryName = extractNodeValue('span[contains(@class, "country-name")]', $xPath);
$start = extractNodeValue('div[@id="eventDetailInfo"]/div/div/abbr[contains(@class, "dtstart")]', $xPath, 'title');
$end = extractNodeValue('div[@id="eventDetailInfo"]/div/div/abbr[contains(@class, "dtend")]', $xPath, 'title');

var_dump($place, $address, $locality, $region, $postalCode, $countryName, $start, $end);

Output:

string(5) "PLACE"
string(7) "ADDRESS"
string(8) "LOCALITY"
string(6) "REGION"
string(11) "POSTAL CODE"
string(7) "COUNTRY"
string(19) "2012-07-16T21:00:00"
string(19) "2012-08-16T21:00:00"

Answer 1 (score: 0)

Your code is almost there:

<?php

$dochtml = new DOMDocument();
$dochtml->loadHTML('<div class="vcard location p">
    <div class="fn org">
        <a href="link here">PLACE</a>
    </div>
    <div class="adr">
        <div class="street-address">ADDRESS<br></div>
        <div>
            <span class="locality">LOCALITY</span>,
            <span class="region">REGION</span>
            <span class="postal-code">POSTAL CODE</span>,
            <span class="country-name">COUNTRY</span>
        </div>
    </div>
</div>');

$xpath = new DOMXpath($dochtml);

$place       = $xpath->query('//div[@class="fn org"]/a')->item(0)->nodeValue;
$address     = $xpath->query('//div[@class="street-address"]')->item(0)->nodeValue;
$locality    = $xpath->query('//span[@class="locality"]')->item(0)->nodeValue;
$region      = $xpath->query('//span[@class="region"]')->item(0)->nodeValue;
$postalCode  = $xpath->query('//span[@class="postal-code"]')->item(0)->nodeValue;
$countryName = $xpath->query('//span[@class="country-name"]')->item(0)->nodeValue;

Live code available here.
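
Two things tend to go wrong when this pattern meets a real page: item(0) returns null when nothing matches, so the chained ->nodeValue fails, and loadHTML() may emit warnings for markup libxml does not recognise. A minimal sketch of guarding against both is shown below; the firstText() helper and the sample markup are hypothetical.

```php
<?php
// Hypothetical input: depending on the libxml version, tags such as <section>
// may produce parse warnings in loadHTML().
$html = '<section><span class="locality"> LOCALITY </span></section>';

$previous = libxml_use_internal_errors(true); // collect parse errors instead of emitting warnings
$dochtml = new DOMDocument();
$dochtml->loadHTML($html);
libxml_clear_errors();                        // discard the collected errors
libxml_use_internal_errors($previous);        // restore the earlier setting

$xpath = new DOMXPath($dochtml);

// Hypothetical helper: trimmed text of the first match, or null when nothing matches.
function firstText(DOMXPath $xpath, string $query): ?string
{
    $node = $xpath->query($query)->item(0);
    return $node === null ? null : trim($node->nodeValue);
}

$locality = firstText($xpath, '//span[@class="locality"]');
$region   = firstText($xpath, '//span[@class="region"]'); // null: not present in this sample
```

Returning null for a missing node lets the caller decide how to handle incomplete pages rather than stopping on the first absent field.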

Answer 2 (score: -1)

If you know CSS selectors, use PHPQuery or a similar library.
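
To show how a CSS selector relates to the XPath queries used in the other answers, here is a rough sketch that translates the single form "tag.class" into XPath using only the built-in DOM extension. It is not PHPQuery's API; simpleSelectorToXPath() is a hypothetical helper, and a selector library handles far more than this one form.

```php
<?php
// Hypothetical helper: translates a simple "tag.class" selector into XPath.
// Only "tag", ".class" and "tag.class" are handled.
function simpleSelectorToXPath(string $selector): string
{
    [$tag, $class] = array_pad(explode('.', $selector, 2), 2, null);
    $xpath = '//' . ($tag !== '' ? $tag : '*');
    if ($class !== null) {
        // Match the class as a whole token within the class attribute.
        $xpath .= sprintf('[contains(concat(" ", normalize-space(@class), " "), " %s ")]', $class);
    }
    return $xpath;
}

$document = new DOMDocument();
$document->loadHTML('<div class="adr"><span class="locality">LOCALITY</span></div>');
$domXPath = new DOMXPath($document);

$query    = simpleSelectorToXPath('span.locality');
$locality = $domXPath->query($query)->item(0)->nodeValue;
```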