simplexml not loading <a> tag classes?

Time: 2015-10-20 16:20:46

Tags: php html web-scraping simpledom

I have a bit of PHP that grabs the HTML from a page and loads it into a SimpleXML object. However, it's not picking up the class of the <a> element inside a <li>.

The PHP:

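// load the HTML page with curl ($ch is a curl handle set up earlier, with CURLOPT_RETURNTRANSFER enabled so the body is returned)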
$html = curl_exec($ch);
curl_close($ch);

$doc = new DOMDocument();
$doc->loadHTML($html);
$sxml = simplexml_import_dom($doc);

The page HTML is below. A var_dump of $html shows that it has been scraped and is present in $html:

    <li class="large">
        <a style="" id="ref_3" class="off" href="#" onmouseover="highlightme('07');return false;" onclick="req('379');return false;" title="">07</a>
    </li>

The var_dump (below) of $doc and $sxml shows that the <a> element's class of 'off' is now missing. Unfortunately, I need to process the page based on this class.

            [8]=>
             object(SimpleXMLElement)#50 (2) {
              ["@attributes"]=>
              array(1) {
                ["class"]=>
                string(16) "large"
              }
              ["a"]=>
              string(2) "08"
            }

1 answer:

Answer 0 (score: 1)

Use simplexml_load_file and xpath; see the inline comments.

What you are after, really, once you have found the element you need, is this:

$row->a->attributes()->class=="off"
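
As a minimal sketch of why that expression works: var_dump on a SimpleXMLElement shows a text-only child such as <a> as a plain string and hides its attributes, but the attributes can still be read. The snippet below assumes a cut-down copy of the markup from the question, wrapped in a hypothetical <ul>; the variable names are illustrative only.

```php
<?php
// Cut-down copy of the markup from the question; the surrounding <ul> is an assumption.
$html = '<ul><li class="large"><a id="ref_3" class="off" href="#">07</a></li></ul>';

// Collect parser warnings internally so sloppy real-world HTML does not spam the output.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
$sxml = simplexml_import_dom($doc);

// loadHTML wraps the fragment in <html><body>, so search from the root with XPath.
$items = $sxml->xpath('//li');
$li = $items[0];

// Attributes are read with array syntax (or attributes()); cast to string before comparing.
$class = (string) $li->a['class'];
$label = (string) $li->a;

echo $label . ' has class "' . $class . '"' . PHP_EOL;
```

The explicit (string) casts are a design choice: attribute access returns a SimpleXMLElement, and a cast makes strict comparisons with === behave as expected.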

Full code below:

// $xml is assumed to be the SimpleXMLElement returned by simplexml_load_file()
// let's take all the divs that have the class "stff_grid"
$divs = $xml->xpath("//*[@class='stff_grid']");

// for each of these elements, let's print out the value inside the first p tag
foreach($divs as $div){
    print $div->p->a . PHP_EOL;

    // now for each li tag let's print out the contents inside the a tag
    foreach ($div->ul->li as $row){

        // same as before
        print "  - " . $row->a;
        if ($row->a->attributes()->class=="off") print " *off*";
        print PHP_EOL;

        // or shorter
        // print "  - " . $row->a . (($row->a->attributes()->class=="off")?" *off*":"") . PHP_EOL;

    }
}
/* this outputs the following
Person 1
  - 1 hr *off*
  - 2 hr
  - 3 hr *off*
  - 4 hr
  - 5 hr
  - 6 hr *off*
  - 7 hr *off*
  - 8 hr
Person 2
  - 1 hr
  - 2 hr
  - 3 hr
  - 4 hr
  - 5 hr
  - 6 hr
  - 7 hr *off*
  - 8 hr *off*
Person 3
  - 1 hr
  - 2 hr
  - 3 hr
  - 4 hr *off*
  - 5 hr
  - 6 hr
  - 7 hr *off*
  - 8 hr
*/
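
If only the "off" links are needed, the class test can also be pushed into the XPath expression itself. This is a rough sketch rather than part of the code above: the markup is hypothetical (modeled on the question, with an invented "on" class for contrast), and it goes through DOMDocument as in the question instead of simplexml_load_file.

```php
<?php
// Hypothetical markup modeled on the question; the real page is not shown in full.
$html = '<ul>'
      . '<li class="large"><a class="off" href="#">07</a></li>'
      . '<li class="large"><a class="on" href="#">08</a></li>'
      . '<li class="large"><a class="off" href="#">09</a></li>'
      . '</ul>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
$sxml = simplexml_import_dom($doc);

// Select only the <a> elements inside <li> whose class attribute is exactly "off".
$off = [];
foreach ($sxml->xpath("//li/a[@class='off']") as $a) {
    $off[] = (string) $a; // text content of the link
}

echo implode(', ', $off) . PHP_EOL;
```

Note that @class='off' is an exact match on the whole attribute value; an element carrying several classes (for example class="off large") would need a contains()-style test instead.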