PHP - Getting link content from a parsed DOM document

Time: 2016-05-23 12:41:16

Tags: php parsing dom

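I am writing some code to read content from an HTML document, and I have been able to determine that everything I care about is wrapped in a td with this style: <td style="padding:8px 10px;">

So I used an HTML DOM reader to find all the <td> elements matching that style, since those are the ones I care about.

The content in there is an anchor, but it has tags inside it.

What I want is to grab the anchor content in the <td>, then get everything in the <b> and everything in the <i> as separate variables that I can do something with later; perhaps in an array.

I have been able to print_r the DOM nodes, but it always returns the textContent without the <b> and <i>.

i.e.: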

// Note the tabs are actually from the output 

[textContent] =>                                            Wed 23rd Nov                                                        Red Hot Chilli Pipers       
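print_r shows only flattened text here because textContent (like nodeValue on an element) concatenates the text of every descendant and drops the tags. As a minimal sketch, assuming markup modeled on the sample cell shown further down (the $sample string and variable names are illustrative, not from the original code), DOMDocument::saveHTML() can be given a single node to serialize it with its tags intact:

```php
<?php
// Illustrative markup modeled on the question's sample cell (an assumption).
$sample = '<table><tr><td style="padding:8px 10px;">'
        . '<a href="#/b4f0d"><b>Wed 23rd Nov</b><br><i>Some intro text</i></a>'
        . '</td></tr></table>';

$dom = new DOMDocument;
$dom->loadHTML($sample);

$td = $dom->getElementsByTagName('td')->item(0);

// textContent flattens all descendant text nodes into one string.
$text = $td->textContent;

// saveHTML() with a node argument serializes just that node, tags included.
$markup = $dom->saveHTML($td);

echo $text, "\n";
echo $markup, "\n";
```

This is only useful for inspecting the structure; to read the <b> and <i> values individually, the nodes still have to be selected separately.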

My code is:

// PHP
include("simple_html_dom.php");
$document = "item.html";

// Retrieve the DOM from a given URL
$html = file_get_html($document);
$dom = new DOMDocument;
$dom->loadHTML($html);

task($dom);


function task($dom)
{

  // Get table
  $items = $dom->getElementsByTagName('table');

  $results = array();
  foreach ($items as $item) {
    $div_style = $item->getAttribute('style');
    if ($div_style == "width:100%;border:1px solid #666;") {
      $results[] = $item;
    }
  }

  pre_print_r($results);

}

function pre_print_r($item)
{
  print '<pre>';
  print_r($item);
  print '</pre>';
}

The HTML typically looks like this:

// The tabs are actually part of the content and would need to be stripped out
<td style="padding:8px 10px;">
    <a rel="nofollow" target="_blank" href="#/b4f0d" style="text-decoration:none;">
        <b style="font-size:18px;font-weight:bold;">Wed 23rd Nov</b><br>
        <i style="font-size:18px;">
                                        Some intro text: <br>Some detail text                       </i>
    </a>
</td>

When the PHP script runs, the output I get is:

Array
(
    [0] => DOMElement Object
        (
            [tagName] => table
            [schemaTypeInfo] => 
            [nodeName] => table
            [nodeValue] =>                                              Wed 23rd Nov                                                        Red Hot Chilli Pipers                                                                                                                           
            [nodeType] => 1
            [parentNode] => (object value omitted)
            [childNodes] => (object value omitted)
            [firstChild] => (object value omitted)
            [lastChild] => (object value omitted)
            [previousSibling] => (object value omitted)
            [attributes] => (object value omitted)
            [ownerDocument] => (object value omitted)
            [namespaceURI] => 
            [prefix] => 
            [localName] => table
            [baseURI] => 
            [textContent] =>                                            Wed 23rd Nov                                                        Red Hot Chilli Pipers                                                                                                                           
        )
// ... array continues here

...

[Edit]

If I try

$items = $dom->getElementsByTagName('td');
$results = array();
  foreach ($items as $item) {
    $div_style = $item->getAttribute('style');
    if ($div_style == "padding:8px 10px;") {
      foreach ($item->childNodes as $childItem) {
        pre_print_r($childItem->nodeValue);
      }
    }
  }

This outputs the content of the <td>, but it does not give me the link or the <b> and <i> content.

For example:

DOMElement Object
(
    [tagName] => a
    [schemaTypeInfo] => 
    [nodeName] => a
    [nodeValue] =>                          Wed 23rd Nov                                                        Some artist                                         
    [nodeType] => 1
    [parentNode] => (object value omitted)
    [childNodes] => (object value omitted)
    [firstChild] => (object value omitted)
    [lastChild] => (object value omitted)
    [previousSibling] => (object value omitted)
    [nextSibling] => (object value omitted)
    [attributes] => (object value omitted)
    [ownerDocument] => (object value omitted)
    [namespaceURI] => 
    [prefix] => 
    [localName] => a
    [baseURI] => 
    [textContent] =>                        Wed 23rd Nov                                                        Some artist                                         
)

...

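So, to summarize:

I am able to

  1. Grab the <td style="padding:8px 10px;"> => [OK]
  2. Grab the content inside the td => [OK]

But what I am stuck on is:

  1. How to grab the link inside the <td>
  2. How to get all the <b> content in the link
  3. How to grab all the <i> content in the link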
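The three steps above could also be sketched with DOMXPath, which selects the link, the <b>, and the <i> directly instead of walking childNodes. This is a sketch under assumptions: the $sample markup is modeled on the question's sample cell, and the array keys 'href', 'bold', and 'italic' are illustrative names.

```php
<?php
// Illustrative markup modeled on the question's sample cell (an assumption).
$sample = '<table><tr><td style="padding:8px 10px;">'
        . '<a rel="nofollow" href="#/b4f0d"><b>Wed 23rd Nov</b><br>'
        . '<i>  Some intro text: <br>Some detail text  </i></a>'
        . '</td></tr></table>';

$dom = new DOMDocument;
$dom->loadHTML($sample);
$xpath = new DOMXPath($dom);

$events = array();
// Select every <a> inside a <td> that carries the exact inline style.
foreach ($xpath->query('//td[@style="padding:8px 10px;"]//a') as $link) {
    $events[] = array(
        'href'   => $link->getAttribute('href'),
        // The second argument scopes each expression to the current link.
        'bold'   => trim($xpath->evaluate('string(.//b)', $link)),
        'italic' => trim($xpath->evaluate('string(.//i)', $link)),
    );
}

print_r($events);
```

Matching on the full style string is as brittle here as in the original comparison: any change in spacing or property order in the attribute would stop the cell from matching.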

1 Answer:

Answer 0: (score: 0)

I have now been able to solve the problem!

// Retrieve the DOM from a given URL
$html = file_get_html($document);
$dom = new DOMDocument;
$dom->loadHTML($html);

print '<p>Using:' . $document . '</p>';

$tds = $dom->getElementsByTagName('td');

$events = array();

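foreach ($tds as $td) {
    $tag_style = $td->getAttribute('style');
    if ($tag_style == "padding:8px 10px;") {
        $links = $td->getElementsByTagName('a');
        $boldTags = $td->getElementsByTagName('b');
        $italicTags = $td->getElementsByTagName('i');

        // Skip cells that lack the expected <a>, <b>, or <i> elements,
        // otherwise item(0) returns null and the property access fails
        if ($links->length === 0 || $boldTags->length === 0 || $italicTags->length === 0) {
            continue;
        }

        // cleanup() is not shown in this answer; presumably it strips the
        // surrounding whitespace from the text
        $eventDate = cleanup($boldTags->item(0)->textContent);
        $eventTitle = cleanup($italicTags->item(0)->textContent);

        // Record one event per link found in the cell
        foreach ($links as $link) {
            $events[] = array(
                'a' => $link->getAttribute('href'),
                'date' => $eventDate,
                'title' => $eventTitle,
            );
        }
    }
}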

pre_print_r($events);

That got it working!

Example output:

Array
(
    [0] => Array
        (
            [a] => Some hyperlink
            [date] => Tue 12th Jul
            [title] => Some title
        )

    [1] => Array
        (
            [a] => Some hyperlink
            [date] => Thu 28th Jul
            [title] => Some title 2
        )

// array continues ...

)
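The answer's code calls cleanup(), which is not shown anywhere in the post. A hypothetical stand-in, assuming its job is only to remove the tabs and padding seen in the raw textContent output, might look like this:

```php
<?php
// Hypothetical stand-in for the cleanup() helper that the answer calls
// but does not define; the real implementation is unknown.
function cleanup($text)
{
    // Collapse each run of whitespace (tabs, newlines, spaces) into a
    // single space, then strip what remains at either end.
    return trim(preg_replace('/\s+/', ' ', $text));
}

echo cleanup("\t\t  Some intro text: \n   Some detail text \t"), "\n";
```

Collapsing internal whitespace, rather than only trimming the ends, matters for the <i> content, where the <br> leaves a line break and indentation in the middle of the text.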

My problem is now solved.

Thanks!