Question

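I'm writing some code to read content from an HTML document, and I've been able to determine that all the content I care about is wrapped in a td marked with this style: <td style="padding:8px 10px;">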

So I used the HTML DOM reader to find every <td> that matches that style, since those are the ones I care about.

The content in there is an anchor, but it has <b> and <i> tags inside it.

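What I want is to grab the anchor inside the <td>, then get everything in the <b> and everything in the <i> into separate variables that I can do something with later; perhaps in an array.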

I've been able to print_r the DOM nodes, but it always returns the textContent without the <b> and <i> tags,

i.e.:

// Note the tabs are actually from the output 

[textContent] =>                                            Wed 23rd Nov                                                        Red Hot Chilli Pipers

My code is:

// PHP
include("simple_html_dom.php");
$document = "item.html";

// Retrieve the DOM from a given URL
$html = file_get_html($document);
$dom = new DOMDocument;
$dom->loadHTML($html);

task($dom);


function task($dom)
{

  // Get table
  $items = $dom->getElementsByTagName('table');

  $results = array();
  foreach ($items as $item) {
    $div_style = $item->getAttribute('style');
    if ($div_style == "width:100%;border:1px solid #666;") {
      $results[] = $item;
    }
  }

  pre_print_r($results);

}

function pre_print_r($item)
{
  print '<pre>';
  print_r($item);
  print '</pre>';
}

The HTML typically looks like this:

// The tabs are actually part of the content and would need to be stripped out
<td style="padding:8px 10px;">
    <a rel="nofollow" target="_blank" href="#/b4f0d" style="text-decoration:none;">
        <b style="font-size:18px;font-weight:bold;">Wed 23rd Nov</b><br>
        <i style="font-size:18px;">
                                        Some intro text: <br>Some detail text                       </i>
    </a>
</td>

When the PHP script runs, this is the output I get:

Array
(
    [0] => DOMElement Object
        (
            [tagName] => table
            [schemaTypeInfo] => 
            [nodeName] => table
            [nodeValue] =>                                              Wed 23rd Nov                                                        Red Hot Chilli Pipers                                                                                                                           
            [nodeType] => 1
            [parentNode] => (object value omitted)
            [childNodes] => (object value omitted)
            [firstChild] => (object value omitted)
            [lastChild] => (object value omitted)
            [previousSibling] => (object value omitted)
            [attributes] => (object value omitted)
            [ownerDocument] => (object value omitted)
            [namespaceURI] => 
            [prefix] => 
            [localName] => table
            [baseURI] => 
            [textContent] =>                                            Wed 23rd Nov                                                        Red Hot Chilli Pipers                                                                                                                           
        )
// ... array continues here

...

[Edit]

If I try

$items = $dom->getElementsByTagName('td');
$results = array();
  foreach ($items as $item) {
    $div_style = $item->getAttribute('style');
    if ($div_style == "padding:8px 10px;") {
      foreach ($item->childNodes as $childItem) {
        pre_print_r($childItem->nodeValue);
      }
    }
  }

This outputs the text content of the <td>, but doesn't give me the link, or the <b> or <i> content separately.

As per:

DOMElement Object
(
    [tagName] => a
    [schemaTypeInfo] => 
    [nodeName] => a
    [nodeValue] =>                          Wed 23rd Nov                                                        Some artist                                         
    [nodeType] => 1
    [parentNode] => (object value omitted)
    [childNodes] => (object value omitted)
    [firstChild] => (object value omitted)
    [lastChild] => (object value omitted)
    [previousSibling] => (object value omitted)
    [nextSibling] => (object value omitted)
    [attributes] => (object value omitted)
    [ownerDocument] => (object value omitted)
    [namespaceURI] => 
    [prefix] => 
    [localName] => a
    [baseURI] => 
    [textContent] =>                        Wed 23rd Nov                                                        Some artist                                         
)

...

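So, to summarise:

I am able to

Grab the <td style="padding:8px 10px;"> => [OK]
Grab the content inside the td => [OK]

My problem:

What I'm stuck on is:

How to grab the link (the <a>) inside the <td>
How to get everything inside the <b> within the link
How to get everything inside the <i> within the link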

Answer 1

I've now been able to solve the problem!

// Retrieve the DOM from a given URL
$html = file_get_html($document);
$dom = new DOMDocument;
$dom->loadHTML($html);

print '<p>Using:' . $document . '</p>';

$tds = $dom->getElementsByTagName('td');

$events = array();

foreach($tds as $td) {
    $tag_style = $td->getAttribute('style');
    if ($tag_style == "padding:8px 10px;") {
        $links = $td->getElementsByTagName('a');
        $boldTags = $td->getElementsByTagName('b');
        $italicTags = $td->getElementsByTagName('i');

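        // cleanup() is my own helper (not shown here); it strips the tabs
        // and surrounding whitespace noted in the question.
        $eventDate = cleanup($boldTags->item(0)->textContent);
        $eventTitle = cleanup($italicTags->item(0)->textContent);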

        $event = array();
        foreach($links as $link) {
            $event['a'] = $link->getAttribute('href');
            $event['date'] = $eventDate;
            $event['title'] = $eventTitle;
        }
        array_push($events, $event);
    }
}

pre_print_r($events);
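
The code above calls cleanup() without showing its definition. As a minimal sketch only, and assuming its job is to remove the tabs and padding visible in the question's output, such a helper might look like the following. The body here is a guess, not the original implementation.

```php
<?php
// Hypothetical stand-in for the cleanup() helper; the original is not shown.
function cleanup(string $text): string
{
    // Collapse every run of whitespace (tabs, newlines, spaces) into one space.
    $collapsed = preg_replace('/\s+/', ' ', $text);

    // preg_replace() returns null on failure, so fall back to the raw input,
    // then strip leading and trailing whitespace.
    return trim($collapsed ?? $text);
}

echo cleanup("\t\t  Wed 23rd Nov \n\t"), "\n";
```

With a helper along these lines, the padded textContent values come back as plain single-spaced strings.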

That got it working!

Example output:

Array
(
    [0] => Array
        (
            [a] => Some hyperlink
            [date] => Tue 12th Jul
            [title] => Some title
        )

    [1] => Array
        (
            [a] => Some hyperlink
            [date] => Thu 28th Jul
            [title] => Some title 2
        )

// array continues ...

)
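
For comparison, the same extraction can be sketched with DOMXPath instead of nested getElementsByTagName calls. This is an alternative under stated assumptions, not the code used above: the inline $html string is sample markup modelled on the question, squash() is a hypothetical whitespace helper, and the extra 'html' key is added only to show that DOMDocument::saveHTML() on a node returns its markup with the tags intact, which textContent does not.

```php
<?php
// Sample markup modelled on the question's HTML (assumption: one event cell).
$html = <<<'HTML'
<table><tr>
<td style="padding:8px 10px;">
    <a rel="nofollow" target="_blank" href="#/b4f0d" style="text-decoration:none;">
        <b style="font-size:18px;font-weight:bold;">Wed 23rd Nov</b><br>
        <i style="font-size:18px;">
                Some intro text: <br>Some detail text        </i>
    </a>
</td>
</tr></table>
HTML;

// Hypothetical helper: collapse whitespace runs, then trim.
function squash(string $text): string
{
    return trim(preg_replace('/\s+/', ' ', $text) ?? $text);
}

// Collect libxml parse warnings internally instead of emitting them.
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$events = [];

// Select anchors inside cells that carry the exact style from the question.
foreach ($xpath->query('//td[@style="padding:8px 10px;"]//a') as $link) {
    // Query relative to the current anchor for its first <b> and <i>.
    $bold   = $xpath->query('.//b', $link)->item(0);
    $italic = $xpath->query('.//i', $link)->item(0);

    $events[] = [
        'a'     => $link->getAttribute('href'),
        'date'  => $bold ? squash($bold->textContent) : null,
        'title' => $italic ? squash($italic->textContent) : null,
        // Markup of the <i> element, tags included.
        'html'  => $italic ? $dom->saveHTML($italic) : null,
    ];
}

print_r($events);
```

Matching on the exact style string is as brittle here as in the loop-based version; if the page's inline styles change, the XPath expression has to change with them.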

My problem is now solved.

Thanks!

PHP - Getting link content from a parsed DOM document

1 answer: