I am writing some code to read content out of an HTML document. I have worked out that everything I care about is wrapped in a td with this style:
<td style="padding:8px 10px;">
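So I used an HTML DOM reader to find every <td> that matches that style, since those are the ones I care about.
The content of each one is an anchor, but the anchor has tags inside it.
What I want is to grab the anchor inside the <td>, then pull everything in the <b> and everything in the <i> into separate variables that I can work with later, perhaps in an array.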
I have been able to print_r the DOM nodes, but the result always gives me the textContent without the <b> and <i> tags, i.e.:
// Note the tabs are actually from the output
[textContent] => Wed 23rd Nov Red Hot Chilli Pipers
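The flattened text is expected behaviour: textContent joins the text of every descendant and discards the tags. As a rough sketch (the sample markup here is an assumption modeled on the snippet in this question, not the real page), passing a node to DOMDocument::saveHTML() keeps the child markup instead:

```php
<?php
// Sample markup modeled on the question's snippet (an assumption, not the real page).
$html = '<table><tr><td style="padding:8px 10px;">'
      . '<a href="#/b4f0d"><b>Wed 23rd Nov</b><br><i>Some intro text</i></a>'
      . '</td></tr></table>';

$dom = new DOMDocument;
// Real-world HTML often triggers parser warnings, so collect them quietly.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$td = $dom->getElementsByTagName('td')->item(0);

// textContent concatenates the text of all descendants and drops the tags.
$flat = $td->textContent;

// saveHTML($node) serializes the node itself, so the child tags survive.
$markup = $dom->saveHTML($td);

echo $flat, "\n";
echo $markup, "\n";
```

Serialized markup is handy for inspection, but for pulling out individual values it is usually simpler to query the child elements directly.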
My code is:
// PHP
include("simple_html_dom.php");
$document = "item.html";
// Retrieve the DOM from a given URL
$html = file_get_html($document);
$dom = new DOMDocument;
$dom->loadHTML($html);
task($dom);
function task($dom)
{
    // Get table
    $items = $dom->getElementsByTagName('table');
    $results = array();
    foreach ($items as $item) {
        $div_style = $item->getAttribute('style');
        if ($div_style == "width:100%;border:1px solid #666;") {
            $results[] = $item;
        }
    }
    pre_print_r($results);
}
function pre_print_r($item)
{
    print '<pre>';
    print_r($item);
    print '</pre>';
}
The HTML typically looks like this:
// The tabs are actually part of the content and would need to be stripped out
<td style="padding:8px 10px;">
<a rel="nofollow" target="_blank" href="#/b4f0d" style="text-decoration:none;">
<b style="font-size:18px;font-weight:bold;">Wed 23rd Nov</b><br>
<i style="font-size:18px;">
Some intro text: <br>Some detail text </i>
</a>
</td>
When I run the PHP script, this is the output I get:
Array
(
[0] => DOMElement Object
(
[tagName] => table
[schemaTypeInfo] =>
[nodeName] => table
[nodeValue] => Wed 23rd Nov Red Hot Chilli Pipers
[nodeType] => 1
[parentNode] => (object value omitted)
[childNodes] => (object value omitted)
[firstChild] => (object value omitted)
[lastChild] => (object value omitted)
[previousSibling] => (object value omitted)
[attributes] => (object value omitted)
[ownerDocument] => (object value omitted)
[namespaceURI] =>
[prefix] =>
[localName] => table
[baseURI] =>
[textContent] => Wed 23rd Nov Red Hot Chilli Pipers
)
// ... array continues here
...
[Edit]
If I try:
$items = $dom->getElementsByTagName('td');
$results = array();
foreach ($items as $item) {
    $div_style = $item->getAttribute('style');
    if ($div_style == "padding:8px 10px;") {
        foreach ($item->childNodes as $childItem) {
            pre_print_r($childItem->nodeValue);
        }
    }
}
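This outputs the contents of the <td>, but it does not give me the link, the <b> content, or the <i> content separately. The output looks like: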
DOMElement Object
(
[tagName] => a
[schemaTypeInfo] =>
[nodeName] => a
[nodeValue] => Wed 23rd Nov Some artist
[nodeType] => 1
[parentNode] => (object value omitted)
[childNodes] => (object value omitted)
[firstChild] => (object value omitted)
[lastChild] => (object value omitted)
[previousSibling] => (object value omitted)
[nextSibling] => (object value omitted)
[attributes] => (object value omitted)
[ownerDocument] => (object value omitted)
[namespaceURI] =>
[prefix] =>
[localName] => a
[baseURI] =>
[textContent] => Wed 23rd Nov Some artist
)
...
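So, to summarize:
I am able to get:
<td style="padding:8px 10px;"> => [OK]
The content inside the td => [OK]
My question:
What I am stuck on is getting, from inside the <td>:
the <b> content
the <i> content

Answer 0 (score: 0)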
I have now managed to solve the problem!
// Retrieve the DOM from a given URL
$html = file_get_html($document);
$dom = new DOMDocument;
$dom->loadHTML($html);
print '<p>Using:' . $document . '</p>';
$tds = $dom->getElementsByTagName('td');
$events = array();
foreach ($tds as $td) {
    $tag_style = $td->getAttribute('style');
    // Only process the cells that carry the exact style string
    if ($tag_style == "padding:8px 10px;") {
        // Calling getElementsByTagName on the <td> searches inside that cell only
        $links = $td->getElementsByTagName('a');
        $boldTags = $td->getElementsByTagName('b');
        $italicTags = $td->getElementsByTagName('i');
        // cleanup() is a helper whose definition is not shown here
        $eventDate = cleanup($boldTags->item(0)->textContent);
        $eventTitle = cleanup($italicTags->item(0)->textContent);
        $event = array();
        foreach ($links as $link) {
            $event['a'] = $link->getAttribute('href');
            $event['date'] = $eventDate;
            $event['title'] = $eventTitle;
        }
        array_push($events, $event);
    }
}
pre_print_r($events);
That did the trick!
Example output:
Array
(
[0] => Array
(
[a] => Some hyperlink
[date] => Tue 12th Jul
[title] => Some title
)
[1] => Array
(
[a] => Some hyperlink
[date] => Thu 28th Jul
[title] => Some title 2
)
// array continues ...
)
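The cleanup() helper called in the code above is not shown. A minimal sketch of what such a helper might look like, assuming its only job is to remove the tabs and line breaks mentioned in the question (the body below is a guess, not the original function):

```php
<?php
// Hypothetical stand-in for the cleanup() helper; the original is not shown.
function cleanup($text)
{
    // Collapse any run of whitespace (tabs, newlines, spaces) into one space...
    $collapsed = preg_replace('/\s+/', ' ', $text);
    // ...then strip whatever whitespace is left at either end.
    return trim($collapsed);
}

echo cleanup("\t\tWed 23rd Nov\n\t\t"), "\n";
```

Collapsing inner whitespace as well as trimming the ends matters here because the <i> text spans several indented lines.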
My problem is now solved.
Thanks!
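As a possible alternative to comparing the style attribute inside a loop, DOMXPath can select the anchors directly. This is only a sketch under assumptions: the sample markup is made up to resemble the question's snippet, and the null check is an addition that guards against cells with no <b> or <i> inside the link.

```php
<?php
// Sample markup modeled on the question's snippet (an assumption, not the real page).
$html = '<table><tr>'
      . '<td style="padding:8px 10px;"><a href="#/b4f0d">'
      . '<b>Wed 23rd Nov</b><br><i>Some intro text</i></a></td>'
      . '<td style="padding:8px 10px;">no link in this cell</td>'
      . '</tr></table>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$events = array();

// Select only the anchors that sit inside a td with the exact style string.
foreach ($xpath->query('//td[@style="padding:8px 10px;"]//a') as $link) {
    $bold   = $link->getElementsByTagName('b')->item(0);
    $italic = $link->getElementsByTagName('i')->item(0);

    // Skip anchors that lack either tag instead of reading textContent from null.
    if ($bold === null || $italic === null) {
        continue;
    }

    $events[] = array(
        'a'     => $link->getAttribute('href'),
        'date'  => trim($bold->textContent),
        'title' => trim($italic->textContent),
    );
}

print_r($events);
```

Matching on the exact style string is as brittle here as in the loop version; if the page ever changes its inline styles, the selector needs updating too.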