I'm trying to parse a website that displays calendar events in a table, and I'm running into some strange behavior.
-----------------------------
date 1| - 1st event this date
      | - 2nd event this date
-----------------------------
date 2| - 1st event this date
      | - 2nd event this date
-----------------------------
date 3| - 1st event this date
-----------------------------
date 4| - 1st event this date
-----------------------------
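As you can see, it's basically a <table> in which each <tr> represents one date:
The first <td>, with class="ev_td_left", contains the date string I want to parse.
The second <td>, with class="ev_td_right", contains an unordered list in which each <li class="ev_td_li"> represents one event entry.
I tried parsing it with simple_html_dom.php: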
foreach ($html->find('#jevents_body table.ev_table tbody tr') as $tr) {
    $dateEl = $tr->find("td.ev_td_left text", 0);
    $eventDate = parseDate($dateEl->plaintext);
    // Iterate through all events on this date
    foreach ($tr->find('li.ev_td_li') as $li) {
        // Get the event title
        $title = ($li->find('a.ev_link_row', 0))->plaintext;
        print("Parsed: [$title, $eventDate]\r\n");
    }
}
It seems to parse the whole page twice somehow. My output looks something like this:
Parsed: [1st event this date, date 1]
Parsed: [2nd event this date, date 1]
Parsed: [1st event this date, date 2]
Parsed: [2nd event this date, date 2]
Parsed: [1st event this date, date 3]
Parsed: [1st event this date, date 4]
//and here it runs again...
Parsed: [1st event this date, date 1]
Parsed: [2nd event this date, date 1]
Parsed: [1st event this date, date 2]
Parsed: [2nd event this date, date 2]
Parsed: [1st event this date, date 3]
Parsed: [1st event this date, date 4]
Does anyone know what the problem is?
As suggested, here is the HTML markup (it's pretty bad): http://www.akg-bensheim.de/termine/range.listevents/-
It produces this output:
Parsed: [Vorstand des Fördervereins, 2015-04-29]
Parsed: [Beginn der sportpraktischen Abiturprüfungen, 2015-04-29]
Parsed: [Christi Himmelfahrt, 2015-04-29]
Parsed: [Brückentag / beweglicher Ferientag, 2015-04-29]
Parsed: [Pfingstmontag, 2015-04-29]
Parsed: [Bundesjugendspiele, 2015-04-29]
Parsed: [Unterrichtsfrei wegen mündl. Abitur, 2015-04-29]
Parsed: [Mündliche Abiturprüfungen, 2015-04-29]
Parsed: [Fronleichnam, 2015-04-29]
Parsed: [Brückentag / beweglicher Ferientag, 2015-04-29]
Parsed: [Pensionäre: Sommerstammtisch, 2015-04-29]
Parsed: [Abiturienten-Gottesdienst, 2015-04-29]
Parsed: [Akademische Abitur-Feier, 2015-04-29]
Parsed: [Abi-Ball, 2015-04-29]
Parsed: [Sommerferien, 2015-04-29]
Parsed: [Vorstand des Fördervereins, 2015-04-29]
Parsed: [Beginn der sportpraktischen Abiturprüfungen, 2015-05-04]
Parsed: [Christi Himmelfahrt, 2015-05-14]
Parsed: [Brückentag / beweglicher Ferientag, 2015-05-15]
Parsed: [Pfingstmontag, 2015-05-25]
Parsed: [Bundesjugendspiele, 2015-05-28]
Parsed: [Unterrichtsfrei wegen mündl. Abitur, 2015-05-29]
Parsed: [Mündliche Abiturprüfungen, 2015-05-29]
Parsed: [Fronleichnam, 2015-06-04]
Parsed: [Brückentag / beweglicher Ferientag, 2015-06-05]
Parsed: [Pensionäre: Sommerstammtisch, 2015-06-09]
Parsed: [Abiturienten-Gottesdienst, 2015-06-24]
Parsed: [Akademische Abitur-Feier, 2015-06-25]
Parsed: [Abi-Ball, 2015-06-27]
Parsed: [Sommerferien, 2015-07-27]
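As you can see, it somehow parses the whole thing twice! On the first pass every event is paired with the first date (2015-04-29); only the second pass has the right dates.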
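Answer 0 (score: 0)
OK, I've found a way to work around this, although I still don't know why the parser behaves so strangely.
I basically ended up checking the plaintext property of each table row and skipping to the next iteration if the text is empty: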
foreach ($html->find('#jevents_body table.ev_table tbody tr') as $tr) {
    $tmp = trim($tr->plaintext);
    if (empty($tmp)) {
        continue;
    }
    // Parsing
    ...
}
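For anyone who wants to try the guard without the real page, here is a minimal self-contained sketch of the same idea. It is not the code from above: it uses PHP's built-in DOMDocument and DOMXPath instead of simple_html_dom.php so it runs without any extra file. The markup, the function name collectEvents and the variable $markup are made up for illustration; only the class names come from the question. The empty second row stands in for the rows the check is meant to skip.

```php
<?php
// Hypothetical markup: class names follow the question, the content is invented.
// The second <tr> has no text and represents a row the guard should skip.
$markup = <<<HTML
<div id="jevents_body"><table class="ev_table"><tbody>
<tr><td class="ev_td_left">date 1</td><td class="ev_td_right"><ul>
<li class="ev_td_li"><a class="ev_link_row" href="#">1st event</a></li>
<li class="ev_td_li"><a class="ev_link_row" href="#">2nd event</a></li>
</ul></td></tr>
<tr><td></td></tr>
<tr><td class="ev_td_left">date 2</td><td class="ev_td_right"><ul>
<li class="ev_td_li"><a class="ev_link_row" href="#">1st event</a></li>
</ul></td></tr>
</tbody></table></div>
HTML;

// Returns a list of [title, date] pairs, ignoring rows without any text.
function collectEvents(string $markup): array
{
    libxml_use_internal_errors(true); // keep warnings about sloppy HTML quiet
    $doc = new DOMDocument();
    $doc->loadHTML($markup);
    $xpath = new DOMXPath($doc);

    $events = [];
    $rows = $xpath->query('//*[@id="jevents_body"]//table[contains(@class, "ev_table")]//tr');
    foreach ($rows as $tr) {
        // Same guard as above: skip rows whose text is empty after trimming.
        if (trim($tr->textContent) === '') {
            continue;
        }
        // Extra safety check (an addition): skip rows without a date cell.
        $dateCell = $xpath->query('.//td[contains(@class, "ev_td_left")]', $tr)->item(0);
        if ($dateCell === null) {
            continue;
        }
        $date = trim($dateCell->textContent);
        $links = $xpath->query('.//li[contains(@class, "ev_td_li")]//a[contains(@class, "ev_link_row")]', $tr);
        foreach ($links as $link) {
            $events[] = [trim($link->textContent), $date];
        }
    }
    return $events;
}

foreach (collectEvents($markup) as [$title, $date]) {
    print("Parsed: [$title, $date]\n");
}
```

The comparison with === '' is used here instead of empty() because empty() also treats the string "0" as empty, which would drop a row whose only text is "0". Whether this guard is enough for the real page depends on markup I have not checked.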