Strange behavior with Simple HTML DOM

Time: 2015-04-18 10:41:05

Tags: php html css parsing simple-html-dom


I'm trying to parse a website that displays calendar events in a table, and I'm running into some strange behavior.

HTML structure:

-----------------------------
date 1| - 1st event this date
      | - 2nd event this date
-----------------------------
date 2| - 1st event this date
      | - 2nd event this date
-----------------------------
date 3| - 1st event this date
-----------------------------
date 4| - 1st event this date
-----------------------------

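As you can see, it is basically a <table> in which each <tr> represents one date:

  • The first <td>, with class="ev_td_left", contains the date string I want to parse.
  • The second <td>, with class="ev_td_right", contains an unordered list in which each <li class="ev_td_li"> represents one event entry.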


What I've tried:

I tried parsing it with simple_html_dom.php:

foreach($html->find('#jevents_body table.ev_table tbody tr') as $tr) {

    // The first text node of the left cell holds the date string
    $dateEl = $tr->find("td.ev_td_left text", 0);
    $eventDate = parseDate($dateEl->plaintext);

    // Iterate through all events on this date
    foreach($tr->find('li.ev_td_li') as $li) {

        // Get the event title
        $title = $li->find('a.ev_link_row', 0)->plaintext;
        print("Parsed: [$title, $eventDate]\r\n");
    }
}


The problem:

It seems to parse the whole page twice somehow. My output looks something like this:

Parsed: [1st event this date, date 1]
Parsed: [2nd event this date, date 1]
Parsed: [1st event this date, date 2]
Parsed: [2nd event this date, date 2]
Parsed: [1st event this date, date 3]
Parsed: [1st event this date, date 4]

//and here it runs again...
Parsed: [1st event this date, date 1]
Parsed: [2nd event this date, date 1]
Parsed: [1st event this date, date 2]
Parsed: [2nd event this date, date 2]
Parsed: [1st event this date, date 3]
Parsed: [1st event this date, date 4]

Does anyone know what is going wrong?


Edit 1: Markup:

As suggested, here is the HTML markup (it is pretty ugly): http://www.akg-bensheim.de/termine/range.listevents/-

It produces this output:

Parsed: [Vorstand des Fördervereins, 2015-04-29]
Parsed: [Beginn der sportpraktischen Abiturprüfungen, 2015-04-29]
Parsed: [Christi Himmelfahrt, 2015-04-29]
Parsed: [Brückentag / beweglicher Ferientag, 2015-04-29]
Parsed: [Pfingstmontag, 2015-04-29]
Parsed: [Bundesjugendspiele, 2015-04-29]
Parsed: [Unterrichtsfrei wegen mündl. Abitur, 2015-04-29]
Parsed: [Mündliche Abiturprüfungen, 2015-04-29]
Parsed: [Fronleichnam, 2015-04-29]
Parsed: [Brückentag / beweglicher Ferientag, 2015-04-29]
Parsed: [Pensionäre: Sommerstammtisch, 2015-04-29]
Parsed: [Abiturienten-Gottesdienst, 2015-04-29]
Parsed: [Akademische Abitur-Feier, 2015-04-29]
Parsed: [Abi-Ball, 2015-04-29]
Parsed: [Sommerferien, 2015-04-29]
Parsed: [Vorstand des Fördervereins, 2015-04-29]
Parsed: [Beginn der sportpraktischen Abiturprüfungen, 2015-05-04]
Parsed: [Christi Himmelfahrt, 2015-05-14]
Parsed: [Brückentag / beweglicher Ferientag, 2015-05-15]
Parsed: [Pfingstmontag, 2015-05-25]
Parsed: [Bundesjugendspiele, 2015-05-28]
Parsed: [Unterrichtsfrei wegen mündl. Abitur, 2015-05-29]
Parsed: [Mündliche Abiturprüfungen, 2015-05-29]
Parsed: [Fronleichnam, 2015-06-04]
Parsed: [Brückentag / beweglicher Ferientag, 2015-06-05]
Parsed: [Pensionäre: Sommerstammtisch, 2015-06-09]
Parsed: [Abiturienten-Gottesdienst, 2015-06-24]
Parsed: [Akademische Abitur-Feier, 2015-06-25]
Parsed: [Abi-Ball, 2015-06-27]
Parsed: [Sommerferien, 2015-07-27]

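As you can see, it somehow parses the whole thing twice! In the first pass every event is reported with the first date (2015-04-29); only the second pass pairs each event with its own date.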

1 Answer:

Answer 0 (score: 0)

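OK, I've found a way to work around this problem, although I still don't know why the parser behaves so strangely.

I basically ended up checking the plaintext property of each table row and, if its text is empty, skipping to the next loop iteration: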

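foreach($html->find('#jevents_body table.ev_table tbody tr') as $tr) {
    // Skip rows that contain no visible text.
    // A strict comparison is used because empty("0") is also true in PHP.
    $tmp = trim($tr->plaintext);
    if($tmp === '') {
        continue;
    }

    // Parsing
    ...
}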
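
For anyone who wants to try the empty-row guard without installing simple_html_dom, here is a minimal sketch that uses PHP's built-in DOMDocument and DOMXPath instead. That is a swapped-in technique, not the parser from the question, so it neither reproduces nor explains the behavior described above. The sample markup, its empty spacer row, and the function name parseEvents are assumptions made up for illustration; I have not confirmed that the real page contains rows like that.

```php
<?php
// Hypothetical sample markup that follows the structure described in the
// question. The empty spacer row is an assumption, not taken from the real page.
$sample = <<<HTML
<div id="jevents_body">
  <table class="ev_table">
    <tbody>
      <tr>
        <td class="ev_td_left">date 1</td>
        <td class="ev_td_right"><ul>
          <li class="ev_td_li"><a class="ev_link_row" href="#">1st event</a></li>
          <li class="ev_td_li"><a class="ev_link_row" href="#">2nd event</a></li>
        </ul></td>
      </tr>
      <tr><td></td><td></td></tr>
      <tr>
        <td class="ev_td_left">date 2</td>
        <td class="ev_td_right"><ul>
          <li class="ev_td_li"><a class="ev_link_row" href="#">1st event</a></li>
        </ul></td>
      </tr>
    </tbody>
  </table>
</div>
HTML;

// Hypothetical helper: returns a list of [title, date] pairs.
function parseEvents(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them;
    // real-world markup is rarely valid.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // contains(@class, ...) is a substring match, which is looser than a
    // CSS class selector but sufficient for this sketch.
    $rows = $xpath->query(
        '//*[@id="jevents_body"]//table[contains(@class, "ev_table")]//tr'
    );

    $events = [];
    foreach ($rows as $tr) {
        // Same guard as in the answer: skip rows without visible text.
        if (trim($tr->textContent) === '') {
            continue;
        }

        // Also skip rows that have no date cell at all.
        $dateCell = $xpath
            ->query('.//td[contains(@class, "ev_td_left")]', $tr)
            ->item(0);
        if ($dateCell === null) {
            continue;
        }
        $date = trim($dateCell->textContent);

        $links = $xpath->query(
            './/li[contains(@class, "ev_td_li")]//a[contains(@class, "ev_link_row")]',
            $tr
        );
        foreach ($links as $a) {
            $events[] = [trim($a->textContent), $date];
        }
    }
    return $events;
}

foreach (parseEvents($sample) as [$title, $date]) {
    print("Parsed: [$title, $date]\n");
}
```

The date string is kept as plain text here; in the original code it would still go through parseDate. The extra check for a missing date cell is my own addition, so that a row without td.ev_td_left is skipped rather than dereferenced.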