PHP: collecting filtered data from the second table of an HTML page

Asked: 2014-08-13 21:13:06

Tags: php html xpath web-scraping domdocument

I recently got a parsing problem solved here very quickly, but this is a new challenge that I can't beat.

Here is a (horrible) HTML page containing several tables: mxs link. The table I'm interested in is the second one in the source, just below:

<DIV CLASS="main"><H3>funrace.MXSConcept.com</H3><H3>Recent Races</H3>

What I need is to collect all the races so that I get something like this in a drop-down box:

40 minutes ago - 8M+1L at 2013 Motosport World GP Rd 09: Lommel (2 riders)
1 day ago - 8M+1L at 2013 EMF FrenchCup Rd5 : Lacapelle Marival (1 riders)
...
where, for example, $date is the date,
$race is the second column, and
$link is hidden but is the URL from the first column (to use later in my dropdown).

Note: the dates seem to be generated on the fly, and some rows announce new track records -> those rows must be removed.

Here is what I tried (hey, stop laughing!):

require('simple_html_dom.php');

    $doc = new DOMDocument;
    //$doc->preserveWhiteSpace = false;
    $doc->loadHTMLfile('http://mxsimulator.com/servers/mx.MXSConcept.com/');
    $xpath = new DOMXPath($doc);

    $table = array();
    $xpath = new DOMXPath($doc);

    $table2 = $doc->getElementsByTagName('table')->item(1);

    // collect data
    $data = array();
    foreach ($table2->query('//tr') as $node) {
        $rowData = array();
        foreach ($table2->query('td', $node) as $cell) {
            $rowData[] = $cell->nodeValue;
        }
    }

    print_r($data);

3 answers:

Answer 0 (score: 1)

You have to use $doc->load(...) for external files. A similar question is answered here: Xpath and conditionally selecting descendants based on element value of ancestors

Answer 1 (score: 1)

First, just drop require('simple_html_dom.php'); since you are using DOMDocument and DOMXpath.

Second, $table2->query('//tr') will fail because $table2 is not a DOMXpath object. It is a DOMElement, which has no query() method.

$dom = new DOMDocument();
$dom->loadHTMLFile('http://mxsimulator.com/servers/mx.MXSConcept.com/');
$xpath = new DOMXpath($dom);

$data = array();
// target each row of the first table inside div.main
$target_table_rows = $xpath->query('//div[@class="main"]/table[1]/tr');
// if there are rows found,
if($target_table_rows->length > 0) {
    // for each row, loop it
    foreach($target_table_rows as $row_key => $row) {
        // if the first td cell of this current row is empty
        if(trim($xpath->query('./td[1]', $row)->item(0)->nodeValue) == '') {
            continue; // then skip it
        }
        $data[] = array(
            'datetime' => $xpath->query('./td[1]', $row)->item(0)->nodeValue,
            'link' => $xpath->query('./td[1]/a', $row)->item(0)->getAttribute('href'),
            'description' => $xpath->query('./td[2]', $row)->item(0)->nodeValue,
        );
    }
}

echo '<pre>';
print_r($data);

The output should look like this:

Array
(
    [0] => Array
        (
            [datetime] => 2014-08-14 15:32 UTC
            [link] => /servers/mx.MXSConcept.com/races/825.html
            [description] => 8M+1L at 2013 Johnson Mine MX (1 riders)
        )
    ... and so on
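
For trying this approach without hitting the network, here is a minimal sketch that runs the same XPath against an inline HTML string. The sample markup, its URLs, and the assumption that track-record rows have no link in their first cell are all hypothetical; the real page may be structured differently.

```php
<?php
// Hypothetical sample markup standing in for the real page (an assumption).
$html = <<<'HTML'
<div class="main"><h3>funrace.MXSConcept.com</h3><h3>Recent Races</h3>
<table>
<tr><td><a href="/races/825.html">2014-08-14 15:32 UTC</a></td><td>8M+1L at 2013 Johnson Mine MX (1 riders)</td></tr>
<tr><td></td><td>New track record (hypothetical row)</td></tr>
<tr><td><a href="/races/824.html">2014-08-13 10:05 UTC</a></td><td>8M+1L at 2013 Lommel (2 riders)</td></tr>
</table></div>
HTML;

// Collect parser warnings internally instead of printing them for messy HTML.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$data = array();

foreach ($xpath->query('//div[@class="main"]/table[1]/tr') as $row) {
    // Rows without a link in the first cell are skipped
    // (assumed here to be the track-record rows).
    $anchor = $xpath->query('./td[1]/a', $row)->item(0);
    if ($anchor === null) {
        continue;
    }
    $descriptionCell = $xpath->query('./td[2]', $row)->item(0);
    $data[] = array(
        'datetime'    => trim($anchor->nodeValue),
        'link'        => $anchor->getAttribute('href'),
        'description' => $descriptionCell ? trim($descriptionCell->nodeValue) : '',
    );
}

print_r($data);
```

Checking for the anchor before calling getAttribute() avoids a fatal error on rows where the first cell has text but no link.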

Answer 2 (score: 1)

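Here is an update, since I also needed the links, but I'm sure there is a simpler way. The goal is to have the links in the same array; here I had to use a second one: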

    $dom = new DOMDocument();
    $dom->loadHTMLFile($selectserv);
    $xpath = new DOMXpath($dom);
    $data = array();
    $links = array();
    // target each row of the first table inside div.main
    $target_table_rows = $xpath->query('//div[@class="main"]/table[1]/tr');
    // if there are rows found,
    if($target_table_rows->length > 0) {
        // for each row, loop it
        foreach($target_table_rows as $row_key => $row) {
            // if the first td cell of this current row is empty
            if(trim($xpath->query('./td[1]', $row)->item(0)->nodeValue) == '') {
                continue; // then skip it
            }
            // each td of this current row, push it inside the array data
            foreach($row->childNodes as $td) {
                $data[$row_key][] = $td->nodeValue;
            }

        }
        foreach($target_table_rows as $container) {
            $arr = $container->getElementsByTagName("a"); //get href tags
            foreach($arr as $item) {
              $href =  $item->getAttribute("href"); //get the href value I think ?
              $links[] = array(
                'href' => $href //put href in the array
              );
            }
        }
    }
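
One possible way to keep the link in the same array as the cell values, and then build the drop-down from it, is sketched below. The inline markup, the 'cells' and 'href' keys, and the select name "race" are assumptions made for illustration; on the real page $selectserv would be loaded with loadHTMLFile() instead.

```php
<?php
// Hypothetical sample markup standing in for the page behind $selectserv.
$html = <<<'HTML'
<div class="main"><h3>Recent Races</h3>
<table>
<tr><td><a href="/races/825.html">40 minutes ago</a></td><td>8M+1L at 2013 Lommel (2 riders)</td></tr>
<tr><td></td><td>New track record (hypothetical row)</td></tr>
<tr><td><a href="/races/824.html">1 day ago</a></td><td>8M+1L at 2013 Lacapelle Marival (1 riders)</td></tr>
</table></div>
HTML;

libxml_use_internal_errors(true); // collect warnings from messy HTML silently
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$data = array();
foreach ($xpath->query('//div[@class="main"]/table[1]/tr') as $row) {
    // Gather the text of every td in this row.
    $cells = array();
    foreach ($xpath->query('./td', $row) as $td) {
        $cells[] = trim($td->nodeValue);
    }
    // Skip rows whose first cell is missing or empty.
    if (!isset($cells[0]) || $cells[0] === '') {
        continue;
    }
    // Store the first link of the row next to its cells, in one entry.
    $anchor = $row->getElementsByTagName('a')->item(0);
    $data[] = array(
        'cells' => $cells,
        'href'  => $anchor ? $anchor->getAttribute('href') : null,
    );
}

// Build the <option> list: the link is the hidden value, the cells are the label.
$options = '';
foreach ($data as $entry) {
    $options .= sprintf(
        "<option value=\"%s\">%s</option>\n",
        htmlspecialchars((string) $entry['href']),
        htmlspecialchars(implode(' - ', $entry['cells']))
    );
}
echo "<select name=\"race\">\n" . $options . "</select>\n";
```

Because each entry is built inside a single loop, the link can never drift out of step with its row, which can happen with two separate arrays once some rows are skipped.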