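Scraping table data that has no id or class with PHP

Asked: 2017-11-22 21:26:12

Tags: php web-scraping

This is the table I want to get data from. I can scrape it with PHP DOM, but the problem is that I only want the dates from rows that are not "-- Vacant --". I have been trying for 4 days with no luck.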

<table cellspacing="1" width="700px">
    <colgroup><col width="100px">
    <col width="100px">
    <col width="30px">
    <col width="30px">
    <col width="60px">
    <col width="40px">
    <col width="45px">

</colgroup><tbody><tr bgcolor="#d6d6d6">
    <th>From</th>
    <th>To</th>
    <th>In</th>
    <th>Out</th>
    <th>Name</th>
    <th>Adults</th>
    <th>Children</th>
    <th>Comment</th>
</tr>

<tr>

    <td nowrap="" style="border-bottom: 1px solid #888888">Nov Thu 23, 2017</td>
    <td nowrap="" style="border-bottom: 1px solid #888888">Nov Fri 24, 2017</td>
    <td colspan="6" style="border-bottom: 1px solid #888888; color: #3333ff; text-align: center">-- Vacant --</td>


</tr>


<tr>


    <td nowrap="" style="border-bottom: 1px solid #888888">Nov Fri 24, 2017</td>
    <td nowrap="" style="border-bottom: 1px solid #888888">Nov Mon 27, 2017</td>
    <td nowrap="" style="border-bottom: 1px solid #888888">15:00&nbsp;</td>
    <td nowrap="" style="border-bottom: 1px solid #888888">10:00&nbsp;</td>
    <td nowrap="" style="border-bottom: 1px solid #888888">WILLIAMS, KEELY</td>
    <td style="border-bottom: 1px solid #888888">4&nbsp;</td>
    <td style="border-bottom: 1px solid #888888">0&nbsp;</td>
    <td style="border-bottom: 1px solid #888888">&nbsp;</td>

</tr>


<tr>

    <td nowrap="" style="border-bottom: 1px solid #888888">Nov Mon 27, 2017</td>
    <td nowrap="" style="border-bottom: 1px solid #888888">Dec Thu 07, 2017</td>
    <td colspan="6" style="border-bottom: 1px solid #888888; color: #3333ff; text-align: center">-- Vacant --</td>


</tr>


<tr>


    <td nowrap="" style="border-bottom: 1px solid #888888">Dec Thu 07, 2017</td>
    <td nowrap="" style="border-bottom: 1px solid #888888">Dec Sun 10, 2017</td>
    <td nowrap="" style="border-bottom: 1px solid #888888">15:00&nbsp;</td>
    <td nowrap="" style="border-bottom: 1px solid #888888">10:00&nbsp;</td>
    <td nowrap="" style="border-bottom: 1px solid #888888">HALL, TYLER</td>
    <td style="border-bottom: 1px solid #888888">4&nbsp;</td>
    <td style="border-bottom: 1px solid #888888">0&nbsp;</td>
    <td style="border-bottom: 1px solid #888888">&nbsp;</td>

</tr>


<tr>

    <td nowrap="" style="border-bottom: 1px solid #888888">Dec Sun 10, 2017</td>
    <td nowrap="" style="border-bottom: 1px solid #888888">Dec Sat 16, 2017</td>
    <td colspan="6" style="border-bottom: 1px solid #888888; color: #3333ff; text-align: center">-- Vacant --</td>


</tr>
</tbody></table>

I only want the "From" and "To" field values. There is no id or class here, so I used this approach:

$html = ''; // placeholder: the fetched HTML goes here

$pokemon_doc = new DOMDocument();

libxml_use_internal_errors(TRUE); //disable libxml errors

if(!empty($html)){ //if any html is actually returned

    $pokemon_doc->loadHTML($html);
    libxml_clear_errors(); //remove errors for yucky html

    $pokemon_xpath = new DOMXPath($pokemon_doc);

    // get every <td> whose style attribute is exactly the plain bottom-border style
    $pokemon_row = $pokemon_xpath->query('//table//td[@style="border-bottom: 1px solid #888888"]');

    if($pokemon_row->length > 0){

        $oe = 1;
        foreach($pokemon_row as $row){
            if ($oe % 2 == 0) {
                //mysqli_query($con,"INSERT INTO booking VALUES('','','".(validateDate($row->nodeValue) ? $row->nodeValue : '')."')");
                echo (validateDate($row->nodeValue) && $row->nodeValue!='-- Vacant --' ? $row->nodeValue : '') . " | <br>";
            } else {
                //mysqli_query($con,"INSERT INTO booking VALUES('','".(validateDate($row->nodeValue) ? $row->nodeValue : '')."','')");
                echo (validateDate($row->nodeValue) && $row->nodeValue!='-- Vacant --' ? $row->nodeValue : '') . " , <br>";
            }

            $oe++;
        }
    }
} else {
    echo 'No HTML returned.';
}


// Check date validate function
function validateDate($date)
{
    $d = DateTime::createFromFormat('M D d, Y', $date);
    return $d && $d->format('M D d, Y') == $date;
}

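The problem is that I don't want the dates from the "-- Vacant --" rows. I tried this code, but with no luck.

Can anyone help me? Thanks.

1 Answer:

Answer 0: (score: 1)

I shortened it and rewrote it to run from the CLI, but this XPath query works for me: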

// "tr[td ...]" skips the header row (it has only <th> cells); not(contains(...)) drops the vacant rows
$pokemon_row = $pokemon_xpath->query('//table//tr[td and not(contains(., \'-- Vacant --\'))]');

if ($pokemon_row->length > 0) {
    foreach ($pokemon_row as $row) {
        // Cells of the current row only (the query is relative to $row)
        $nodeList = $pokemon_xpath->query('td', $row);

        $fromNode = $nodeList->item(0); // "From" column
        $toNode = $nodeList->item(1);   // "To" column

        echo 'From :' . (validateDate($fromNode->nodeValue) ? $fromNode->nodeValue : '') . PHP_EOL;
        echo 'To :' . (validateDate($toNode->nodeValue) ? $toNode->nodeValue : '') . PHP_EOL;
    }
}
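
For anyone who wants to try the row-level filter in isolation, here is a minimal self-contained sketch. It assumes a cut-down copy of the question's table (one vacant row, one booked row), and the helper name `bookedRanges` is hypothetical, not something from the original code. It filters whole rows first and then reads the first two cells of each remaining row, instead of selecting cells by their style attribute.

```php
<?php
// Sample markup reduced from the question's table: one vacant row, one booked row.
$html = <<<'HTML'
<table>
  <tr><th>From</th><th>To</th><th>In</th></tr>
  <tr><td>Nov Thu 23, 2017</td><td>Nov Fri 24, 2017</td><td colspan="6">-- Vacant --</td></tr>
  <tr><td>Nov Fri 24, 2017</td><td>Nov Mon 27, 2017</td><td>15:00</td></tr>
</table>
HTML;

/**
 * Hypothetical helper (not part of the original post): returns the From/To
 * text of every table row that does not contain "-- Vacant --".
 */
function bookedRanges(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings for sloppy HTML
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // "tr[td ...]" skips the header row, which only has <th> cells.
    $rows = $xpath->query('//table//tr[td and not(contains(., "-- Vacant --"))]');

    $ranges = [];
    foreach ($rows as $row) {
        // Relative query: only the cells of the current row.
        $cells = $xpath->query('td', $row);
        if ($cells->length < 2) {
            continue; // skip rows that lack both a From and a To cell
        }
        $ranges[] = [
            'from' => trim($cells->item(0)->nodeValue),
            'to'   => trim($cells->item(1)->nodeValue),
        ];
    }
    return $ranges;
}

print_r(bookedRanges($html));
```

The original selector matched cells by their exact style attribute, so it excluded the "-- Vacant --" cell itself but still picked up the two date cells that sit next to it in the same row. Filtering at the `tr` level avoids that. The `validateDate` function from the question could still be applied to each value before inserting it into the database.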