Problem parsing a table on a Wikipedia page

Time: 2013-05-25 23:13:18

Tags: php html parsing

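Hello, I am trying to parse a Wikipedia page that contains a table with the class "infobox biota". I want to extract the values of the following rows from the table shown below:

Kingdom:
Phylum:
Subphylum:
Class:
Order:
Family: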

<table class="infobox biota" style="text-align: left; width: 200px; font-size: 100%">
<tbody><tr>
<th colspan="2" style="text-align: center; background-color: rgb(211,211,164)">Rabbit</th>
</tr>
<tr>
<td colspan="2" style="text-align: center"><a href="/wiki/File:Rabbit_in_montana.jpg" class="image"><img alt="" src="//upload.wikimedia.org/wikipedia/commons/thumb/3/3b/Rabbit_in_montana.jpg/250px-Rabbit_in_montana.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/3/3b/Rabbit_in_montana.jpg/375px-Rabbit_in_montana.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/3/3b/Rabbit_in_montana.jpg/500px-Rabbit_in_montana.jpg 2x" height="222" width="250"></a></td>
</tr>
<tr>
<th colspan="2" style="text-align: center; background-color: rgb(211,211,164)"><a href="/wiki/Biological_classification" title="Biological classification">Scientific classification</a></th>
</tr>
<tr>
<td>Kingdom:</td>
<td><span class="kingdom" style="white-space:nowrap;"><a href="/wiki/Animal" title="Animal">Animalia</a></span></td>
</tr>
<tr>
<td>Phylum:</td>
<td><span class="phylum" style="white-space:nowrap;"><a href="/wiki/Chordate" title="Chordate">Chordata</a></span></td>
</tr>
<tr>
<td>Subphylum:</td>
<td><span class="subphylum" style="white-space:nowrap;"><a href="/wiki/Vertebrata" title="Vertebrata" class="mw-redirect">Vertebrata</a></span></td>
</tr>
<tr>
<td>Class:</td>
<td><span class="class" style="white-space:nowrap;"><a href="/wiki/Mammal" title="Mammal">Mammalia</a></span></td>
</tr>
<tr>
<td>Order:</td>
<td><span class="order" style="white-space:nowrap;"><a href="/wiki/Lagomorpha" title="Lagomorpha">Lagomorpha</a></span></td>
</tr>
<tr>
<td>Family:</td>
<td><span class="family" style="white-space:nowrap;"><a href="/wiki/Leporidae" title="Leporidae">Leporidae</a><br>
<small>in part</small></span></td>
</tr>
<tr>
<th colspan="2" style="text-align: center; background-color: rgb(211,211,164)">Genera</th>
</tr>
<tr>
<td colspan="2" style="text-align: left">
<div>
<table style="background-color:transparent;table-layout:fixed;" border="0" cellpadding="0" cellspacing="0" width="100%">
<tbody><tr valign="top">
<td>
<div style="margin-right:20px;">
<p><i><a href="/wiki/Pentalagus" title="Pentalagus" class="mw-redirect">Pentalagus</a></i><br>
<i><a href="/wiki/Bunolagus" title="Bunolagus" class="mw-redirect">Bunolagus</a></i><br>
<i><a href="/wiki/Nesolagus" title="Nesolagus">Nesolagus</a></i><br>
<i><a href="/wiki/Romerolagus" title="Romerolagus" class="mw-redirect">Romerolagus</a></i></p>
</div>
</td>
<td>
<div style="margin-right: 20px;">
<p><i><a href="/wiki/Brachylagus" title="Brachylagus" class="mw-redirect">Brachylagus</a></i><br>
<i><a href="/wiki/Sylvilagus" title="Sylvilagus" class="mw-redirect">Sylvilagus</a></i><br>
<i><a href="/wiki/European_Rabbit" title="European Rabbit" class="mw-redirect">Oryctolagus</a></i><br>
<i><a href="/wiki/Poelagus" title="Poelagus" class="mw-redirect">Poelagus</a></i></p>
</div>
</td>
</tr>
</tbody></table>
</div>
</td>
</tr>
</tbody></table>

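Below is my attempt to parse this table and get the rabbit's kingdom, phylum, subphylum, class, order and family. But I get the following array:

Array ( [Kingdom:] => [Phylum:] => [Subphylum:] => [Class:] => [Order:] => [Family:] => [ Pentalagus Bunolagus Nesolagus Romerolagus ] => )

It does not fill the array with the rabbit's data. It also gives a parse error on the line marked below. What could be going wrong?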

<?php
//require"mydb.php";
header('Content-type: text/html; charset=utf-8'); // this just makes sure encoding is right
include('simple_html_dom.php'); // the parser library

$html = file_get_html('http://en.wikipedia.org/wiki/Rabbit');
$table = $html->find('table.infobox');

$data = array();

foreach($table[0]->find('tr') as $row)
{    
    $td = $row->find('> td');

    if (count($td) == 2)
    {
        $name = $td[0]->innertext;
        $text = $td[1]->find('a')[0]->innertext;   // PARSE ERROR IS GIVEN HERE, after find('a')[0]; removing the [0] makes the error go away but gives me no results

        $data[$name] = $text;
    }
}

print_r($data);
?>

1 answer:

Answer 0 (score: 3)

$text = $td[1]->find('a')[0]->innertext; 

On this line you are dereferencing a function call, that is, indexing its return value directly. That only works in PHP 5.4 or later. Try this:

$td = $td[1]->find('a');
$text = $td[0]->innertext;
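
The same assign-then-index pattern can be sketched without the parser library. This is only an illustration: `find_links()` and `first_link()` are hypothetical stand-ins for a method that returns an array, such as `find('a')`, and are not part of simple_html_dom. The `isset()` guard is an addition for rows that contain no link, such as the header and genera rows of the infobox.

```php
<?php
// Hypothetical stand-in for a method such as find('a') that returns an array.
function find_links($cell)
{
    return $cell === 'kingdom' ? array('Animalia') : array();
}

// Return the first match, or null when there is none.
// The assign-then-index form also parses on PHP versions before 5.4.
function first_link($cell)
{
    $links = find_links($cell);                  // step 1: store the returned array
    return isset($links[0]) ? $links[0] : null;  // step 2: index it safely
}

$first = first_link('kingdom');    // a cell that contains a link
$missing = first_link('genera');   // a cell without any link

var_dump($first);
var_dump($missing);
```

In the question's loop, the equivalent would be to skip a row when the stored result of `find('a')` is empty. As far as I know, simple_html_dom's `find()` also accepts an index as its second argument, so `$td[1]->find('a', 0)` should return the first match or null without any dereferencing, but check that against the version of the library you are using.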