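I am using BeautifulSoup in a Django project to extract all of the data from an HTML table. I have code that pulls out the table and saves its data as a list of lists: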
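import bs4 as bs  # assumed import; the original snippet only shows the `bs` alias

soup = bs.BeautifulSoup(html_source, 'lxml')
table = soup.find('table', {'id': 'detail'})
rows = table.find_all('tr')
# Collect the text nodes of every header/data cell, row by row
data = [[td.find_all(string=True) for td in tr.find_all(['th', 'td'])] for tr in rows]
# Join each cell's text fragments into one stripped string
data = [[u"".join(d).strip() for d in row] for row in data]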
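This code has worked well so far, but for some reason it does not capture all of the data in the HTML table below. It only gets the thead row, and I cannot figure out why.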
<table class="table_type1" data-tdborder="" id="detail">
<colgroup>
<col width="38">
<col>
<col>
<col width="140">
</colgroup>
<thead>
<tr>
<th>No.</th>
<th>Status</th>
<th>Location</th>
<th>Event Date</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:center;">1</td>
<td class="multi_row" style="line-height:15px;">Empty Container Release to Shipper</td>
<td class="multi_row" style="line-height:15px;">SHANGHAI, SHANGHAI ,CHINA<br> <a href="JavaScript:void(0);" style="line-height:15px;" onclick="openLocationPopup('CNSHA49')" title="GREATING FORTUNE (SHANGHAI) CONTAIN">GREATING FORTUNE (SHANGHAI) CONTAIN</a></td>
<td class="ico_a">2017-10-09 10:51</td>
</tr>
<tr>
<td style="text-align:center;">2</td>
<td class="multi_row" style="line-height:15px;">Gate In to Outbound Terminal</td>
<td class="multi_row" style="line-height:15px;">SHANGHAI, SHANGHAI ,CHINA<br> <a href="JavaScript:void(0);" style="line-height:15px;" onclick="openLocationPopup('CNSHA10')" title="SHANGHAI SHENDONG INTERNATIONAL CON (DXYS)">SHANGHAI SHENDONG INTERNATIONAL CON (DXYS)</a></td>
<td class="ico_a">2017-10-10 04:43</td>
</tr>
<tr>
<td style="text-align:center;">3</td>
<td class="multi_row" style="line-height:15px;">Loaded on 'NYK LYNX 2610E' at Port of Loading<br> <a href="JavaScript:void(0);" style="line-height:15px;" title="NYK LYNX" data-click="vesselPop" data-cd="YNXT0260E">NYK LYNX 2610E</a></td>
<td class="multi_row" style="line-height:15px;">SHANGHAI, SHANGHAI ,CHINA<br> <a href="JavaScript:void(0);" style="line-height:15px;" onclick="openLocationPopup('CNSHA10')" title="SHANGHAI SHENDONG INTERNATIONAL CON (DXYS)">SHANGHAI SHENDONG INTERNATIONAL CON (DXYS)</a></td>
<td class="ico_a">2017-10-11 22:58</td>
</tr>
<tr>
<td style="text-align:center;">4</td>
<td class="multi_row" style="line-height:15px;">'NYK LYNX 2610E' Departure from Port of Loading<br> <a href="JavaScript:void(0);" style="line-height:15px;" title="NYK LYNX" data-click="vesselPop" data-cd="YNXT0260E">NYK LYNX 2610E</a></td>
<td class="multi_row" style="line-height:15px;">SHANGHAI, SHANGHAI ,CHINA<br> <a href="JavaScript:void(0);" style="line-height:15px;" onclick="openLocationPopup('CNSHA10')" title="SHANGHAI SHENDONG INTERNATIONAL CON (DXYS)">SHANGHAI SHENDONG INTERNATIONAL CON (DXYS)</a></td>
<td class="ico_a">2017-10-12 05:00</td>
</tr>
<tr>
<td style="text-align:center;">5</td>
<td class="multi_row" style="line-height:15px;">'NYK LYNX 2610E' Arrival at Port of Discharging<br> <a href="JavaScript:void(0);" style="line-height:15px;" title="NYK LYNX" data-click="vesselPop" data-cd="YNXT0260E">NYK LYNX 2610E</a></td>
<td class="multi_row" style="line-height:15px;">VALPARAISO ,CHILE<br> <a href="JavaScript:void(0);" style="line-height:15px;" onclick="openLocationPopup('CLVAP01')" title="TERMINAL PACIFICO SUR">TERMINAL PACIFICO SUR</a></td>
<td class="ico_e">2017-11-14 21:00</td>
</tr>
<tr>
<td style="text-align:center;">6</td>
<td class="multi_row" style="line-height:15px;">'NYK LYNX 2610E' POD Berthing Destination<br> <a href="JavaScript:void(0);" style="line-height:15px;" title="NYK LYNX" data-click="vesselPop" data-cd="YNXT0260E">NYK LYNX 2610E</a></td>
<td class="multi_row" style="line-height:15px;">VALPARAISO ,CHILE<br> <a href="JavaScript:void(0);" style="line-height:15px;" onclick="openLocationPopup('CLVAP01')" title="TERMINAL PACIFICO SUR">TERMINAL PACIFICO SUR</a></td>
<td class="ico_e">2017-11-14 22:00</td>
</tr>
<tr>
<td style="text-align:center;">7</td>
<td class="multi_row" style="line-height:15px;">Unloaded from 'NYK LYNX 2610E' at Port of Discharging<br> <a href="JavaScript:void(0);" style="line-height:15px;" title="NYK LYNX" data-click="vesselPop" data-cd="YNXT0260E">NYK LYNX 2610E</a></td>
<td class="multi_row" style="line-height:15px;">VALPARAISO ,CHILE<br> <a href="JavaScript:void(0);" style="line-height:15px;" onclick="openLocationPopup('CLVAP01')" title="TERMINAL PACIFICO SUR">TERMINAL PACIFICO SUR</a></td>
<td class="ico_e">2017-11-14 23:30</td>
</tr>
<tr>
<td style="text-align:center;">8</td>
<td class="multi_row" style="line-height:15px;">Gate Out from Inbound Terminal for Delivery to Consignee (or Port Shuttle)</td>
<td class="multi_row" style="line-height:15px;">VALPARAISO ,CHILE<br> <a href="JavaScript:void(0);" style="line-height:15px;" onclick="openLocationPopup('CLVAP01')" title="TERMINAL PACIFICO SUR">TERMINAL PACIFICO SUR</a></td>
<td class="ico_e">2017-11-15 04:00</td>
</tr>
<tr>
<td style="text-align:center;">9</td>
<td class="multi_row" style="line-height:15px;">Empty Container Returned from Customer</td>
<td class="multi_row" style="line-height:15px;">VALPARAISO ,CHILE<br> </td>
<td class="ico_e">2017-11-15 10:00</td>
</tr>
</tbody>
</table>
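Edit
I printed the soup object and went through all of the HTML. Surprisingly, it contains only the table's thead and not the tbody. Is this a bug in BeautifulSoup? This is the only part of the table that beautifulsoup4 captured: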
<table class="table_type1" data-tdborder="" id="detail">
<colgroup>
<col width="38"/>
<col/>
<col/>
<col width="140"/>
</colgroup>
<thead>
<tr>
<th>No.</th>
<th>Status</th>
<th>Location</th>
<th>Event Date</th>
</tr>
</thead>
</table>
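One way to narrow this down is to check whether the rows are present in the raw html_source at all, before BeautifulSoup touches it. The snippet below is only a minimal sketch using the standard library: count_tags is a hypothetical helper, and sample is a made-up stand-in for the downloaded page, not the real response. If the raw markup itself has no tbody and only one tr, the parser is probably not at fault and the rows may be filled in after page load (for example by JavaScript), although that is an assumption to verify.

```python
import re


def count_tags(html_source, tag):
    """Count opening <tag ...> occurrences in raw markup (case-insensitive).

    Hypothetical helper for diagnosis only; it does not parse HTML, it just
    scans the string, so it is independent of BeautifulSoup and lxml.
    """
    # The lookahead stops "<tr" from matching longer names such as "<track".
    pattern = r"<%s(?=[\s>/])" % re.escape(tag)
    return len(re.findall(pattern, html_source, flags=re.IGNORECASE))


# Made-up sample standing in for html_source (assumption, not the real page).
sample = '<table id="detail"><thead><tr><th>No.</th></tr></thead></table>'

raw_counts = {tag: count_tags(sample, tag) for tag in ("thead", "tbody", "tr")}
print(raw_counts)
```

Running the same counts on the real html_source and comparing them with len(soup.find_all('tr')) should show whether rows are lost during parsing or were never in the downloaded markup.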