BeautifulSoup: get all table data

Time: 2017-10-31 18:47:17

Tags: python-3.x beautifulsoup

I am using BeautifulSoup with Django to get all the data from an HTML table. I have code that pulls out the table and saves its data as a list of lists:

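import bs4 as bs

soup = bs.BeautifulSoup(html_source, 'lxml')
table = soup.find('table', {'id': 'detail'})
rows = table.find_all('tr')

# Collect the text nodes of every header/data cell, row by row
data = [[td.find_all(string=True) for td in tr.find_all(['th', 'td'])] for tr in rows]
# Join each cell's text nodes into one stripped string
data = [["".join(d).strip() for d in l] for l in data]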

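This code has worked well so far, but for some reason it does not capture all of the data in the HTML table below. It only gets the thead row, and I can't figure out why.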

<table class="table_type1" data-tdborder="" id="detail">
   <colgroup>
      <col width="38">
      <col>
      <col>
      <col width="140">
   </colgroup>
   <thead>
      <tr>
         <th>No.</th>
         <th>Status</th>
         <th>Location</th>
         <th>Event Date</th>
      </tr>
   </thead>
   <tbody>
      <tr>
         <td style="text-align:center;">1</td>
         <td class="multi_row" style="line-height:15px;">Empty Container Release to Shipper</td>
         <td class="multi_row" style="line-height:15px;">SHANGHAI, SHANGHAI ,CHINA<br>  <a href="JavaScript:void(0);" style="line-height:15px;" onclick="openLocationPopup('CNSHA49')" title="GREATING FORTUNE (SHANGHAI) CONTAIN">GREATING FORTUNE (SHANGHAI) CONTAIN</a></td>
         <td class="ico_a">2017-10-09 10:51</td>
      </tr>
      <tr>
         <td style="text-align:center;">2</td>
         <td class="multi_row" style="line-height:15px;">Gate In to Outbound Terminal</td>
         <td class="multi_row" style="line-height:15px;">SHANGHAI, SHANGHAI ,CHINA<br>  <a href="JavaScript:void(0);" style="line-height:15px;" onclick="openLocationPopup('CNSHA10')" title="SHANGHAI SHENDONG INTERNATIONAL CON (DXYS)">SHANGHAI SHENDONG INTERNATIONAL CON (DXYS)</a></td>
         <td class="ico_a">2017-10-10 04:43</td>
      </tr>
      <tr>
         <td style="text-align:center;">3</td>
         <td class="multi_row" style="line-height:15px;">Loaded on 'NYK LYNX 2610E' at Port of Loading<br> <a href="JavaScript:void(0);" style="line-height:15px;" title="NYK LYNX" data-click="vesselPop" data-cd="YNXT0260E">NYK LYNX 2610E</a></td>
         <td class="multi_row" style="line-height:15px;">SHANGHAI, SHANGHAI ,CHINA<br>  <a href="JavaScript:void(0);" style="line-height:15px;" onclick="openLocationPopup('CNSHA10')" title="SHANGHAI SHENDONG INTERNATIONAL CON (DXYS)">SHANGHAI SHENDONG INTERNATIONAL CON (DXYS)</a></td>
         <td class="ico_a">2017-10-11 22:58</td>
      </tr>
      <tr>
         <td style="text-align:center;">4</td>
         <td class="multi_row" style="line-height:15px;">'NYK LYNX 2610E' Departure from Port of Loading<br> <a href="JavaScript:void(0);" style="line-height:15px;" title="NYK LYNX" data-click="vesselPop" data-cd="YNXT0260E">NYK LYNX 2610E</a></td>
         <td class="multi_row" style="line-height:15px;">SHANGHAI, SHANGHAI ,CHINA<br>  <a href="JavaScript:void(0);" style="line-height:15px;" onclick="openLocationPopup('CNSHA10')" title="SHANGHAI SHENDONG INTERNATIONAL CON (DXYS)">SHANGHAI SHENDONG INTERNATIONAL CON (DXYS)</a></td>
         <td class="ico_a">2017-10-12 05:00</td>
      </tr>
      <tr>
         <td style="text-align:center;">5</td>
         <td class="multi_row" style="line-height:15px;">'NYK LYNX 2610E' Arrival at Port of Discharging<br> <a href="JavaScript:void(0);" style="line-height:15px;" title="NYK LYNX" data-click="vesselPop" data-cd="YNXT0260E">NYK LYNX 2610E</a></td>
         <td class="multi_row" style="line-height:15px;">VALPARAISO ,CHILE<br>  <a href="JavaScript:void(0);" style="line-height:15px;" onclick="openLocationPopup('CLVAP01')" title="TERMINAL PACIFICO SUR">TERMINAL PACIFICO SUR</a></td>
         <td class="ico_e">2017-11-14 21:00</td>
      </tr>
      <tr>
         <td style="text-align:center;">6</td>
         <td class="multi_row" style="line-height:15px;">'NYK LYNX 2610E' POD Berthing Destination<br> <a href="JavaScript:void(0);" style="line-height:15px;" title="NYK LYNX" data-click="vesselPop" data-cd="YNXT0260E">NYK LYNX 2610E</a></td>
         <td class="multi_row" style="line-height:15px;">VALPARAISO ,CHILE<br>  <a href="JavaScript:void(0);" style="line-height:15px;" onclick="openLocationPopup('CLVAP01')" title="TERMINAL PACIFICO SUR">TERMINAL PACIFICO SUR</a></td>
         <td class="ico_e">2017-11-14 22:00</td>
      </tr>
      <tr>
         <td style="text-align:center;">7</td>
         <td class="multi_row" style="line-height:15px;">Unloaded from 'NYK LYNX 2610E' at Port of Discharging<br> <a href="JavaScript:void(0);" style="line-height:15px;" title="NYK LYNX" data-click="vesselPop" data-cd="YNXT0260E">NYK LYNX 2610E</a></td>
         <td class="multi_row" style="line-height:15px;">VALPARAISO ,CHILE<br>  <a href="JavaScript:void(0);" style="line-height:15px;" onclick="openLocationPopup('CLVAP01')" title="TERMINAL PACIFICO SUR">TERMINAL PACIFICO SUR</a></td>
         <td class="ico_e">2017-11-14 23:30</td>
      </tr>
      <tr>
         <td style="text-align:center;">8</td>
         <td class="multi_row" style="line-height:15px;">Gate Out from Inbound Terminal for Delivery to Consignee (or Port Shuttle)</td>
         <td class="multi_row" style="line-height:15px;">VALPARAISO ,CHILE<br>  <a href="JavaScript:void(0);" style="line-height:15px;" onclick="openLocationPopup('CLVAP01')" title="TERMINAL PACIFICO SUR">TERMINAL PACIFICO SUR</a></td>
         <td class="ico_e">2017-11-15 04:00</td>
      </tr>
      <tr>
         <td style="text-align:center;">9</td>
         <td class="multi_row" style="line-height:15px;">Empty Container Returned from Customer</td>
         <td class="multi_row" style="line-height:15px;">VALPARAISO ,CHILE<br> </td>
         <td class="ico_e">2017-11-15 10:00</td>
      </tr>
   </tbody>
</table>
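
As a sanity check, the same kind of extraction can be run against a trimmed, static copy of the table above. This is only a sketch under a few assumptions: `bs4` is installed, the built-in `html.parser` is used instead of `lxml` so nothing else is required, and `get_text` stands in for the join-and-strip step in the question. If both the header row and the body rows come back, the extraction logic itself is probably fine, and the `html_source` being parsed may not be the same markup the browser displays.

```python
from bs4 import BeautifulSoup

# Trimmed static copy of the table above (rows 1 and 9 only), for illustration.
html_source = """
<table class="table_type1" data-tdborder="" id="detail">
   <thead>
      <tr><th>No.</th><th>Status</th><th>Location</th><th>Event Date</th></tr>
   </thead>
   <tbody>
      <tr>
         <td style="text-align:center;">1</td>
         <td class="multi_row">Empty Container Release to Shipper</td>
         <td class="multi_row">SHANGHAI, SHANGHAI ,CHINA<br>  <a href="JavaScript:void(0);">GREATING FORTUNE (SHANGHAI) CONTAIN</a></td>
         <td class="ico_a">2017-10-09 10:51</td>
      </tr>
      <tr>
         <td style="text-align:center;">9</td>
         <td class="multi_row">Empty Container Returned from Customer</td>
         <td class="multi_row">VALPARAISO ,CHILE<br> </td>
         <td class="ico_e">2017-11-15 10:00</td>
      </tr>
   </tbody>
</table>
"""

soup = BeautifulSoup(html_source, "html.parser")
table = soup.find("table", {"id": "detail"})

# One list per <tr>; each cell's text fragments are stripped and joined with a space.
data = [
    [cell.get_text(" ", strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in table.find_all("tr")
]

for row in data:
    print(row)
```

Here `data` holds three rows (one header row and two body rows), so thead and tbody are both reachable through `find_all('tr')` when they are present in the parsed string.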

Edit

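I printed the soup object and went through all of the HTML. Surprisingly, it contains only the table's thead and not the tbody. Is this a bug in BeautifulSoup? This is the only part of the table that beautifulsoup4 captured: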

 <table class="table_type1" data-tdborder="" id="detail">
    <colgroup>
       <col width="38"/>
       <col/>
       <col/>
       <col width="140"/>
    </colgroup>
    <thead>
       <tr>
          <th>No.</th>
          <th>Status</th>
          <th>Location</th>
          <th>Event Date</th>
       </tr>
    </thead>
 </table>
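
One way to narrow this down is to look at the raw `html_source` string before it ever reaches BeautifulSoup. The sketch below uses a hypothetical helper, `summarize_table_markup`, that relies only on the standard library and a deliberately crude regular expression (it is not a real HTML parser). If the raw string already has no `<tbody>`, the parser is not dropping anything. One possible explanation, which is an assumption and not something confirmed here, is that the page fills in the rows with JavaScript after loading, so a plain HTTP fetch never receives them.

```python
import re


def summarize_table_markup(html_source, table_id="detail"):
    """Report what the raw markup for the given table id contains.

    Hypothetical diagnostic helper: a crude regex scan, not an HTML parser.
    """
    pattern = re.compile(
        r'<table\b[^>]*\bid="' + re.escape(table_id) + r'"[^>]*>(.*?)</table>',
        re.IGNORECASE | re.DOTALL,
    )
    match = pattern.search(html_source)
    if match is None:
        return {"found": False, "has_tbody": False, "row_count": 0}
    inner = match.group(1)
    return {
        "found": True,
        "has_tbody": re.search(r"<tbody\b", inner, re.IGNORECASE) is not None,
        "row_count": len(re.findall(r"<tr\b", inner, re.IGNORECASE)),
    }


# Sample input mirroring the captured markup above (header only, no tbody).
captured = """
<table class="table_type1" data-tdborder="" id="detail">
   <colgroup><col width="38"/><col/><col/><col width="140"/></colgroup>
   <thead>
      <tr><th>No.</th><th>Status</th><th>Location</th><th>Event Date</th></tr>
   </thead>
</table>
"""

summary = summarize_table_markup(captured)
print(summary)
```

Running the same helper on the string actually fetched by the application, and comparing it with the markup shown in the browser's developer tools, should show whether the body rows are missing from the response itself.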

0 answers:

No answers