BeautifulSoup: get the text of all td elements inside a tr (some text contains commas)

Time: 2018-08-04 12:27:30

Tags: python html python-3.x beautifulsoup lxml

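I am currently working with a table generated by ASP. It is very messy, but with some help on the code I think it will give me what I need. I have the HTML below, and I want the output to be one array per tr, holding the text of that row's td elements. I also don't want the "-" to be part of the array output.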

Some td elements contain two commas, and in others the text is separated only by whitespace.

The HTML looks like this:

 <tr bgcolor="#EFEFEF">
  <td>
   <a href="free.asp?detail=hide&amp;c_id=4342141">
    <img align="absmiddle" border="0" hspace="0" src="pic/bullet.gif" vspace="0"/>
   </a>
  </td>
  <td>
   4342141
  </td>
  <td width="10">
  </td>
  <td>
   25.07.2018 09:00
  </td>
  <td width="10">
  </td>
  <td>
   Golbasi Ankara, Turkey
  </td>
  <td width="10">
   -
  </td>
  <td>
   Konya Havalimani Turkey
  </td>
  <td colspan="2">
  </td>
 </tr>
 <tr bgcolor="#EFEFEF" height="3">
  <td colspan="10">
  </td>
 </tr>
 <tr bgcolor="#FFFFFF" height="1">
  <td colspan="10">
  </td>
 </tr>
 <tr bgcolor="#DDDDDD" height="6">
  <td colspan="10">
  </td>
 </tr>
 <tr bgcolor="#FFFFFF" height="1">
  <td colspan="10">
  </td>
 </tr>
 <tr bgcolor="#DEE3E7" height="3">
  <td colspan="10">
  </td>
 </tr>
 <tr bgcolor="#DEE3E7">
  <td>
   <a href="free.asp?detail=hide&amp;c_id=4134123">
    <img align="absmiddle" border="0" hspace="0" src="pic/bullet.gif" vspace="0"/>
   </a>
  </td>
  <td>
   4134123
  </td>
  <td width="10">
  </td>
  <td>
   26.07.2018 09:00
  </td>
  <td width="10">
  </td>
  <td>
   Kucuktepe, Van, Turkey
  </td>
  <td width="10">
   -
  </td>
  <td>
   Maltepe, Istanbul, Turkey
  </td>
  <td colspan="2">
  </td>
 </tr>

The output I want looks like this:

[['4342141', '25.07.2018', '09:00', 'Golbasi Ankara, Turkey', '-', 'Konya Havalimani Turkey', 'free.asp?detail=hide&c_id=4342141'], ['4134123', '26.07.2018', '09:00', 'Kucuktepe, Van, Turkey', '-', 'Maltepe, Istanbul, Turkey', 'free.asp?detail=hide&c_id=4134123']]

1 answer:

Answer 0: (score: 0)

Assuming the variable data holds the HTML text:

from bs4 import BeautifulSoup
from pprint import pprint

soup = BeautifulSoup(data, 'lxml')
rows = []
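for tr in soup.select('tr'):
    # Strip each cell once, then drop empty cells and the '-' separator cell.
    cells = (td.text.strip() for td in tr.select('td'))
    row = [cell for cell in cells if cell and cell != '-']
    if row:  # skip spacer rows that contain no text
        rows.append(row)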

pprint(rows, width=120)

This prints:

[['4342141', '25.07.2018 09:00', 'Golbasi Ankara, Turkey', 'Konya Havalimani Turkey'],
 ['4134123', '26.07.2018 09:00', 'Kucuktepe, Van, Turkey', 'Maltepe, Istanbul, Turkey']]
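
The output requested in the question also splits the date from the time and ends each row with the href of the row's link. The following is a minimal sketch of one way to get that shape. It is not part of the original answer: the parse_rows helper, the shortened sample HTML, and the use of the built-in html.parser instead of lxml are assumptions made for illustration. It drops the "-" cell, as the question's text asks.

```python
from bs4 import BeautifulSoup

# Shortened sample (assumption): one data row and one spacer row from the question.
sample = """
<tr bgcolor="#EFEFEF">
 <td><a href="free.asp?detail=hide&amp;c_id=4342141"><img src="pic/bullet.gif"/></a></td>
 <td> 4342141 </td>
 <td width="10"></td>
 <td> 25.07.2018 09:00 </td>
 <td width="10"></td>
 <td> Golbasi Ankara, Turkey </td>
 <td width="10"> - </td>
 <td> Konya Havalimani Turkey </td>
 <td colspan="2"></td>
</tr>
<tr bgcolor="#EFEFEF" height="3"><td colspan="10"></td></tr>
"""


def parse_rows(html):
    """Return one list per data row: id, date, time, origin, destination, href."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for tr in soup.find_all('tr'):
        link = tr.find('a', href=True)
        if link is None:
            continue  # spacer rows carry no link, so skip them
        cells = [td.get_text(strip=True) for td in tr.find_all('td')]
        cells = [c for c in cells if c and c != '-']
        # Assumes the second non-empty cell looks like "DD.MM.YYYY HH:MM".
        row = [cells[0], *cells[1].split(), *cells[2:], link['href']]
        rows.append(row)
    return rows


result = parse_rows(sample)
print(result)
```

Using the link as the marker for a data row is a design choice that fits this table, where only real rows carry an a tag; a table laid out differently would need another test.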

To write the rows list to a CSV file, you can use the following script:

import csv

with open('data.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(rows)

The data.csv file will then contain:

4342141,25.07.2018 09:00,"Golbasi Ankara, Turkey",Konya Havalimani Turkey
4134123,26.07.2018 09:00,"Kucuktepe, Van, Turkey","Maltepe, Istanbul, Turkey"
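
The double quotes in the file are added by csv.writer around fields that contain a comma, so those commas do not turn into extra columns when the file is read back. Here is a small sketch of that round trip. It uses an in-memory io.StringIO buffer in place of data.csv, which is an assumption made only to keep the example self-contained.

```python
import csv
import io

rows = [
    ['4342141', '25.07.2018 09:00', 'Golbasi Ankara, Turkey', 'Konya Havalimani Turkey'],
    ['4134123', '26.07.2018 09:00', 'Kucuktepe, Van, Turkey', 'Maltepe, Istanbul, Turkey'],
]

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO(newline='')
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()

# Reading the text back restores the original fields, commas included.
restored = list(csv.reader(io.StringIO(csv_text, newline='')))
print(restored == rows)
```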