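I'm currently working with a table generated by an ASP page. The markup is very messy, but with some help on the code I think it will give me what I need. I have the HTML below, and I want the td cells of each tr to become one array. I also don't want "-" to be part of the array output.
Some td cells contain two commas, and in some of them the text is separated only by whitespace (" ").
The HTML looks like this: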
<tr bgcolor="#EFEFEF">
<td>
<a href="free.asp?detail=hide&c_id=4342141">
<img align="absmiddle" border="0" hspace="0" src="pic/bullet.gif" vspace="0"/>
</a>
</td>
<td>
4342141
</td>
<td width="10">
</td>
<td>
25.07.2018 09:00
</td>
<td width="10">
</td>
<td>
Golbasi Ankara, Turkey
</td>
<td width="10">
-
</td>
<td>
Konya Havalimani Turkey
</td>
<td colspan="2">
</td>
</tr>
<tr bgcolor="#EFEFEF" height="3">
<td colspan="10">
</td>
</tr>
<tr bgcolor="#FFFFFF" height="1">
<td colspan="10">
</td>
</tr>
<tr bgcolor="#DDDDDD" height="6">
<td colspan="10">
</td>
</tr>
<tr bgcolor="#FFFFFF" height="1">
<td colspan="10">
</td>
</tr>
<tr bgcolor="#DEE3E7" height="3">
<td colspan="10">
</td>
</tr>
<tr bgcolor="#DEE3E7">
<td>
<a href="free.asp?detail=hide&c_id=4134123">
<img align="absmiddle" border="0" hspace="0" src="pic/bullet.gif" vspace="0"/>
</a>
</td>
<td>
4134123
</td>
<td width="10">
</td>
<td>
26.07.2018 09:00
</td>
<td width="10">
</td>
<td>
Kucuktepe, Van, Turkey
</td>
<td width="10">
-
</td>
<td>
Maltepe, Istanbul, Turkey
</td>
<td colspan="2">
</td>
</tr>
The output I'm after looks like this:
[['4342141', '25.07.2018', '09:00', 'Golbasi Ankara, Turkey', '-', 'Konya Havalimani Turkey', 'free.asp?detail=hide&c_id=4342141'], ['4134123', '26.07.2018', '09:00', 'Kucuktepe, Van, Turkey', '-', 'Maltepe, Istanbul, Turkey', 'free.asp?detail=hide&c_id=4134123']]
Answer 0 (score: 0)
Assuming the variable data holds the HTML text:
from bs4 import BeautifulSoup
from pprint import pprint

soup = BeautifulSoup(data, 'lxml')

rows = []
for tr in soup.select('tr'):
    # Keep only cells that have text and are not the "-" separator.
    cells = [td.text.strip() for td in tr.select('td')]
    row = [cell for cell in cells if cell and cell != '-']
    # Spacer rows produce an empty list and are skipped.
    if row:
        rows.append(row)

pprint(rows, width=120)
This prints:
[['4342141', '25.07.2018 09:00', 'Golbasi Ankara, Turkey', 'Konya Havalimani Turkey'],
 ['4134123', '26.07.2018 09:00', 'Kucuktepe, Van, Turkey', 'Maltepe, Istanbul, Turkey']]
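The desired output in the question also splits the date from the time and appends the link's href to each row. A possible way to get closer to that shape is sketched below. It is only a sketch under some assumptions: the helper names split_if_datetime and parse_rows are made up for illustration, the date cells are assumed to always match the DD.MM.YYYY HH:MM pattern seen in the sample, the built-in 'html.parser' is used instead of 'lxml' so no extra parser is needed, and "-" is dropped as the question text asks (the sample output in the question keeps it). The sample HTML is a shortened copy of one row from the question.

```python
from datetime import datetime

from bs4 import BeautifulSoup

# Shortened sample, modelled on the first row of the HTML in the question.
data = """
<table>
<tr bgcolor="#EFEFEF">
<td><a href="free.asp?detail=hide&c_id=4342141"><img src="pic/bullet.gif"/></a></td>
<td>4342141</td>
<td width="10"></td>
<td>25.07.2018 09:00</td>
<td width="10"></td>
<td>Golbasi Ankara, Turkey</td>
<td width="10">-</td>
<td>Konya Havalimani Turkey</td>
<td colspan="2"></td>
</tr>
<tr bgcolor="#EFEFEF" height="3"><td colspan="10"></td></tr>
</table>
"""


def split_if_datetime(text):
    """Return [date, time] if text looks like 'DD.MM.YYYY HH:MM', else [text]."""
    try:
        datetime.strptime(text, '%d.%m.%Y %H:%M')
    except ValueError:
        return [text]
    return text.split()


def parse_rows(html):
    """Hypothetical helper: one list per data row, with the link href last."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for tr in soup.select('tr'):
        cells = [td.get_text(strip=True) for td in tr.select('td')]
        cells = [cell for cell in cells if cell and cell != '-']
        if not cells:
            continue  # spacer row without any text
        row = []
        for cell in cells:
            row.extend(split_if_datetime(cell))
        link = tr.find('a', href=True)
        if link is not None:
            row.append(link['href'])
        rows.append(row)
    return rows


print(parse_rows(data))
```

If the "-" separator should stay in the output after all, removing the `cell != '-'` condition should be enough.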
To write the rows list to a CSV file, you can use this script:
import csv

with open('data.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(rows)
The data.csv file will then contain:
4342141,25.07.2018 09:00,"Golbasi Ankara, Turkey",Konya Havalimani Turkey
4134123,26.07.2018 09:00,"Kucuktepe, Van, Turkey","Maltepe, Istanbul, Turkey"
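The double quotes around some fields are added by the csv module because those values contain commas; they are not part of the data. A small round-trip check, sketched below, illustrates this: reading the file back with csv.reader should return the original strings without the quotes. The temporary file location is an arbitrary choice for the example, and the rows literal simply repeats the output shown above.

```python
import csv
import os
import tempfile

rows = [
    ['4342141', '25.07.2018 09:00', 'Golbasi Ankara, Turkey', 'Konya Havalimani Turkey'],
    ['4134123', '26.07.2018 09:00', 'Kucuktepe, Van, Turkey', 'Maltepe, Istanbul, Turkey'],
]

# Write to a throwaway directory so the example does not touch real files.
path = os.path.join(tempfile.mkdtemp(), 'data.csv')

with open(path, 'w', newline='', encoding='utf-8') as f:
    csv.writer(f).writerows(rows)

# Read the file back; csv.reader undoes the quoting added by csv.writer.
with open(path, newline='', encoding='utf-8') as f:
    loaded = list(csv.reader(f))

print(loaded == rows)
```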