I'm writing a Python script to parse an HTML file that contains a table. Here is a sample of the file I want to parse:
<table border="0" cellspacing="1" cellpadding="0" width="3080">
<tr>
<th width="50" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 1</font></small></th>
<th width="130" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 2</font></small></th>
<th width="60" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 3</font></small></th>
<th width="60" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 4</font></small></th>
<th width="60" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 5</font></small></th>
<th width="60" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 6</font></small></th>
<th width="60" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 7</font></small></th>
<th width="60" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 8</font></small></th>
<th width="60" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 9</font></small></th>
<th width="60" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 10</font></small></th>
<th width="60" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 11</font></small></th>
<th width="60" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 12</font></small></th>
<th width="60" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 13</font></small></th>
<th width="60" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 14</font></small></th>
<th width="60" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 15</font></small></th>
<th width="60" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 16</font></small></th>
<th width="60" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 17</font></small></th>
<th width="80" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 18</font></small></th>
<th width="80" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 19</font></small></th>
<th width="95" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 20</font></small></th>
<th width="95" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 21</font></small></th>
<th width="80" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 22</font></small></th>
<th width="80" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 23</font></small></th>
<th width="80" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 24</font></small></th>
<th width="80" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 25</font></small></th>
<th width="80" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 26</font></small></th>
<th width="80" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 27</font></small></th>
<th width="80" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 28</font></small></th>
<th width="80" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 29</font></small></th>
<th width="80" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 30</font></small></th>
<th width="80" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 31</font></small></th>
<th width="80" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 32</font></small></th>
<th width="80" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 33</font></small></th>
</tr>
<tr bgcolor=#D5BCCD>
<td rowspan="5">1</td>
<td rowspan="5">01/02/2016</td>
<td rowspan="5">18</td>
<td rowspan="5">20</td>
<td rowspan="5">25</td>
<td rowspan="5">23</td>
<td rowspan="5">10</td>
<td rowspan="5">11</td>
<td rowspan="5">24</td>
<td rowspan="5">14</td>
<td rowspan="5">06</td>
<td rowspan="5">02</td>
<td rowspan="5">13</td>
<td rowspan="5">09</td>
<td rowspan="5">05</td>
<td rowspan="5">16</td>
<td rowspan="5">03</td>
<td rowspan="5">Next value indicates number of rows to skip</td>
<td rowspan="5">5</td>
<td></td>
<td>XA</td>
<td rowspan="5">15</td>
<td rowspan="5">46</td>
<td rowspan="5">48</td>
<td rowspan="5">25</td>
<td rowspan="5">49</td>
<td rowspan="5">68</td>
<td rowspan="5">10</td>
<td rowspan="5">40</td>
<td rowspan="5">20</td>
<td rowspan="5">000</td>
<td rowspan="5">000</td>
<td rowspan="5">000</td>
</tr>
<tr bgcolor=#D5BCCD><td></td><td>XB</td></tr>
<tr bgcolor=#D5BCCD><td></td><td>XC</td></tr>
<tr bgcolor=#D5BCCD><td></td><td>XD</td></tr>
<tr bgcolor=#D5BCCD><td></td><td>XE</td></tr>
<tr>
<td rowspan="1">2</td>
<td rowspan="1">02/02/2016</td>
<td rowspan="1">23</td>
<td rowspan="1">15</td>
<td rowspan="1">05</td>
<td rowspan="1">04</td>
<td rowspan="1">12</td>
<td rowspan="1">16</td>
<td rowspan="1">20</td>
<td rowspan="1">06</td>
<td rowspan="1">11</td>
<td rowspan="1">19</td>
<td rowspan="1">24</td>
<td rowspan="1">01</td>
<td rowspan="1">09</td>
<td rowspan="1">13</td>
<td rowspan="1">07</td>
<td rowspan="1">Next value indicates number of rows to skip</td>
<td rowspan="1">1</td>
<td></td>
<td>XA</td>
<td rowspan="1">184</td>
<td rowspan="1">6232</td>
<td rowspan="1">81252</td>
<td rowspan="1">478188</td>
<td rowspan="1">596.323,70</td>
<td rowspan="1">1.388,95</td>
<td rowspan="1">10,00</td>
<td rowspan="1">4,00</td>
<td rowspan="1">2,00</td>
<td rowspan="1">0,00</td>
<td rowspan="1">0,00</td>
<td rowspan="1">0,00</td>
</tr>
<tr bgcolor=#D5BCCD>
<td rowspan="5">3</td>
<td rowspan="5">04/02/2016</td>
<td rowspan="5">18</td>
<td rowspan="5">20</td>
<td rowspan="5">25</td>
<td rowspan="5">23</td>
<td rowspan="5">10</td>
<td rowspan="5">11</td>
<td rowspan="5">24</td>
<td rowspan="5">14</td>
<td rowspan="5">06</td>
<td rowspan="5">02</td>
<td rowspan="5">13</td>
<td rowspan="5">09</td>
<td rowspan="5">05</td>
<td rowspan="5">16</td>
<td rowspan="5">03</td>
<td rowspan="5">Next value indicates number of rows to skip</td>
<td rowspan="5">2</td>
<td></td>
<td>XA</td>
<td rowspan="5">15</td>
<td rowspan="5">46</td>
<td rowspan="5">48</td>
<td rowspan="5">25</td>
<td rowspan="5">49</td>
<td rowspan="5">68</td>
<td rowspan="5">10</td>
<td rowspan="5">40</td>
<td rowspan="5">20</td>
<td rowspan="5">000</td>
<td rowspan="5">000</td>
<td rowspan="5">000</td>
</tr>
<tr bgcolor=#D5BCCD><td></td><td>XB</td></tr>
</table>
This is the script I wrote to parse it:
import csv

from bs4 import BeautifulSoup

# Parse the data (Python 2: file() opens result_file for reading)
soup = BeautifulSoup(file(result_file))
table = soup.find('table')
# The first tr contains the field names.
headings = [th.get_text() for th in table.find('tr').find_all('th')]
important_headings = headings[:19]
all_tr = table.find_all('tr')
count = 1
data_sets = []
while count < len(all_tr):
    date_results = all_tr[count].find_all('td')
    # Cell 18 holds the number of rows that belong to this entry.
    skip_rows = int(date_results[18].get_text())
    count += skip_rows
    data_set = zip(important_headings, (td.get_text() for td in date_results[:19]))
    data_sets.append(data_set)
# Write the csv file
with open(csv_file, 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(data_sets)
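It works, but parsing 7 rows takes about 30 ms. The table in the real HTML file has about 1300 rows, so parsing it takes a while. If it finishes at all, that is, because the process usually crashes before it completes.
How can I make it perform better?
Update (profiling info):
This is the time spent in each part of the algorithm:
The while loop part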
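Answer 0 (score: 0)
Try the Python bindings for a native C/C++ parsing library such as libxml. (This obviously means taking a step back from the convenience of Beautiful Soup.)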
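Answer 1 (score: 0)
Where does it crash? My guess is at soup = .... If so, you would probably be better off implementing a SAX parser than building the whole DOM in BeautifulSoup. Given the rigid structure of the HTML source, you could even consider hard-coding that structure by hand and doing something like
for line in ...:
    if line.startswith("<td"):
        ...
        td = line.split('">')[1].split('</')[0]
        ...
But that depends heavily on how much the HTML page changes over time. "Real" parsing is probably better.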
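To make the event-driven idea concrete, here is a minimal sketch that uses the standard library's html.parser.HTMLParser as a stand-in for a SAX-style parser. It assumes Python 3; the names TableRowParser, parse_rows and sample are made up for illustration and the sample markup is a cut-down version of the OP's table. It collects the cell text of every row as the tags stream by, without ever building a tree.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the cell texts of each <tr> without building a DOM tree."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the <tr> currently being read
        self._cell = None  # text fragments of the current <td>/<th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text inside nested tags such as <small><font> also ends up here.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_rows(html_text):
    """Feed the markup to the parser and return the list of rows."""
    parser = TableRowParser()
    parser.feed(html_text)
    parser.close()
    return parser.rows


# Hypothetical, shortened sample in the style of the OP's file.
sample = (
    '<table><tr><th><small><font face="Arial">Header 1</font></small></th>'
    '<th>Header 2</th></tr>'
    '<tr bgcolor=#D5BCCD><td rowspan="2">1</td><td>XA</td></tr>'
    '<tr><td></td><td>XB</td></tr></table>'
)
rows = parse_rows(sample)
print(rows)
```

With the full table, the rows that start a new entry are the long ones, so keeping only rows with at least 19 cells and slicing them with [:19] would be one way to get the same columns as the OP's script. I have not timed this against the real file.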
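Update:
After fixing the problem with the rowspan in the last row, I quickly generated 1300 displayed rows (len(all_tr) == 3465) as 433 copies of the demo data given by the OP. The HTML file is then 1.2 MiB in size.
On my machine (i7 4-core Lenovo X230, Ubuntu 14.04) the script takes 2.6 s overall, 2.3 s of which is spent in bs4, and it consumes 131 MiB of memory, 127 MiB of which is used by bs4. A Raspberry Pi 2B needs 64 s @ 700 MHz, by the way. I used Python 2.7.6 and bs4 4.2.1 with the default XML parser. Run time was measured with cProfile and memory consumption with memory_profiler. As for zip(), we already have a comment on that...
It would be interesting to learn about the environment in which the OP's script dies.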
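For anyone who wants to repeat that kind of measurement, here is a rough sketch of how a run time report can be obtained with cProfile and pstats. It assumes Python 3. build_test_table, count_rows and profile_call are hypothetical helpers, the test table is a tiny made-up row group rather than the OP's data, and count_rows is only a placeholder for the real parsing step.

```python
import cProfile
import io
import pstats


def build_test_table(copies):
    """Return an HTML table made of `copies` repetitions of a small row group."""
    group = (
        '<tr><td rowspan="2">1</td><td>XA</td></tr>'
        '<tr><td>XB</td></tr>'
    )
    header = "<table><tr><th>Header 1</th><th>Header 2</th></tr>"
    return header + group * copies + "</table>"


def count_rows(html_text):
    """Placeholder for the real parsing step: count the <tr> opening tags."""
    return html_text.count("<tr>")


def profile_call(func, *args):
    """Run func(*args) under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(10)  # ten most expensive entries
    return result, stream.getvalue()


html_text = build_test_table(433)
n_rows, report = profile_call(count_rows, html_text)
print(n_rows)
print(report)
```

Swapping count_rows for the actual parsing function gives a per-function breakdown, which is how one would see that most of the time goes into the parser itself.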
Answer 2 (score: 0)
Try pandas:
from __future__ import print_function
import pandas as pd
with open('data.html', 'r') as f:
    data = f.read()
# parse 1st HTML table to pandas.DataFrame
df = pd.read_html(data, header=0)[0]
# drop unimportant columns
df = df.drop(df.columns[[1, 18]], axis=1).dropna(how='all')
# write the CSV file
df.to_csv('output.csv', index=False)
# print(df)
Please let us know how fast or slow it is on your real HTML file.