How to use Python

Time: 2016-02-13 17:26:51

Tags: python html parsing

I am writing a Python script to parse an HTML file that contains a table. Here is a sample of the file I want to parse:

<table border="0" cellspacing="1" cellpadding="0" width="3080">
<tr>
<th width="50"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 1</font></small></th>
<th width="130" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 2</font></small></th>
<th width="60"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 3</font></small></th>
<th width="60"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 4</font></small></th>
<th width="60"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 5</font></small></th>
<th width="60"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 6</font></small></th>
<th width="60"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 7</font></small></th>
<th width="60"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 8</font></small></th>
<th width="60"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 9</font></small></th>
<th width="60"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 10</font></small></th>
<th width="60"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 11</font></small></th>
<th width="60"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 12</font></small></th>
<th width="60"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 13</font></small></th>
<th width="60"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 14</font></small></th>
<th width="60"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 15</font></small></th>
<th width="60"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 16</font></small></th>
<th width="60"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 17</font></small></th>
<th width="80"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 18</font></small></th>
<th width="80" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 19</font></small></th>
<th width="95" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 20</font></small></th>
<th width="95" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 21</font></small></th>
<th width="80" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 22</font></small></th>
<th width="80" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 23</font></small></th>
<th width="80" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 24</font></small></th>
<th width="80" height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 25</font></small></th>
<th width="80"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 26</font></small></th>
<th width="80"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 27</font></small></th>
<th width="80"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 28</font></small></th>
<th width="80"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 29</font></small></th>
<th width="80"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 30</font></small></th>
<th width="80"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 31</font></small></th>
<th width="80"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 32</font></small></th>
<th width="80"  height="20" bgcolor="#A55592"><small><font face="Arial" color="#FFFFFF">Header 33</font></small></th>
</tr>

<tr bgcolor=#D5BCCD>
<td rowspan="5">1</td>
<td rowspan="5">01/02/2016</td>
<td rowspan="5">18</td>
<td rowspan="5">20</td>
<td rowspan="5">25</td>
<td rowspan="5">23</td>
<td rowspan="5">10</td>
<td rowspan="5">11</td>
<td rowspan="5">24</td>
<td rowspan="5">14</td>
<td rowspan="5">06</td>
<td rowspan="5">02</td>
<td rowspan="5">13</td>
<td rowspan="5">09</td>
<td rowspan="5">05</td>
<td rowspan="5">16</td>
<td rowspan="5">03</td>
<td rowspan="5">Next value indicates number of rows to skip</td>
<td rowspan="5">5</td>
<td></td>
<td>XA</td>
<td rowspan="5">15</td>
<td rowspan="5">46</td>
<td rowspan="5">48</td>
<td rowspan="5">25</td>
<td rowspan="5">49</td>
<td rowspan="5">68</td>
<td rowspan="5">10</td>
<td rowspan="5">40</td>
<td rowspan="5">20</td>
<td rowspan="5">000</td>
<td rowspan="5">000</td>
<td rowspan="5">000</td>
</tr>
<tr bgcolor=#D5BCCD><td></td><td>XB</td></tr>
<tr bgcolor=#D5BCCD><td></td><td>XC</td></tr>
<tr bgcolor=#D5BCCD><td></td><td>XD</td></tr>
<tr bgcolor=#D5BCCD><td></td><td>XE</td></tr>

<tr>
<td rowspan="1">2</td>
<td rowspan="1">02/02/2016</td>
<td rowspan="1">23</td>
<td rowspan="1">15</td>
<td rowspan="1">05</td>
<td rowspan="1">04</td>
<td rowspan="1">12</td>
<td rowspan="1">16</td>
<td rowspan="1">20</td>
<td rowspan="1">06</td>
<td rowspan="1">11</td>
<td rowspan="1">19</td>
<td rowspan="1">24</td>
<td rowspan="1">01</td>
<td rowspan="1">09</td>
<td rowspan="1">13</td>
<td rowspan="1">07</td>
<td rowspan="1">Next value indicates number of rows to skip</td>
<td rowspan="1">1</td>
<td></td>
<td>XA</td>
<td rowspan="1">184</td>
<td rowspan="1">6232</td>
<td rowspan="1">81252</td>
<td rowspan="1">478188</td>
<td rowspan="1">596.323,70</td>
<td rowspan="1">1.388,95</td>
<td rowspan="1">10,00</td>
<td rowspan="1">4,00</td>
<td rowspan="1">2,00</td>
<td rowspan="1">0,00</td>
<td rowspan="1">0,00</td>
<td rowspan="1">0,00</td>
</tr>

<tr bgcolor=#D5BCCD>
<td rowspan="5">3</td>
<td rowspan="5">04/02/2016</td>
<td rowspan="5">18</td>
<td rowspan="5">20</td>
<td rowspan="5">25</td>
<td rowspan="5">23</td>
<td rowspan="5">10</td>
<td rowspan="5">11</td>
<td rowspan="5">24</td>
<td rowspan="5">14</td>
<td rowspan="5">06</td>
<td rowspan="5">02</td>
<td rowspan="5">13</td>
<td rowspan="5">09</td>
<td rowspan="5">05</td>
<td rowspan="5">16</td>
<td rowspan="5">03</td>
<td rowspan="5">Next value indicates number of rows to skip</td>
<td rowspan="5">2</td>
<td></td>
<td>XA</td>
<td rowspan="5">15</td>
<td rowspan="5">46</td>
<td rowspan="5">48</td>
<td rowspan="5">25</td>
<td rowspan="5">49</td>
<td rowspan="5">68</td>
<td rowspan="5">10</td>
<td rowspan="5">40</td>
<td rowspan="5">20</td>
<td rowspan="5">000</td>
<td rowspan="5">000</td>
<td rowspan="5">000</td>
</tr>
<tr bgcolor=#D5BCCD><td></td><td>XB</td></tr>
</table>

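Here is the script I wrote to parse it:

import csv
from bs4 import BeautifulSoup

# Parse the data (Python 2: file() opens result_file for reading)
soup = BeautifulSoup(file(result_file))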
table = soup.find('table')

# The first tr contains the field names.
headings = [th.get_text() for th in table.find('tr').find_all('th')]
important_headings = headings[:19]

all_tr = table.find_all('tr')
count = 1
data_sets = []
while count < len(all_tr):
    date_results = all_tr[count].find_all('td')
    skip_rows = int(date_results[18].get_text())
    count += skip_rows
    data_set = zip(important_headings, (td.get_text() for td in date_results[:19]))
    data_sets.append(data_set)

# Write the csv file
with open(csv_file, 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(data_sets)

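It works, but it takes about 30 ms to parse 7 rows. The table in the real HTML file has about 1300 rows, so parsing it takes a while, if it finishes at all: the process usually crashes before completing.

How can I make it perform better?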

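Update (profiling information):

This is the time spent in each part of the algorithm:

  • Finding the table: took 32.11 seconds to finish.
  • Finding all tr in the file: took 103.414059 ms to finish.

Inside the while loop

  • Finding all td inside a tr: took 0.142097 ms to finish.
  • Skipping rows: took 0.020027 ms to finish.
  • Zipping: took 0.100851 ms to finish.
  • Appending: took 0.001907 ms to finish.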

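3 answers:

Answer 0: (score: 0)

Try the Python bindings for a native C/C++ parsing library such as libxml (this obviously means taking a step back from the convenience of Beautiful Soup).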

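Answer 1: (score: 0)

Where does it crash? My guess is at soup = .... If so, you may be better off implementing a SAX parser instead of building the whole DOM in BeautifulSoup. Given the strict structure of the HTML source, you might even consider hand-coding against that structure and doing something like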
for line in ...:
    if line.startswith("<td"):
        ...
        td = line.split('">')[1].split('</')[0]
    ...

But this depends heavily on how much the HTML page changes over time. "Real" parsing may be the better choice.
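A minimal sketch of the event-driven idea, assuming Python 3 and the standard-library html.parser module (not a true SAX API, but it also reports tags and text as events without building a tree). The names RowCollector and first_rows and the three-column sample are made up for illustration. The only convention taken from the question is that one cell of a group's first row says how many tr to advance.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of every <th>/<td> cell, one list per <tr>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the <tr> currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. newlines between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def first_rows(html, skip_index=18, keep=19):
    """Return the header plus the first row of each rowspan group."""
    parser = RowCollector()
    parser.feed(html)
    header, body = parser.rows[0][:keep], parser.rows[1:]
    result, i = [header], 0
    while i < len(body):
        cells = body[i]
        result.append(cells[:keep])
        # Advance by the row count stored in the group; at least 1 to avoid looping forever.
        i += max(int(cells[skip_index]), 1)
    return result


# Hypothetical three-column sample; the second column holds the skip count.
sample = (
    "<table><tr><th>Id</th><th>Skip</th><th>Code</th></tr>"
    "<tr><td rowspan='2'>1</td><td rowspan='2'>2</td><td>XA</td></tr>"
    "<tr><td>XB</td></tr>"
    "<tr><td>2</td><td>1</td><td>XA</td></tr></table>"
)
rows = first_rows(sample, skip_index=1, keep=2)
```

For the table in the question the defaults (skip_index=18, keep=19) match the indices used in the original script. Whether this is actually faster than bs4 on the real file would have to be measured.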

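Update:

After fixing the rowspan problem in the last row, I quickly generated 1300 display rows (len(all_tr) == 3465) as 433 copies of the demo data given by the OP. The resulting HTML file is 1.2 MiB.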
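The row counts can be checked with a small generator. This is only a sketch under assumptions, not the generator actually used: it assumes Python 3, reduces each row to three hypothetical columns, and assumes the fixed last group spans 2 tr, so one copy of the demo data has groups of 5, 1 and 2 tr.

```python
def make_group(index, codes):
    """Build one display row that spans len(codes) <tr> elements."""
    span = len(codes)
    first = (
        '<tr><td rowspan="{s}">{i}</td><td rowspan="{s}">{s}</td>'
        "<td>{c}</td></tr>".format(s=span, i=index, c=codes[0])
    )
    rest = "".join("<tr><td>{}</td></tr>".format(c) for c in codes[1:])
    return first + rest


def make_table(copies):
    """Repeat a three-group pattern (5, 1 and 2 <tr> each) `copies` times."""
    pattern = (["XA", "XB", "XC", "XD", "XE"], ["XA"], ["XA", "XB"])
    parts = ["<table><tr><th>Id</th><th>Skip</th><th>Code</th></tr>"]
    index = 0
    for _ in range(copies):
        for codes in pattern:
            index += 1
            parts.append(make_group(index, codes))
    parts.append("</table>")
    return "".join(parts)


html = make_table(433)
tr_count = html.count("<tr>")  # header row plus 433 * (5 + 1 + 2) body rows
display_rows = 433 * 3         # three display rows per copy
```

Here tr_count is 3465, matching len(all_tr) above, and display_rows is 1299, i.e. roughly the 1300 rows of the real table.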

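On my machine (i7 quad-core Lenovo X230, Ubuntu 14.04) the script runs in 2.6 s overall, of which bs4 accounts for 2.3 s, and it consumes 131 MiB of memory, of which bs4 accounts for 127 MiB. A Raspberry Pi 2B needs 64 s @ 700 MHz, by the way. I used Python 2.7.6 and bs4 4.2.1 with the default XML parser. Run time was measured with cProfile and memory consumption with memory_profiler. As for zip(), there is already a comment about it...

It would be interesting to learn in which environment the OP's script dies.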
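For anyone who wants to repeat this kind of measurement, here is a rough sketch, assuming Python 3. It uses cProfile for run time and, instead of memory_profiler, the standard-library tracemalloc for peak memory (a different tool, so the numbers are not directly comparable). The functions parse and measure are hypothetical stand-ins, not the script from the question.

```python
import cProfile
import io
import pstats
import tracemalloc


def parse(html):
    """Hypothetical stand-in for the script being measured."""
    return [line.split(",") for line in html.splitlines()]


def measure(func, *args):
    """Run func under cProfile and tracemalloc; return (result, report, peak_bytes)."""
    profiler = cProfile.Profile()
    tracemalloc.start()
    profiler.enable()
    try:
        result = func(*args)
    finally:
        profiler.disable()
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    # Render the five entries with the highest cumulative time into a string.
    stream = io.StringIO()
    pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
    return result, stream.getvalue(), peak


result, report, peak = measure(parse, "a,b\nc,d\n" * 1000)
```

Printing report shows where the time goes, and peak is the highest traced allocation in bytes. Replacing parse with the real parsing function would show whether the time is spent in building the soup, as suspected above.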

Answer 2: (score: 0)

Try pandas. The following script writes output.csv:

from __future__ import print_function
import pandas as pd

with open('data.html', 'r') as f:
    data = f.read()

# parse 1st HTML table to pandas.DataFrame
df = pd.read_html(data, header=0)[0]

# drop unimportant columns
df = df.drop(df.columns[[1, 18]], axis=1).dropna(how='all')

# write the CSV file
df.to_csv('output.csv', index=False)

# print(df)

Please let us know how fast or slow it is on your real HTML file.