Parsing a URL to generate a delimited file

Date: 2015-02-16 22:02:20

Tags: python parsing

I want to fetch a URL using plain urllib2 and .read();

part of the output looks like the HTML lines below
<hr>
<h2>Cluster Summary (Heap Size is 555 MB/26.6 GB)</h2>
< table border="1" cellpadding="5" cellspacing="0">
< tr><th>Running Map Tasks</th><th>Running Reduce Tasks</th><th>Total Submissions</th><th>Nodes</th><th>Occupied Map Slots</th><th>Occupied Reduce Slots</th><th>Reserved Map Slots</th><th>Reserved Reduce Slots</th><th>Map Task Capacity</th><th>Reduce Task Capacity</th><th>Avg. Tasks/Node</th><th>Blacklisted Nodes</th><th>Excluded Nodes</th><th>MapTask Prefetch Capacity</th></tr>
< tr><td>1</td><td>0</td><td>5576</td><td><a href="machines.jsp?type=active">8</a></td><td>1</td><td>0</td><td>0</td><td>0</td><td>352</td><td>128</td><td>60.00</td><td><a href="machines.jsp?type=blacklisted">0</a></td><td><a href="machines.jsp?type=excluded">0</a></td><td>0</td></tr></table>
< br>
< hr>

Now, to extract meaningful information from the HTML above, I am trying to use HTMLParser (after some research, the other options such as BeautifulSoup, lxml and pyquery are not available in my environment, and I don't have sudo to install them).

The expected output is a file with a delimiter, say a comma:

Running Map Tasks,31
Running Reduce Tasks,0
Total Submissions,5587
Nodes,8
Occupied Map Slots,31
Occupied Reduce Slots,0
Reserved Map Slots,0
Reserved Reduce Slots,0
Map Task Capacity,352
Reduce Task Capacity,128
Avg. Tasks/Node,60.00
Blacklisted Nodes,0
Excluded Nodes,0
MapTask Prefetch Capacity,0

Please advise.

***********UPDATE******** If I go with another option such as BeautifulSoup, does it allow restricting the search to a specific block, e.g. the Cluster Summary, since my HTML will have different sections?

1 answer:

Answer 0 (score: 0)

If you really have to ("want" to) do this manually, you can use string operations and/or regular expressions. Assuming you have found the relevant lines by iterating over the HTML:

import re

line1 = "< tr><th>Running Map Tasks</th><th>Running Reduce Tasks</th><th>Total Submissions</th><th>Nodes</th><th>Occupied Map Slots</th><th>Occupied Reduce Slots</th><th>Reserved Map Slots</th><th>Reserved Reduce Slots</th><th>Map Task Capacity</th><th>Reduce Task Capacity</th><th>Avg. Tasks/Node</th><th>Blacklisted Nodes</th><th>Excluded Nodes</th><th>MapTask Prefetch Capacity</th></tr>"

line2 = '< tr><td>1</td><td>0</td><td>5576</td><td><a href="machines.jsp?type=active">8</a></td><td>1</td><td>0</td><td>0</td><td>0</td><td>352</td><td>128</td><td>60.00</td><td><a href="machines.jsp?type=blacklisted">0</a></td><td><a href="machines.jsp?type=excluded">0</a></td><td>0</td></tr></table>'


# strip the row tags, then split on the closing header tag
headers = (line1.replace('< tr>', '').
  replace("</tr>", "").
  replace("<th>", "").
  split("</th>"))

matches = re.findall("<td>(.+?)</td>", line2)  # raw contents of each <td>
clean_matches = []
for m in matches:
    if m.startswith('<'):
        # the cell wraps its value in a tag (e.g. <a href=...>8</a>);
        # grab the first text node between '>' and '<'
        clean_matches.append(re.search('>(.+?)<', m).group(1))
    else:
        clean_matches.append(m)

for h, m in zip(headers, clean_matches):
    print("{}: {}".format(h, m))

For the first line I use .replace() and .split() to strip the tags and split at the right places. For the "data" line I use regular expressions to get the contents of each <td>. If a cell's content starts with a tag, the regex search extracts the first text node inside the <td>. As always, this code is very fragile and will break easily if the server formats its output even slightly differently.
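Since the question says only the standard library is available, the same extraction could also be sketched with the built-in HTMLParser (the html.parser module in Python 3, the HTMLParser module in Python 2). The SummaryParser class and the shortened sample HTML below are illustrative, not part of the original post:

```python
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser

class SummaryParser(HTMLParser):
    """Collects the text of every <th> and <td> cell it encounters."""

    def __init__(self):
        super().__init__()
        self.headers = []
        self.values = []
        self._target = None  # list that the next text chunk belongs to

    def handle_starttag(self, tag, attrs):
        if tag == "th":
            self._target = self.headers
        elif tag == "td":
            self._target = self.values
        # other tags (e.g. <a> inside a cell) keep the current target,
        # so link text still lands in the right cell

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._target = None

    def handle_data(self, data):
        if self._target is not None and data.strip():
            self._target.append(data.strip())

# shortened sample of the question's table
sample = ('<tr><th>Nodes</th><th>Running Map Tasks</th></tr>'
          '<tr><td><a href="machines.jsp?type=active">8</a></td><td>1</td></tr>')

p = SummaryParser()
p.feed(sample)
for h, v in zip(p.headers, p.values):
    print("{0},{1}".format(h, v))
```

This avoids the brittleness of matching the exact `< tr>` spelling, since the parser tokenizes the tags itself.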

If you get a zero length field name error, you are on an old Python version. Assuming you cannot change that, you will have to rewrite the print call to use {0}: {1}, or use the statement form print "%s: %s" % (h, m).

See also the python documentation on string formatting.
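To get the comma-delimited file the question actually asks for, rather than console output, the header/value pairs could be written with the standard csv module. This is a sketch assuming headers and clean_matches were built as above (shortened stand-in values here); the file name cluster_summary.csv is made up:

```python
import csv

# stand-ins for the lists built by the string/regex approach above
headers = ["Running Map Tasks", "Nodes"]
clean_matches = ["1", "8"]

# file name is illustrative; newline="" is the Python 3 csv idiom
# (on Python 2 you would open the file with mode "wb" instead)
with open("cluster_summary.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for row in zip(headers, clean_matches):
        writer.writerow(row)
```

Using csv.writer instead of joining with "," by hand also handles values that themselves contain commas, such as a header like "Avg. Tasks/Node, total".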