I want to parse a URL using basic urllib2 and .read(); part of the output looks like the HTML lines below:
<hr>
<h2>Cluster Summary (Heap Size is 555 MB/26.6 GB)</h2>
<table border="1" cellpadding="5" cellspacing="0">
<tr><th>Running Map Tasks</th><th>Running Reduce Tasks</th><th>Total Submissions</th><th>Nodes</th><th>Occupied Map Slots</th><th>Occupied Reduce Slots</th><th>Reserved Map Slots</th><th>Reserved Reduce Slots</th><th>Map Task Capacity</th><th>Reduce Task Capacity</th><th>Avg. Tasks/Node</th><th>Blacklisted Nodes</th><th>Excluded Nodes</th><th>MapTask Prefetch Capacity</th></tr>
<tr><td>1</td><td>0</td><td>5576</td><td><a href="machines.jsp?type=active">8</a></td><td>1</td><td>0</td><td>0</td><td>0</td><td>352</td><td>128</td><td>60.00</td><td><a href="machines.jsp?type=blacklisted">0</a></td><td><a href="machines.jsp?type=excluded">0</a></td><td>0</td></tr></table>
<br>
<hr>
Now, to extract meaningful information from the HTML above, I am trying to use HTMLParser (after some research, the other options such as BeautifulSoup, lxml and pyquery are not available in my environment, and I don't have sudo to install them).
The expected output is a file with a delimiter, say a comma:
Running Map Tasks,31
Running Reduce Tasks,0
Total Submissions,5587
Nodes,8
Occupied Map Slots,31
Occupied Reduce Slots,0
Reserved Map Slots,0
Reserved Reduce Slots,0
Map Task Capacity,352
Reduce Task Capacity,128
Avg. Tasks/Node,60.00
Blacklisted Nodes,0
Excluded Nodes,0
MapTask Prefetch Capacity,0
Please advise.
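For reference, a minimal sketch of what an HTMLParser-based approach might look like (assuming Python 2, since urllib2 is used; the TableParser class and the html variable holding the page source are illustrative, not tested against the real page):

from HTMLParser import HTMLParser

class TableParser(HTMLParser):
    # Collect the text of <th> and <td> cells in document order.
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_cell = False
        self.current_tag = None
        self.headers = []
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag in ('th', 'td'):
            self.in_cell = True
            self.current_tag = tag

    def handle_endtag(self, tag):
        if tag in ('th', 'td'):
            self.in_cell = False
            self.current_tag = None

    def handle_data(self, data):
        text = data.strip()
        if self.in_cell and text:
            # Text inside nested tags (e.g. the <a> links) also arrives here
            # while the surrounding cell is still open.
            if self.current_tag == 'th':
                self.headers.append(text)
            else:
                self.values.append(text)

parser = TableParser()
parser.feed(html)  # html: the string returned by urllib2 ... .read() (assumed)
for h, v in zip(parser.headers, parser.values):
    print "%s,%s" % (h, v)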
*********** UPDATE *********** If I go with another option such as BeautifulSoup, does it allow restricting the search to a specific block, e.g. the Cluster Summary, since my HTML will contain several different sections?
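For reference, if BeautifulSoup were available, restricting the search to the table that follows the "Cluster Summary" heading could look roughly like this (find, find_next, find_all and get_text are standard BeautifulSoup 4 calls; the html variable holding the page source is assumed):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

# Locate the <h2> whose text mentions "Cluster Summary", then take the table right after it.
heading = soup.find("h2", text=lambda t: t and "Cluster Summary" in t)
table = heading.find_next("table")

headers = [th.get_text(strip=True) for th in table.find_all("th")]
values = [td.get_text(strip=True) for td in table.find_all("td")]
for h, v in zip(headers, values):
    print "%s,%s" % (h, v)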
Answer 0 (score: 0)
If you really must (or "want to") do this manually, you can use string operations and/or regular expressions. Assuming you have found the relevant lines by iterating over the HTML:
import re

line1 = "<tr><th>Running Map Tasks</th><th>Running Reduce Tasks</th><th>Total Submissions</th><th>Nodes</th><th>Occupied Map Slots</th><th>Occupied Reduce Slots</th><th>Reserved Map Slots</th><th>Reserved Reduce Slots</th><th>Map Task Capacity</th><th>Reduce Task Capacity</th><th>Avg. Tasks/Node</th><th>Blacklisted Nodes</th><th>Excluded Nodes</th><th>MapTask Prefetch Capacity</th></tr>"
line2 = '<tr><td>1</td><td>0</td><td>5576</td><td><a href="machines.jsp?type=active">8</a></td><td>1</td><td>0</td><td>0</td><td>0</td><td>352</td><td>128</td><td>60.00</td><td><a href="machines.jsp?type=blacklisted">0</a></td><td><a href="machines.jsp?type=excluded">0</a></td><td>0</td></tr></table>'

# Strip the row markup from the header line and split on the closing </th> tags.
headers = (line1.replace('<tr>', '').
           replace("</tr>", "").
           replace("<th>", "").
           split("</th>"))

# Grab the contents of every <td> in the data line.
matches = re.findall("<td>(.+?)</td>", line2)

clean_matches = []
for m in matches:
    if m.startswith('<'):
        # The cell wraps its value in markup (an <a> tag); keep only the text node.
        clean_matches.append(re.search('>(.+?)<', m).group(1))
    else:
        clean_matches.append(m)

for h, m in zip(headers, clean_matches):
    print("{}: {}".format(h, m))
For the first line I use .replace() and .split() to strip the tags and split at the right places. For the "data" line I use regular expressions to grab the contents of each <td>. If a cell's content itself starts with a tag, the regex then searches for the first text node inside that <td>. As always, this code is quite fragile and will break easily if the server formats its output even slightly differently.
If you get a zero length field name error, it means you are running an old version of Python. Assuming you cannot change that, you will have to change the print function/statement to use {0}: {1} or print "%s: %s" % (h, m).
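For example, either variant works on Python 2.6 (illustrative, reusing headers and clean_matches from the code above):

for h, m in zip(headers, clean_matches):
    # str.format on Python 2.6 needs explicit field indices
    print("{0}: {1}".format(h, m))
    # or use the classic print statement with % formatting:
    # print "%s: %s" % (h, m)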