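I'm trying to parse information from a website using BeautifulSoup and Python. The HTML is shown below. I'd like my parsed data to look like:
ID Definition
Lysine biosynthesis - Burkholderia pseudomallei 17
...with the rest of the data in similar positions (inside the "pre" tag and outside the "a" tags).
How can I do this?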
<pre>ID Definition
----------------------------------------------------------------------------------------------------
<a href="/kegg-bin/show_pathway?bpm00300">bpm00300</a> Lysine biosynthesis - Burkholderia pseudomallei 17
<a href="/kegg-bin/show_pathway?bpm00330">bpm00330</a> Arginine and proline metabolism - Burkholderia pse
<a href="/kegg-bin/show_pathway?bpm01100">bpm01100</a> Metabolic pathways - Burkholderia pseudomallei 171
<a href="/kegg-bin/show_pathway?bpm01110">bpm01110</a> Biosynthesis of secondary metabolites - Burkholder
</pre>
I tried:
    y = soup.find('pre')  # returns data between <pre> tags. Specific to KEGG
    for a in y:
        z = a.string
This gives me:
ID Definition
----------------------------------------------------------------------------------------------------
Thanks for your help!
Answer 0 (score: 1)
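BeautifulSoup() and its search methods return a hierarchical parse-tree object, not just a string. Iterating over findChildren() on the found node visits only the nested tags (here the "a" elements), so it skips the header lines. Each a.string is then the ID; the definition text is the text node that follows the tag, which you can reach through a.next_sibling: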
    for a in soup.find('pre').findChildren():
        z = a.string
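
To get both columns the question asks for, one option is to pair each "a" tag with the text node right after it. The following is a minimal sketch, not a definitive solution: it assumes the bs4 package with the built-in "html.parser" backend, it embeds a shortened copy of the HTML from the question instead of fetching the page, and the names html, rows, pathway_id and definition are made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the <pre> block from the question (assumption: the real
# page has the same layout, one <a> tag followed by plain text per row).
html = """<pre>ID Definition
----------------------------------------------------------------------------------------------------
<a href="/kegg-bin/show_pathway?bpm00300">bpm00300</a> Lysine biosynthesis - Burkholderia pseudomallei 17
<a href="/kegg-bin/show_pathway?bpm00330">bpm00330</a> Arginine and proline metabolism - Burkholderia pse
</pre>"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for a in soup.find("pre").find_all("a"):
    # The text inside the <a> tag is the ID.
    pathway_id = a.get_text(strip=True)
    # The definition is the text node immediately after the tag.
    sibling = a.next_sibling
    definition = sibling.strip() if isinstance(sibling, str) else ""
    rows.append((pathway_id, definition))

for pathway_id, definition in rows:
    print(pathway_id, definition)
```

The isinstance check guards against an "a" tag that is followed by another tag or by nothing at all, in which case the definition is left empty.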