If I have a nested HTML (unordered) list like the following:
<ul>
<li><a href="Page1_Level1.html">Page1_Level1</a>
<ul>
<li><a href="Page1_Level2.html">Page1_Level2</a>
<ul>
<li><a href="Page1_Level3.html">Page1_Level3</a></li>
</ul>
<ul>
<li><a href="Page2_Level3.html">Page2_Level3</a></li>
</ul>
<ul>
<li><a href="Page3_Level3.html">Page3_Level3</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="Page2_Level1.html">Page2_Level1</a>
<ul>
<li><a href="Page2_Level2.html">Page2_Level2</a></li>
</ul>
</li>
</ul>
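How can I build the corresponding nested list in Python? For example:
["Page1_Level1.html", ["Page1_Level2.html", ["Page1_Level3.html", "Page2_Level3.html", "Page3_Level3.html"]], "Page2_Level1.html", ["Page2_Level2.html"]]
I believe libraries such as Beautiful Soup and HTMLParser have facilities for this, but I haven't been able to figure it out. Thanks for any help or pointers!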
Answer 0: (score: 3)
You can take a recursive approach:
from pprint import pprint

from bs4 import BeautifulSoup

text = """your html goes here"""


def find_li(element):
    # Map each item's href to the (possibly empty) list built from its nested <ul>s.
    return [{li.a['href']: find_li(li)}
            for ul in element('ul', recursive=False)
            for li in ul('li', recursive=False)]


soup = BeautifulSoup(text, 'html.parser')
data = find_li(soup)
pprint(data)
Prints:
[{u'Page1_Level1.html': [{u'Page1_Level2.html': [{u'Page1_Level3.html': []},
{u'Page2_Level3.html': []},
{u'Page3_Level3.html': []}]}]},
{u'Page2_Level1.html': [{u'Page2_Level2.html': []}]}]
FYI, there is a reason I had to use html.parser here rather than another parser.
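The recursive answer returns dictionaries, while the question asks for plain nested lists. A minimal sketch of a variant that yields that shape is below. The function name `nested_hrefs` and the `HTML` constant are made up for illustration, and the code assumes every `<li>` contains an `<a>` with an `href`.

```python
from bs4 import BeautifulSoup

# The sample markup from the question, compacted.
HTML = """
<ul>
<li><a href="Page1_Level1.html">Page1_Level1</a>
<ul>
<li><a href="Page1_Level2.html">Page1_Level2</a>
<ul><li><a href="Page1_Level3.html">Page1_Level3</a></li></ul>
<ul><li><a href="Page2_Level3.html">Page2_Level3</a></li></ul>
<ul><li><a href="Page3_Level3.html">Page3_Level3</a></li></ul>
</li>
</ul>
</li>
<li><a href="Page2_Level1.html">Page2_Level1</a>
<ul><li><a href="Page2_Level2.html">Page2_Level2</a></li></ul>
</li>
</ul>
"""


def nested_hrefs(element):
    """Return hrefs of the direct <ul>/<li> children, each followed by a
    sub-list when that item has nested lists of its own."""
    result = []
    for ul in element.find_all('ul', recursive=False):
        for li in ul.find_all('li', recursive=False):
            # Assumption: each <li> has an <a href=...>; li.a is None otherwise.
            result.append(li.a['href'])
            children = nested_hrefs(li)
            if children:  # skip empty sub-lists for leaf items
                result.append(children)
    return result


# html.parser keeps the fragment as-is; lxml and html5lib wrap it in
# <html><body>, so the top-level recursive=False search would miss the <ul>.
soup = BeautifulSoup(HTML, 'html.parser')
nested = nested_hrefs(soup)
print(nested)
```

Sibling `<ul>` elements inside one `<li>` are merged into a single sub-list, which matches the shape shown in the question.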
Answer 1: (score: 1)
An outline of a possible solution:
# variable 'markup' contains the html string
from bs4 import BeautifulSoup

soup = BeautifulSoup(markup, 'html.parser')
for a in soup.descendants:
    # construct a nested list when going through the descendants
    parent_id = id(a.parent) if a.parent is not None else None
    print(id(a), parent_id, a)
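The outline above leaves the actual list construction open. One possible way to fill it in is sketched below. Note that it swaps in the standard library's `html.parser.HTMLParser` (the "HTML Parser" mentioned in the question) instead of iterating over `soup.descendants`, so it is not the answerer's code. The class name `HrefTreeBuilder` is hypothetical, and the sketch assumes each `<a>` appears before any nested `<ul>` in its `<li>`.

```python
from html.parser import HTMLParser


class HrefTreeBuilder(HTMLParser):
    """Build nested lists of hrefs by tracking which <li> elements are open."""

    def __init__(self):
        super().__init__()
        self.root = []
        # stack[0] is the top-level result; one extra list per open <li>
        # collects whatever is nested inside that item.
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        if tag == 'li':
            self.stack.append([])
        elif tag == 'a' and len(self.stack) > 1:
            href = dict(attrs).get('href')
            if href is not None:
                # The link labels the current <li>, so it belongs to the
                # list one level up, next to the item's siblings.
                self.stack[-2].append(href)

    def handle_endtag(self, tag):
        if tag == 'li' and len(self.stack) > 1:
            children = self.stack.pop()
            if children:  # attach only non-empty sub-lists
                self.stack[-1].append(children)


markup = (
    '<ul>'
    '<li><a href="Page1_Level1.html">Page1_Level1</a>'
    '<ul><li><a href="Page1_Level2.html">Page1_Level2</a>'
    '<ul><li><a href="Page1_Level3.html">Page1_Level3</a></li></ul>'
    '<ul><li><a href="Page2_Level3.html">Page2_Level3</a></li></ul>'
    '<ul><li><a href="Page3_Level3.html">Page3_Level3</a></li></ul>'
    '</li></ul>'
    '</li>'
    '<li><a href="Page2_Level1.html">Page2_Level1</a>'
    '<ul><li><a href="Page2_Level2.html">Page2_Level2</a></li></ul>'
    '</li>'
    '</ul>'
)

builder = HrefTreeBuilder()
builder.feed(markup)
builder.close()
tree = builder.root
print(tree)
```

Because the parser reports tags in document order, a stack of open `<li>` elements is enough to know where each link belongs; no parent lookups are needed.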