I am trying to convert a structure like this (some nested XML/HTML):
<div>a comment
<div>an answer</div>
<div>an answer
<div>a reply</div>
...
</div>
...
</div>
...
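Clarification: it could also be formatted as <div>a comment<div>an answer</div>
or in any other way (not pretty-printed, etc.),
and there are multiple nodes at different depths.
I want to convert it to the corresponding list structure with a parent <ul> tag (i.e. a normal HTML list):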
<ul>
<li>1
<ul>
<li>2</li>
...
</ul>
</li>
...
</ul>
I tried using BeautifulSoup like this:
from bs4 import BeautifulSoup as BS

bs = BS(source_xml, 'html.parser')
for i in bs.find_all('div'):
    i.name = 'li'
# but this only renames the div tags to li tags; I still need to add the ul tags
I can iterate through the nesting levels like this, but I still can't figure out how to group the tags located on the same level so that I can wrap them in a ul tag:
for i in bs.find_all('div', recursive=False):
    # how to wrap the following iterated items in a 'ul' tag?
    for j in i.find_all('div', recursive=False):
        ...
How can I add the <ul> tags in the right places? (I don't care about pretty-printing, etc.; I just need a valid HTML structure with ul and li tags. Thanks.)
Answer 0 (score: 1)
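Depending on how the HTML is formatted, you can simply look on each line for an opening tag without a closing tag (the start of a ul), an opening and closing tag together (an li), or a closing tag on its own (the end of a ul). Something like the code below. To make this more robust, you could use BeautifulSoup's NavigableString.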
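from bs4 import BeautifulSoup

x = """<div>a comment
<div>an answer</div>
<div>an answer
<div>a reply</div>
</div>
</div>"""
xs = x.split("\n")
html = ""  # accumulates the converted markup
for tag in xs:
    if "<div" in tag and "</div" in tag:
        # opening and closing tag on one line: emit an li
        soup = BeautifulSoup(tag, "html.parser")
        html = "{}\n{}".format(html, "<li>{}</li>".format(soup.text))
    elif "<div" in tag:
        # opening tag only: start a ul, with the line's text as its first li
        html = "{}\n{}".format(html, "<ul>\n<li>{}</li>".format(tag[tag.find(">") + 1:]))
    elif "</div" in tag:
        # closing tag only: end the current ul
        html = "{}\n{}".format(html, "</ul>")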
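
If the markup is not laid out one tag per line, the same open/close idea can be driven by parser events instead of line splitting. The following is only a rough sketch of that: it uses the standard library's html.parser rather than BeautifulSoup (so it needs nothing installed), the names DivToList and divs_to_list are made up for illustration, and it assumes each div's own text comes before its nested divs.

```python
from html import escape
from html.parser import HTMLParser


class DivToList(HTMLParser):
    """Hypothetical helper: rewrite nested <div> elements as nested <ul>/<li>."""

    def __init__(self):
        super().__init__()
        self.out = []
        # One flag per currently open <div>:
        # has a child <ul> already been opened inside it?
        self.has_child_list = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.has_child_list and not self.has_child_list[-1]:
            # First nested div of this parent: open the parent's sub-list.
            self.out.append("<ul>")
            self.has_child_list[-1] = True
        self.out.append("<li>")
        self.has_child_list.append(False)

    def handle_endtag(self, tag):
        if tag != "div" or not self.has_child_list:
            return
        if self.has_child_list.pop():
            # This div had nested divs, so close its sub-list first.
            self.out.append("</ul>")
        self.out.append("</li>")

    def handle_data(self, data):
        text = data.strip()
        # Assumption: text appears before any nested div of the same parent.
        if text and self.has_child_list:
            self.out.append(escape(text))


def divs_to_list(source):
    """Return the nested divs in `source` as one <ul> list (no pretty-printing)."""
    parser = DivToList()
    parser.feed(source)
    parser.close()
    return "<ul>" + "".join(parser.out) + "</ul>"


print(divs_to_list("<div>a comment<div>an answer</div><div>an answer<div>a reply</div></div></div>"))
# <ul><li>a comment<ul><li>an answer</li><li>an answer<ul><li>a reply</li></ul></li></ul></li></ul>
```

Unlike the line-based version, this nests each sub-list inside its parent li, which is what a valid HTML list requires. Text that follows a nested div inside the same parent would end up directly inside the ul, so that case would need extra handling.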