Question

I'm trying to convert a structure like this (some nested XML/HTML):

<div>a comment
  <div>an answer</div>
  <div>an answer
    <div>a reply</div>
    ...
  </div>
  ...
</div>
...

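Clarification: it may be formatted as <div>a comment<div>an answer</div> or in any other way (not pretty-printed, etc.)

(there are multiple nodes at different depths)

into the corresponding list structure with parent <ul> tags (i.e. an ordinary HTML list):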

<ul>
  <li>1
    <ul>
      <li>2</li>
      ...
   </ul>
  </li>
  ...
</ul>

I tried using BeautifulSoup like this:

from bs4 import BeautifulSoup as BS

bs = BS(source_xml, 'html.parser')
for i in bs.find_all('div'):
    i.name = 'li'  # rename each div tag to li

# but this only replaces the div tags with li tags; I still need to add the ul tags

I can iterate through the nesting levels like this, but I still can't figure out how to separate a group of tags on the same level so that I can wrap them in a ul tag:
for i in bs.find_all('div', recursive=False):
    # how to wrap the following iterated items in 'ul' tag?
    for j in i.find_all('div', recursive=False):
         ...

How do I add the <ul> tags in the right places? (I don't care about pretty-printing etc.; I need a valid HTML structure with ul and li tags, thanks...)

Answer 1

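Depending on how the HTML is formatted, just search each line for an opening tag with no closing tag (which now marks the start of a ul), an opening and a closing tag together (which becomes an li), or just a closing tag (which marks the end of a ul). Something like the code below. To make this more robust, you could use BeautifulSoup's NavigableString.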

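from bs4 import BeautifulSoup

x = """<div>a comment
  <div>an answer</div>
  <div>an answer
    <div>a reply</div>
  </div>
</div>"""

xs = x.split("\n")
html = ""  # accumulates the converted markup

for tag in xs:
    if "<div" in tag and "</div" in tag:
        # opening and closing tag on the same line: a plain list item
        soup = BeautifulSoup(tag, "html.parser")
        html = "{}\n{}".format(html, "<li>{}</li>".format(soup.text))
    elif "<div" in tag:
        # opening tag only: start a new ul whose first item is this line's text
        html = "{}\n{}".format(html, "<ul>\n<li>{}</li>".format(tag[tag.find(">") + 1:]))
    elif "</div" in tag:
        # closing tag only: end the current ul
        html = "{}\n{}".format(html, "</ul>")

print(html)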
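
The line-by-line scan above only works when every tag sits on its own line, and the question says the input may just as well arrive on a single line. As a rough sketch (not part of the original answer), the same open-tag/close-tag idea can be applied to a stream of parsed tags instead of to lines. The version below uses the standard library's html.parser rather than BeautifulSoup; the names DivToList and divs_to_list are made up for illustration, and it assumes the <div> tags are properly balanced.

```python
from html import escape
from html.parser import HTMLParser


class DivToList(HTMLParser):
    """Rewrite nested <div> elements as nested <ul>/<li> lists (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.out = []
        # One flag per open element: is a <ul> currently open inside it?
        # The first entry stands for the document root.
        self.list_open = [False]

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if not self.list_open[-1]:
            # First child div at this level: open the wrapping <ul>.
            self.out.append("<ul>")
            self.list_open[-1] = True
        self.out.append("<li>")
        self.list_open.append(False)

    def handle_endtag(self, tag):
        if tag != "div" or len(self.list_open) == 1:
            return
        if self.list_open.pop():
            # This div had child divs, so close their <ul> first.
            self.out.append("</ul>")
        self.out.append("</li>")

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.list_open[-1]:
            # Text that follows child divs: close their <ul> before emitting it.
            self.out.append("</ul>")
            self.list_open[-1] = False
        self.out.append(escape(text))

    def result(self):
        tail = "</ul>" if self.list_open[0] else ""
        return "".join(self.out) + tail


def divs_to_list(source):
    parser = DivToList()
    parser.feed(source)
    parser.close()  # flush any buffered trailing text
    return parser.result()


print(divs_to_list("<div>a comment<div>an answer</div></div>"))
# <ul><li>a comment<ul><li>an answer</li></ul></li></ul>
```

Because the <ul> is opened lazily on the first child <div> and closed when the parent <div> ends, sibling items on the same level end up inside one shared list regardless of how the source is indented.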
