How do I parse nested HTML blocks into a list using Python BeautifulSoup?

Time: 2014-10-07 16:32:45

Tags: python beautifulsoup

I am trying to convert a structure like this (some nested XML/HTML):

<div>a comment
  <div>an answer</div>
  <div>an answer
    <div>a reply</div>
    ...
  </div>
  ...
</div>
...

Clarification: the input may be formatted as <div>a comment<div>an answer</div> or in any other way (it is not necessarily prettified, etc.)

(There are multiple nodes at different depths.)

into the corresponding list structure with parent <ul> tags (i.e. an ordinary HTML list):

<ul>
  <li>1
    <ul>
      <li>2</li>
      ...
   </ul>
  </li>
  ...
</ul>

I tried using BeautifulSoup like this:

from bs4 import BeautifulSoup as BS

bs = BS(source_xml, 'html.parser')
for i in bs.find_all('div'):
    i.name = 'li'

# but this only renames the div tags to li tags; I still need to add the ul tags

I can iterate through the nesting levels like this, but I still can't figure out how to group the tags located on the same level so that I can wrap them in a ul tag:
for i in bs.find_all('div', recursive=False):
    # how to wrap the following iterated items in 'ul' tag?
    for j in i.find_all('div', recursive=False):
         ...

How do I add the <ul> tags in the right places? (I don't care about pretty printing, etc.; I just need a valid HTML structure with ul and li tags. Thanks.)

1 Answer:

Answer 0: (score: 1)

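Depending on how the HTML is formatted, you can simply scan it line by line: an opening tag without a closing tag marks the start of a ul, an opening and closing tag together is an li, and a closing tag on its own marks the end of a ul. Something like the code below. To make this more robust, you could use BeautifulSoup's NavigableString.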

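from bs4 import BeautifulSoup

x = """<div>a comment
  <div>an answer</div>
  <div>an answer
    <div>a reply</div>
  </div>
</div>"""

xs = x.split("\n")

html = ""  # accumulates the generated list markup

for tag in xs:
    if "<div" in tag and "</div" in tag:
        # opening and closing tag on the same line: a leaf item
        soup = BeautifulSoup(tag, "html.parser")
        html = "{}\n{}".format(html, "<li>{}</li>".format(soup.text.strip()))
    elif "<div" in tag:
        # opening tag only: start a new list and emit the text as its first item
        # (note: the nested <ul> ends up next to the <li>, not inside it)
        html = "{}\n{}".format(html, "<ul>\n<li>{}</li>".format(tag[tag.find(">") + 1:]))
    elif "</div" in tag:
        # closing tag only: end the current list
        html = "{}\n{}".format(html, "</ul>")

print(html)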
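
The line-based approach above depends on each tag sitting on its own line. As a rough sketch of the more robust variant hinted at, the parsed tree can be walked recursively instead, so that the formatting of the input no longer matters. The helper names div_to_li and divs_to_list are made up for this illustration, and the sketch assumes that only text and nested <div> elements matter (any other tags are ignored, and the text of a node is placed before its nested list).

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


def div_to_li(div, soup):
    """Build a new <li> for `div`; its child <div>s go into one nested <ul>.

    Hypothetical helper for this sketch, not part of BeautifulSoup.
    """
    li = soup.new_tag("li")
    texts = []
    sub_ul = None
    for child in div.children:
        if isinstance(child, Tag) and child.name == "div":
            # All sibling <div>s on this level share a single <ul>.
            if sub_ul is None:
                sub_ul = soup.new_tag("ul")
            sub_ul.append(div_to_li(child, soup))
        elif isinstance(child, NavigableString):
            text = child.strip()
            if text:
                texts.append(text)
    if texts:
        li.append(" ".join(texts))
    if sub_ul is not None:
        li.append(sub_ul)  # nested list lives inside its parent <li>
    return li


def divs_to_list(source):
    """Convert top-level <div>s in `source` into one <ul> of nested <li>s."""
    soup = BeautifulSoup(source, "html.parser")
    ul = soup.new_tag("ul")
    for div in soup.find_all("div", recursive=False):
        ul.append(div_to_li(div, soup))
    return ul


if __name__ == "__main__":
    sample = "<div>a comment<div>an answer</div><div>an answer<div>a reply</div></div></div>"
    print(divs_to_list(sample))
```

Because a fresh tree is built rather than renaming tags in place, the original soup is left untouched while iterating, and each nested <ul> is placed inside its parent <li>, which is what valid HTML lists require.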