Parsing a nested HTML list with BeautifulSoup

Asked: 2013-07-25 05:56:49

Tags: python dictionary html-parsing beautifulsoup

I need to parse a nested HTML list and convert it into a parent-child dict. Given this list:

<ul>
  <li>Operating System
    <ul>
      <li>Linux
        <ul>
          <li>Debian</li>
          <li>Fedora</li>
          <li>Ubuntu</li>
        </ul>
      </li>
      <li>Windows</li>
      <li>OS X</li>
    </ul>
  </li>
  <li>Programming Languages
    <ul>
      <li>Python</li>
      <li>C#</li>
      <li>Ruby</li>
    </ul>
  </li>
</ul>

I want to convert it into a dictionary like this:

{
    'Operating System': {
        'Linux': {
            'Debian': None,
            'Fedora': None,
            'Ubuntu': None,
        },
        'Windows': None,
        'OS X': None,
    },
    'Programming Languages': {
        'Python': None,
        'C#': None,
        'Ruby': None,
    }
}

My first attempt used find_all('li', recursive=False). It returns the top-level items (Operating System and Programming Languages), but it also returns the sub-items.

How can I do this with BeautifulSoup?

1 answer:

Answer 0 (score: 8)

Here is one way:

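def dictify(ul):
    """Recursively convert a <ul> tag into a nested dict (leaf items map to None)."""
    result = {}
    # recursive=False limits the search to direct <li> children of this <ul>
    for li in ul.find_all("li", recursive=False):
        # The first non-empty string inside the <li> is the item's own label
        key = next(li.stripped_strings)
        # Use a separate name so the ul parameter is not shadowed
        sub_ul = li.find("ul")
        if sub_ul:
            result[key] = dictify(sub_ul)
        else:
            result[key] = None
    return result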
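
The recursive=False argument is what keeps the levels apart: it restricts find_all to direct children of the tag it is called on, so it has to be called on the <ul> itself rather than on the whole document. A small sketch of the difference, assuming Python 3 and the built-in "html.parser" (the answer does not name a parser); the sample markup and variable names here are made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical two-level list, just to show the two search modes.
html = "<ul><li>A<ul><li>A1</li><li>A2</li></ul></li><li>B</li></ul>"
soup = BeautifulSoup(html, "html.parser")
top_ul = soup.find("ul")

# Default (recursive=True): every <li> descendant, at any depth.
all_items = [next(li.stripped_strings) for li in top_ul.find_all("li")]

# recursive=False: only <li> tags that are direct children of top_ul.
direct_items = [
    next(li.stripped_strings)
    for li in top_ul.find_all("li", recursive=False)
]

print(all_items)     # ['A', 'A1', 'A2', 'B']
print(direct_items)  # ['A', 'B']
```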

Usage example:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("""
... <ul>
...   <li>Operating System
...     <ul>
...       <li>Linux
...         <ul>
...           <li>Debian</li>
...           <li>Fedora</li>
...           <li>Ubuntu</li>
...         </ul>
...       </li>
...       <li>Windows</li>
...       <li>OS X</li>
...     </ul>
...   </li>
...   <li>Programming Languages
...     <ul>
...       <li>Python</li>
...       <li>C#</li>
...       <li>Ruby</li>
...     </ul>
...   </li>
... </ul>
... """)
>>> ul = soup.body.ul
>>> from pprint import pprint
>>> pprint(dictify(ul), width=1)
{u'Operating System': {u'Linux': {u'Debian': None,
                                  u'Fedora': None,
                                  u'Ubuntu': None},
                       u'OS X': None,
                       u'Windows': None},
 u'Programming Languages': {u'C#': None,
                            u'Python': None,
                            u'Ruby': None}}
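
The session above is from Python 2 (note the u'' prefixes) and relies on a parser that wraps the fragment in <body>, as lxml does. What follows is a hedged sketch of the same run on Python 3, not part of the original answer. It assumes the built-in "html.parser", which does not add a <body> tag, so the list is located with soup.find("ul") instead of soup.body.ul:

```python
from bs4 import BeautifulSoup


def dictify(ul):
    """Convert a <ul> tag into a nested dict; leaf items map to None."""
    result = {}
    for li in ul.find_all("li", recursive=False):
        key = next(li.stripped_strings)
        child = li.find("ul")
        result[key] = dictify(child) if child is not None else None
    return result


html = """
<ul>
  <li>Operating System
    <ul>
      <li>Linux
        <ul><li>Debian</li><li>Fedora</li><li>Ubuntu</li></ul>
      </li>
      <li>Windows</li>
      <li>OS X</li>
    </ul>
  </li>
  <li>Programming Languages
    <ul><li>Python</li><li>C#</li><li>Ruby</li></ul>
  </li>
</ul>
"""

# Explicit parser; with html.parser, soup.body would be None here.
soup = BeautifulSoup(html, "html.parser")
tree = dictify(soup.find("ul"))

# On Python 3.7+ the dict keeps the document order of the items.
print(list(tree["Operating System"]))  # ['Linux', 'Windows', 'OS X']
```

pprint sorts keys by default, which is why 'OS X' appears before 'Windows' in the output above even though the HTML lists them the other way round.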