Python bs4: nested list walk reaches the end without visiting the earlier list elements

Asked: 2019-07-24 06:32:55

Tags: python parsing beautifulsoup

I have the following HTML structure, which I am parsing in Python with bs4.

<div class="sidebar-widget-content" id="atc-wrapper">
                <ul class="lists-rundown">

                        <li>
                            <a class="atc-group atc-group-active" href="" data-url="/atc-kodlari/1">
                                <i class="fa fa-lg fa-pulse fa-spinner atc-group-loading" style="margin-right: 5px; display: none;"></i>
                                A - Gastrointestinal kanal ve metabolizma
                                <span class="lists-rundown-no">(16)</span>
                            </a>
                            <ul style="margin-left: 40px; display: block;" class="atc-group-children">
                            <li style="border:1px solid #e1e1e1;"><a 
                            class="atc-group atc-group-active" href="" data- 
                            url="/atc-kodlari/2"><i class="fa fa-lg fa-pulse 
                            fa-spinner atc-group-loading" style="margin- 
                            right: 5px; display: none;"></i>A01 - 
                            Stomatolojik preparatlar<span class="lists- 
                            rundown-no">(1)</span></a><ul style="margin- 
                            left: 40px; display: block;" class="atc-group- 
                            children"><li style="border:1px solid #e1e1e1;"> 
                            <a class="atc-group" href="" data-url="/atc- 
                            kodlari/3"><i class="fa fa-lg fa-pulse fa- 
                            spinner atc-group-loading" style="margin- 
                            right:5px;display:none;"></i>A01A - Stomatolojik 
                            preparatlar<span class="lists-rundown-no">(4) 
                            </span></a><ul style="margin-left:40px;" 
                            class="atc-group-children"></ul></li></ul></li>
                            </ul>
                        </li>

                        <li>
                            <a class="atc-group" href="" data-url="/atc-kodlari/729">
                                <i class="fa fa-lg fa-pulse fa-spinner atc-group-loading" style="margin-right:5px;display:none;"></i>
                                B - Kan ve kan yapıcı organlar
                                <span class="lists-rundown-no">(5)</span>
                            </a>
                            <ul style="margin-left:40px;" class="atc-group-children">
                            </ul>
                        </li>     
                </ul>
            </div>


I am trying to parse it with:

def find_text(ul):
    for li in ul.find_all("li"):
        key = next(li.stripped_strings)
        print(key)
        ul = li.select_one("ul")
        if ul:
            find_text(ul)

find_text(source.select_one("#atc-wrapper > ul"))

It prints:

A - Gastrointestinal kanal ve metabolizma

B - Kan ve kan yapıcı organlar

C - KARDİYOVASKÜLER SİSTEM

D - DERMATOLOJİDE KULLANILAN İLAÇLAR

G - GENİTOÜRİNER SİSTEM VE SEKS HORMONLARI ...

It does not print the text of the inner li elements. The output should be:

A - 
  A01 -
    A01A -

What could be the reason the intermediate levels are skipped?

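Found the problem: the collapsed list elements are not present in the original page source, so they cannot be retrieved from it. How can I get the collapsed data?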
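One observation that may help here: every group link in the markup carries a `data-url` attribute, and the child `ul` next to it is empty until the group is expanded. It is only an assumption that the page loads the children from those URLs on click; the site's base URL and response format are not shown in the question. The sketch below therefore only lists the groups whose child list is still empty in the static source, together with their `data-url`. The sample `HTML` string and the name `pending` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed sample in the shape of the question's markup (illustrative data).
HTML = """
<ul class="lists-rundown">
  <li>
    <a class="atc-group" href="" data-url="/atc-kodlari/1">A - Gastrointestinal kanal ve metabolizma</a>
    <ul class="atc-group-children"></ul>
  </li>
  <li>
    <a class="atc-group" href="" data-url="/atc-kodlari/729">B - Kan ve kan yapıcı organlar</a>
    <ul class="atc-group-children"></ul>
  </li>
</ul>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Collect (label, data-url) for every group whose child list has no <li> yet.
pending = []
for a in soup.select("a.atc-group[data-url]"):
    child_ul = a.find_next_sibling("ul")
    if child_ul is not None and child_ul.find("li") is None:
        pending.append((a.get_text(strip=True), a["data-url"]))

for label, url in pending:
    print(label, "->", url)
```

Whether those paths can be requested directly, and what they return, would have to be checked in the browser's network tab before building on this.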

1 Answer:

Answer 0 (score: 0)

After some trial and error (even the most beautiful soup can be opaque at times :-), I think I have code that works:

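from bs4 import Tag

def find_text(ul, depth):
    # Start at the first <li> below this node, then walk its siblings.
    li = ul.find('li')
    while li is not None:
        key = next(li.stripped_strings)
        print(f'{depth}: {key}')
        for child in li.children:
            # Recurse into tags only: text nodes are str subclasses, so
            # their find() is str.find rather than a tree search.
            if isinstance(child, Tag):
                find_text(child, depth + 1)
        li = li.find_next_sibling('li')

# soup is the BeautifulSoup object built from the HTML above
find_text(soup, 0)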

This outputs:

0: A - Gastrointestinal kanal ve metabolizma
1: A01 - 
                            Stomatolojik preparatlar
2: A01A - Stomatolojik 
                            preparatlar
0: B - Kan ve kan yapıcı organlar

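If you do not want the line breaks or the integer that shows the depth in the tree, remove them. To collapse the whitespace you can, for example, use a regular expression like this: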

import re
ws_re = re.compile(r'\s+')

Then replace:

key = next(li.stripped_strings)
# with
key = ws_re.sub(' ', next(li.stripped_strings))

The output then looks like this:

0: A - Gastrointestinal kanal ve metabolizma
1: A01 - Stomatolojik preparatlar
2: A01A - Stomatolojik preparatlar
0: B - Kan ve kan yapıcı organlar
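For comparison, here is a self-contained sketch of the same walk that relies on `recursive=False`. By default `find_all` searches all descendants, so a recursive function that calls `ul.find_all("li")` reaches nested items more than once; restricting each search to direct children visits every `li` exactly once at its own depth. The sample `HTML` is a trimmed copy of the markup in the question, and the names `walk`, `WS_RE` and `rows` are invented for this example.

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup (illustrative sample data).
HTML = """
<div id="atc-wrapper">
  <ul class="lists-rundown">
    <li>
      <a class="atc-group" href="" data-url="/atc-kodlari/1">
        A - Gastrointestinal kanal ve metabolizma
        <span class="lists-rundown-no">(16)</span>
      </a>
      <ul class="atc-group-children">
        <li><a class="atc-group" href="" data-url="/atc-kodlari/2">A01 -
            Stomatolojik preparatlar<span class="lists-rundown-no">(1)</span></a>
          <ul class="atc-group-children">
            <li><a class="atc-group" href="" data-url="/atc-kodlari/3">A01A - Stomatolojik
                preparatlar<span class="lists-rundown-no">(4)</span></a>
              <ul class="atc-group-children"></ul>
            </li>
          </ul>
        </li>
      </ul>
    </li>
    <li>
      <a class="atc-group" href="" data-url="/atc-kodlari/729">
        B - Kan ve kan yapıcı organlar
        <span class="lists-rundown-no">(5)</span>
      </a>
      <ul class="atc-group-children"></ul>
    </li>
  </ul>
</div>
"""

WS_RE = re.compile(r"\s+")


def walk(ul, depth=0):
    """Yield (depth, label) for each <li>, looking at direct children only."""
    for li in ul.find_all("li", recursive=False):
        # The first non-empty string of the <li> is the group label.
        label = WS_RE.sub(" ", next(li.stripped_strings))
        yield depth, label
        child_ul = li.find("ul", recursive=False)
        if child_ul is not None:
            yield from walk(child_ul, depth + 1)


soup = BeautifulSoup(HTML, "html.parser")
rows = list(walk(soup.select_one("#atc-wrapper > ul")))
for depth, label in rows:
    print("  " * depth + label)
```

Yielding `(depth, label)` pairs instead of printing inside the recursion keeps the traversal separate from the formatting, so the same function can feed an indented printout or a nested data structure.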