I have the following HTML structure, which I am parsing in Python with bs4 (BeautifulSoup).
<div class="sidebar-widget-content" id="atc-wrapper">
<ul class="lists-rundown">
<li>
<a class="atc-group atc-group-active" href="" data-url="/atc-kodlari/1">
<i class="fa fa-lg fa-pulse fa-spinner atc-group-loading" style="margin-right: 5px; display: none;"></i>
A - Gastrointestinal kanal ve metabolizma
<span class="lists-rundown-no">(16)</span>
</a>
<ul style="margin-left: 40px; display: block;" class="atc-group-children">
<li style="border:1px solid #e1e1e1;"><a
class="atc-group atc-group-active" href="" data-
url="/atc-kodlari/2"><i class="fa fa-lg fa-pulse
fa-spinner atc-group-loading" style="margin-
right: 5px; display: none;"></i>A01 -
Stomatolojik preparatlar<span class="lists-
rundown-no">(1)</span></a><ul style="margin-
left: 40px; display: block;" class="atc-group-
children"><li style="border:1px solid #e1e1e1;">
<a class="atc-group" href="" data-url="/atc-
kodlari/3"><i class="fa fa-lg fa-pulse fa-
spinner atc-group-loading" style="margin-
right:5px;display:none;"></i>A01A - Stomatolojik
preparatlar<span class="lists-rundown-no">(4)
</span></a><ul style="margin-left:40px;"
class="atc-group-children"></ul></li></ul></li>
</ul>
</li>
<li>
<a class="atc-group" href="" data-url="/atc-kodlari/729">
<i class="fa fa-lg fa-pulse fa-spinner atc-group-loading" style="margin-right:5px;display:none;"></i>
B - Kan ve kan yapıcı organlar
<span class="lists-rundown-no">(5)</span>
</a>
<ul style="margin-left:40px;" class="atc-group-children">
</ul>
</li>
</ul>
</div>
I am trying to parse it with:
def find_text(ul):
    for li in ul.find_all("li"):
        key = next(li.stripped_strings)
        print(key)
        ul = li.select_one("ul")
        if ul:
            find_text(ul)

find_text(source.select_one("#atc-wrapper > ul"))
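It prints:
A - Gastrointestinal kanal ve metabolizma
B - Kan ve kan yapıcı organlar
C - KARDİYOVASKÜLER SİSTEM
D - DERMATOLOJİDE KULLANILAN İLAÇLAR
G - GENİTOÜRİNER SİSTEM VE SEKS HORMONLARI ...
It does not print the text of the inner li elements. The output should be: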
A -
A01 -
A01A -
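What could be causing the intermediate levels to be skipped?
Found the problem: the collapsed list elements are not present in the original page source, so they cannot be retrieved from it. How can I get the collapsed data?
Answer 0 (score: 0)
After experimenting for a while (even the most beautiful soup can be opaque at times :-), I think I have code that works: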
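from bs4 import Tag

def find_text(ul, depth):
    # Start at the first <li> below this node, then walk its siblings.
    li = ul.find('li')
    while li is not None:
        key = next(li.stripped_strings)
        print(f'{depth}: {key}')
        for child in li.children:
            # Recurse only into tags. Text nodes are str subclasses, so their
            # find() is str.find and would return -1 instead of None.
            if isinstance(child, Tag):
                find_text(child, depth+1)
        li = li.find_next_sibling('li')

find_text(soup, 0)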
This outputs:
0: A - Gastrointestinal kanal ve metabolizma
1: A01 -
Stomatolojik preparatlar
2: A01A - Stomatolojik
preparatlar
0: B - Kan ve kan yapıcı organlar
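If you would rather not see the line breaks and the integer indicating the depth in the tree, remove them. To collapse the whitespace you can, for example, use a regular expression such as:
import re
ws_re = re.compile(r'\s+')
Then replace: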
key = next(li.stripped_strings)
# by
key = ws_re.sub(' ', next(li.stripped_strings))
The output then looks like this:
0: A - Gastrointestinal kanal ve metabolizma
1: A01 - Stomatolojik preparatlar
2: A01A - Stomatolojik preparatlar
0: B - Kan ve kan yapıcı organlar
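For reference, here is a minimal self-contained sketch of the same idea that can be run without the original page. It is only one possible variant, not the code above: the `HTML` string is a reduced, illustrative copy of the question's markup, and the names `collect_labels` and `WS_RE` are made up for this example. It assumes every `<li>` holds its label in a direct child `<a>` and its sub-items in a direct child `<ul>`, as in the posted snippet, and it uses `recursive=False` so that each `<li>` is visited exactly once at its own depth.

```python
import re
from bs4 import BeautifulSoup

# Reduced, illustrative copy of the markup from the question.
HTML = """
<div id="atc-wrapper">
<ul class="lists-rundown">
<li><a class="atc-group" data-url="/atc-kodlari/1">A - Gastrointestinal kanal ve metabolizma
<span class="lists-rundown-no">(16)</span></a>
<ul class="atc-group-children">
<li><a class="atc-group" data-url="/atc-kodlari/2">A01 -
Stomatolojik preparatlar<span class="lists-rundown-no">(1)</span></a>
<ul class="atc-group-children">
<li><a class="atc-group" data-url="/atc-kodlari/3">A01A - Stomatolojik
preparatlar<span class="lists-rundown-no">(4)</span></a>
<ul class="atc-group-children"></ul></li>
</ul></li>
</ul></li>
<li><a class="atc-group" data-url="/atc-kodlari/729">B - Kan ve kan yapıcı organlar
<span class="lists-rundown-no">(5)</span></a>
<ul class="atc-group-children"></ul></li>
</ul>
</div>
"""

# Collapses any run of whitespace (including line breaks) into one space.
WS_RE = re.compile(r"\s+")


def collect_labels(ul, depth=0):
    """Return (depth, label) pairs for the <li> items nested under `ul`."""
    rows = []
    # recursive=False limits the search to direct children of `ul`.
    for li in ul.find_all("li", recursive=False):
        link = li.find("a", recursive=False)
        if link is not None:
            # The first stripped string of the link is the label; the count
            # inside the <span> comes later and is ignored here.
            label = WS_RE.sub(" ", next(link.stripped_strings, ""))
            rows.append((depth, label))
        child_ul = li.find("ul", recursive=False)
        if child_ul is not None:
            rows.extend(collect_labels(child_ul, depth + 1))
    return rows


soup = BeautifulSoup(HTML, "html.parser")
rows = collect_labels(soup.select_one("#atc-wrapper > ul"))
for depth, label in rows:
    print(f"{depth}: {label}")
```

On this sample the sketch prints the same four lines as the cleaned-up output above. Returning a list of pairs rather than printing inside the recursion makes the result easier to reuse. Note that this only sees items that are actually present in the parsed HTML; anything the page loads later is outside its reach.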