I scraped the following HTML from a mailing-list page:
<ul>
<li> <b>Messages sorted by:</b>
<a href="thread.html#start">[ thread ]</a>
<a href="author.html#start">[ author ]</a>
<a href="date.html#start">[ date ]</a>
<li><b><a href="https://mail.kde.org/mailman/listinfo/okular-devel">More info on this list...
</a></b></li>
</li></ul>,
<ul>
<li><a href="000006.html">[Okular-devel] "why okular is cool and what's our focus" text
</a><a name="6"> </a>
<i>Albert Astals Cid
</i>
<li><a href="000000.html">[Okular-devel] playground/graphics/okular
</a><a name="0"> </a>
<i>Tobias Koenig
</i>
<li><a href="000001.html">[Okular-devel] playground/graphics/okular
</a><a name="1"> </a>
<i>Tobias Koenig
</i>
<li><a href="000004.html">[Okular-devel] Rotation & object rects
</a><a name="4"> </a>
<i>Pino Toscano
</i>
<li><a href="000005.html">[Okular-devel] Rotation & object rects
</a><a name="5"> </a>
<i>Albert Astals Cid
</i>
<li><a href="000002.html">[Okular-devel] Slow painting on QImage
</a><a name="2"> </a>
<i>Tobias Koenig
</i>
<li><a href="000003.html">[Okular-devel] Slow painting on QImage
</a><a name="3"> </a>
<i>Albert Astals Cid
</i>
</li></li></li></li></li></li></li></ul>,
<ul>
<li> <b>Messages sorted by:</b>
<a href="thread.html#start">[ thread ]</a>
<a href="author.html#start">[ author ]</a>
<a href="date.html#start">[ date ]</a>
<li><b><a href="https://mail.kde.org/mailman/listinfo/okular-devel">More info on this list...
</a></b></li>
</li></ul>
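As you can see, there are three <ul> elements that contain li elements. I only want the li elements of the second <ul>, the one whose <LI> tags are uppercase in the page source, and the output should look like: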
[Okular-devel] "why okular is cool and what's our focus" text - 000006.html
[Okular-devel] playground/graphics/okular - 000000.html
[Okular-devel] playground/graphics/okular - 000001.html
[Okular-devel] Rotation & object rects - 000004.html
and so on...
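The format is the text of the <LI> element followed by its associated href link. My code returns the li elements of every <ul>, the output is repeated 2-3 times, and I can't extract the href along with the text.
My code: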
for ele in soup.find_all('ul'):
    for litag in ele.find_all('li'):
        for link in litag.find_all('href'):
            print(litag.text + '-' + link)
It doesn't give me the output I need. What should I do?
Answer 0: (score: 0)
Parsing the HTML you provided:
from bs4 import BeautifulSoup

# s holds the HTML string shown in the question
soup = BeautifulSoup(s, "html.parser")
for el in soup.find_all('ul'):
    for i in el.find_all("li"):
        # only handle <li> elements that contain a nested <li>
        if i.find("li"):
            print(i.li.a.text.strip(), "---", i.li.a['href'].strip())
Output:
More info on this list... --- https://mail.kde.org/mailman/listinfo/okular-devel
[Okular-devel] playground/graphics/okular --- 000000.html
[Okular-devel] playground/graphics/okular --- 000001.html
[Okular-devel] Rotation & object rects --- 000004.html
[Okular-devel] Rotation & object rects --- 000005.html
[Okular-devel] Slow painting on QImage --- 000002.html
[Okular-devel] Slow painting on QImage --- 000003.html
More info on this list... --- https://mail.kde.org/mailman/listinfo/okular-devel
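The output above starts at 000000.html and includes the "More info" links because html.parser does not close the unclosed <li> tags for you: each one ends up nested inside the previous one, so i.li refers to the next item rather than the current one. Here is a minimal sketch of that nesting, using a made-up three-item snippet (the `sample` string and its file names are illustrative and not taken from the page):

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened snippet with unclosed <li> tags, like the list page.
sample = (
    '<ul>'
    '<li><a href="a.html">A</a>'
    '<li><a href="b.html">B</a>'
    '<li><a href="c.html">C</a>'
    '</ul>'
)

soup = BeautifulSoup(sample, "html.parser")
items = soup.find_all("li")

# With html.parser each unclosed <li> becomes a child of the previous one,
# so the first <li> contains every link and the last contains only its own.
links_per_li = [[a["href"] for a in li.find_all("a")] for li in items]
print(links_per_li)

# Restricting the search to direct children gives each <li> its own link.
own_links = [li.find("a", recursive=False)["href"] for li in items]
print(own_links)
```

The first list shrinks by one link per item, which is the same triangular pattern you get whenever you search each <li> recursively.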
Answer 1: (score: 0)
You need to search for the anchor tags:
from bs4 import BeautifulSoup

# html holds the page source shown in the question
soup = BeautifulSoup(html, "html.parser")
ele = soup.find_all('ul')[1]    # use only the 2nd one
for litag in ele.find_all('li'):
    for link in litag.find_all('a', href=True):
        print('{} - {}'.format(link.get_text(strip=True), link['href']))
This gives you:
[Okular-devel] "why okular is cool and what's our focus" text - 000006.html
[Okular-devel] playground/graphics/okular - 000000.html
[Okular-devel] playground/graphics/okular - 000001.html
[Okular-devel] Rotation & object rects - 000004.html
[Okular-devel] Rotation & object rects - 000005.html
[Okular-devel] Slow painting on QImage - 000002.html
[Okular-devel] Slow painting on QImage - 000003.html
[Okular-devel] playground/graphics/okular - 000000.html
[Okular-devel] playground/graphics/okular - 000001.html
[Okular-devel] Rotation & object rects - 000004.html
[Okular-devel] Rotation & object rects - 000005.html
[Okular-devel] Slow painting on QImage - 000002.html
[Okular-devel] Slow painting on QImage - 000003.html
[Okular-devel] playground/graphics/okular - 000001.html
[Okular-devel] Rotation & object rects - 000004.html
[Okular-devel] Rotation & object rects - 000005.html
[Okular-devel] Slow painting on QImage - 000002.html
[Okular-devel] Slow painting on QImage - 000003.html
[Okular-devel] Rotation & object rects - 000004.html
[Okular-devel] Rotation & object rects - 000005.html
[Okular-devel] Slow painting on QImage - 000002.html
[Okular-devel] Slow painting on QImage - 000003.html
[Okular-devel] Rotation & object rects - 000005.html
[Okular-devel] Slow painting on QImage - 000002.html
[Okular-devel] Slow painting on QImage - 000003.html
[Okular-devel] Slow painting on QImage - 000002.html
[Okular-devel] Slow painting on QImage - 000003.html
[Okular-devel] Slow painting on QImage - 000003.html
Adding href=True ensures that only tags that have an href attribute are returned.
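The repeated lines in the output come from the unclosed <li> tags: html.parser nests each one inside the previous one, so every <li> also returns the links of all the items after it. One way to print each message only once is to search the second <ul> for anchors directly instead of going through each <li>. This is a sketch rather than a definitive fix. The `html` string below is a shortened stand-in for the real page, and it assumes the second <ul> contains only message links.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the page: three <ul> blocks, unclosed <li> in the second.
html = """
<ul><li><a href="thread.html#start">[ thread ]</a></li></ul>
<ul>
<li><a href="000000.html">[Okular-devel] playground/graphics/okular
</a><a name="0"> </a>
<i>Tobias Koenig
</i>
<li><a href="000002.html">[Okular-devel] Slow painting on QImage
</a><a name="2"> </a>
<i>Tobias Koenig
</i>
</ul>
<ul><li><a href="date.html#start">[ date ]</a></li></ul>
"""

soup = BeautifulSoup(html, "html.parser")
second_ul = soup.find_all("ul")[1]  # use only the 2nd one

# Searching the <ul> itself visits every anchor exactly once, however the
# <li> tags are nested; href=True skips the <a name="..."> anchors.
lines = [
    "{} - {}".format(a.get_text(strip=True), a["href"])
    for a in second_ul.find_all("a", href=True)
]
print("\n".join(lines))
```

On the stand-in page this prints one line per message, in document order, with no repeats.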