How to use BeautifulSoup to get the text and href of a specific ul element that has no class or id

Time: 2018-04-27 12:18:43

Tags: python html web-scraping beautifulsoup

I scraped the following HTML from a mailing-list page:

<ul>
<li> <b>Messages sorted by:</b>
<a href="thread.html#start">[ thread ]</a>
<a href="author.html#start">[ author ]</a>
<a href="date.html#start">[ date ]</a>
<li><b><a href="https://mail.kde.org/mailman/listinfo/okular-devel">More info on this list...
                    </a></b></li>
</li></ul>, 


<ul>
<li><a href="000006.html">[Okular-devel] "why okular is cool and what's our focus" text
</a><a name="6"> </a>
<i>Albert Astals Cid
</i>
<li><a href="000000.html">[Okular-devel] playground/graphics/okular
</a><a name="0"> </a>
<i>Tobias Koenig
</i>
<li><a href="000001.html">[Okular-devel] playground/graphics/okular
</a><a name="1"> </a>
<i>Tobias Koenig
</i>
<li><a href="000004.html">[Okular-devel] Rotation &amp; object rects
</a><a name="4"> </a>
<i>Pino Toscano
</i>
<li><a href="000005.html">[Okular-devel] Rotation &amp; object rects
</a><a name="5"> </a>
<i>Albert Astals Cid
</i>
<li><a href="000002.html">[Okular-devel] Slow painting on QImage
</a><a name="2"> </a>
<i>Tobias Koenig
</i>
<li><a href="000003.html">[Okular-devel] Slow painting on QImage
</a><a name="3"> </a>
<i>Albert Astals Cid
</i>
</li></li></li></li></li></li></li></ul>, 


<ul>
<li> <b>Messages sorted by:</b>
<a href="thread.html#start">[ thread ]</a>
<a href="author.html#start">[ author ]</a>
<a href="date.html#start">[ date ]</a>
<li><b><a href="https://mail.kde.org/mailman/listinfo/okular-devel">More info on this list...
                    </a></b></li>
</li></ul>

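As you can see, there are three <ul> elements containing li elements. I only want the li elements of the second <ul> (the only one with uppercase <LI> tags), and the output should look like: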

[Okular-devel] "why okular is cool and what's our focus" text - 000006.html
[Okular-devel] playground/graphics/okular - 000000.html
[Okular-devel] playground/graphics/okular - 000001.html
[Okular-devel] Rotation & object rects - 000004.html
and so on...

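The format is the text of each <LI> element followed by its associated href link. My code returns the li elements of all the <ul> elements, the output is repeated 2-3 times, and I cannot extract the href along with the text.
My code: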

for ele in soup.find_all('ul'):
    for litag in ele.find_all('li'):
        for link in litag.find_all('href'):
            print(litag.text + '-' + link)

It does not give me the output I need. What should I do?

2 answers:

Answer 0: (score: 0)

Parsed from the HTML you provided.

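from bs4 import BeautifulSoup
soup = BeautifulSoup(s, "html.parser")   # s holds the HTML shown in the question

for el in soup.find_all('ul'):
    for i in el.find_all("li"):
        # html.parser nests each unclosed <li> inside the previous one,
        # so i.li is the next list item (None for the last one)
        if i.find("li"):
            print(i.li.a.text.strip(), "---", i.li.a['href'].strip())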

Output:

More info on this list... --- https://mail.kde.org/mailman/listinfo/okular-devel
[Okular-devel] playground/graphics/okular --- 000000.html
[Okular-devel] playground/graphics/okular --- 000001.html
[Okular-devel] Rotation & object rects --- 000004.html
[Okular-devel] Rotation & object rects --- 000005.html
[Okular-devel] Slow painting on QImage --- 000002.html
[Okular-devel] Slow painting on QImage --- 000003.html
More info on this list... --- https://mail.kde.org/mailman/listinfo/okular-devel
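The nesting that this answer relies on can be seen in a minimal sketch. The three-item snippet below is a shortened, hypothetical stand-in for the mailing-list markup, not the real page; it assumes only that the <li> tags are left unclosed, as the closing run of </li> tags in the question suggests.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the page: the <li> tags are never closed.
snippet = (
    '<ul>'
    '<li><a href="a.html">A</a>'
    '<li><a href="b.html">B</a>'
    '<li><a href="c.html">C</a>'
    '</ul>'
)

soup = BeautifulSoup(snippet, "html.parser")
items = soup.find_all("li")

# html.parser does not close <li> implicitly, so every item except the last
# contains the next item as a descendant.
nested = [li.find("li") is not None for li in items]
print(nested)

# li.a is the first anchor inside the item, which is the item's own link,
# so no item is skipped.
own_links = [(li.a.get_text(strip=True), li.a["href"]) for li in items]
print(own_links)
```

Reading `li.a` rather than `li.li.a` keeps the first item, which is why the output above starts at 000000.html instead of 000006.html.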

Answer 1: (score: 0)

You need to search for the anchor tags. Because html.parser leaves each unclosed <li> nested inside the previous one, search the second <ul> for anchors directly; looping over every <li> first would print the later links several times:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")   # html holds the page source
ele = soup.find_all('ul')[1]     # use only the 2nd one

# every anchor with an href inside this <ul>, in document order
for link in ele.find_all('a', href=True):
    print('{} - {}'.format(link.get_text(strip=True), link['href']))

Giving you:

[Okular-devel] "why okular is cool and what's our focus" text - 000006.html
[Okular-devel] playground/graphics/okular - 000000.html
[Okular-devel] playground/graphics/okular - 000001.html
[Okular-devel] Rotation & object rects - 000004.html
[Okular-devel] Rotation & object rects - 000005.html
[Okular-devel] Slow painting on QImage - 000002.html
[Okular-devel] Slow painting on QImage - 000003.html

Adding href=True ensures that only tags that have an href attribute are returned.
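As a small illustration of that filter, the sketch below parses a single list item modelled on the question's HTML (the subject and author text are placeholders, not real list data). It also shows why find_all('href') in the question's code matched nothing: href is an attribute, not a tag name.

```python
from bs4 import BeautifulSoup

# One list item modelled on the question's markup; the text is a placeholder.
item = (
    '<li><a href="000006.html">[Okular-devel] example subject\n</a>'
    '<a name="6"> </a>\n<i>Example Author\n</i></li>'
)
soup = BeautifulSoup(item, "html.parser")

by_tag_name = soup.find_all("href")        # searches for <href> tags: none exist
all_anchors = soup.find_all("a")           # both anchors, with or without href
with_href = soup.find_all("a", href=True)  # only anchors that carry an href

print(len(by_tag_name), len(all_anchors), len(with_href))

for link in with_href:
    print("{} - {}".format(link.get_text(strip=True), link["href"]))
```

Without href=True, indexing link['href'] on the <a name="6"> anchor would raise a KeyError. The CSS selector form soup.select('a[href]') should select the same anchors.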