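How do I get the href of the a tags inside a ul that is repeated several times?

Time: 2019-09-16 02:49:44

Tags: python html beautifulsoup

I want to get the href of every a tag inside a ul that appears several times on the page. For example: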

<div class="contain">
    <div id="0">
        <ul class="drop">
            <li><a href="some_link"></a></li>
            <li><a href="some_link_1"></a></li>
            <li><a href="some_link_2"></a></li>
            <li><a href="some_link_3"></a></li>
        </ul>
    </div>
</div>
<div class="contain">
    <div id="1">
        <ul class="drop">
            <li><a href="some_link_4"></a></li>
            <li><a href="some_link_5"></a></li>
            <li><a href="some_link_6"></a></li>
            <li><a href="some_link_7"></a></li>
        </ul>
    </div>
</div>
<div class="contain">
    <div id="a">
        <ul class="drop">
            <li><a href="some_link_7"></a></li>
            <li><a href="some_link_8"></a></li>
            <li><a href="some_link_9"></a></li>
            <li><a href="some_link"></a></li>
        </ul>
    </div>
</div>

What I want is to collect all of the hrefs contained in this markup. How can I do that?

2 answers:

Answer 0: (score: 1)

from bs4 import BeautifulSoup

html = '''<div class="contain">
    <div id="0">
        <ul class="drop">
            <li><a href="some_link"></a></li>
            <li><a href="some_link_1"></a></li>
            <li><a href="some_link_2"></a></li>
            <li><a href="some_link_3"></a></li>
        </ul>
    </div>
</div>
<div class="contain">
    <div id="1">
        <ul class="drop">
            <li><a href="some_link_4"></a></li>
            <li><a href="some_link_5"></a></li>
            <li><a href="some_link_6"></a></li>
            <li><a href="some_link_7"></a></li>
        </ul>
    </div>
</div>
<div class="contain">
    <div id="a">
        <ul class="drop">
            <li><a href="some_link_7"></a></li>
            <li><a href="some_link_8"></a></li>
            <li><a href="some_link_9"></a></li>
            <li><a href="some_link"></a></li>
        </ul>
    </div>
</div>'''

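# Name the parser explicitly so bs4 does not warn about guessing one
soup = BeautifulSoup(html, 'html.parser')

# href=True matches only the a tags that actually carry an href attribute
for a in soup.find_all('a', href=True):
    print("The URL:", a['href'])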

This prints every href:

The URL: some_link
The URL: some_link_1
The URL: some_link_2
The URL: some_link_3
The URL: some_link_4
The URL: some_link_5
The URL: some_link_6
The URL: some_link_7
The URL: some_link_7
The URL: some_link_8
The URL: some_link_9
The URL: some_link

To get all of the links as a list, you can simply use:

href_links = [a['href'] for a in soup.find_all('a', href=True)]
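Note that the sample markup repeats some_link and some_link_7, so the list above contains duplicates. If only distinct links are wanted, a minimal sketch might look like the following. It assumes the order of first appearance should be kept; the shortened `html` string and the name `unique_links` are illustrative and not part of the original answer.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's markup; note the repeated hrefs.
html = '''
<ul class="drop">
    <li><a href="some_link"></a></li>
    <li><a href="some_link_7"></a></li>
</ul>
<ul class="drop">
    <li><a href="some_link_7"></a></li>
    <li><a href="some_link"></a></li>
    <li><a href="some_link_8"></a></li>
</ul>
'''

soup = BeautifulSoup(html, 'html.parser')

# All hrefs in document order, duplicates included.
href_links = [a['href'] for a in soup.find_all('a', href=True)]

# dict.fromkeys keeps the first occurrence of each key in insertion order
# (guaranteed for dicts since Python 3.7).
unique_links = list(dict.fromkeys(href_links))
print(unique_links)
```

Using a plain set() would also remove the duplicates, but it would not preserve the document order.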

Answer 1: (score: 1)

Since you said you want all of the hrefs inside the ul elements, this would be more accurate:

links = [i['href'] for i in soup.select('.drop [href]')]

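This combines the class name of the parent ul (a class selector, the second-fastest kind of selector) with a descendant [href] attribute selector, so it picks up every element that has an href attribute (here, the a tags) anywhere inside the parent ul.

With the other answer, you get every a tag in the document that has an href attribute, whether or not it sits inside a parent ul.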
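The difference between the two answers only shows up when the page also has links outside the ul. A small sketch of that case follows; the markup, including outside_link and the a tag without an href, is hypothetical and added purely for illustration, and `select` assumes a Beautiful Soup version with CSS selector support (4.7 or later with soupsieve).

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one link outside any ul, two inside ul.drop,
# and one a tag that has no href at all.
html = '''
<a href="outside_link"></a>
<ul class="drop">
    <li><a href="some_link"></a></li>
    <li><a href="some_link_1"></a></li>
    <li><a></a></li>
</ul>
'''

soup = BeautifulSoup(html, 'html.parser')

# Answer 0: every a tag with an href, anywhere in the document.
all_links = [a['href'] for a in soup.find_all('a', href=True)]

# Answer 1: any element with an href that is a descendant of .drop.
drop_links = [el['href'] for el in soup.select('.drop [href]')]

# A stricter variant that also requires the element to be an a tag inside a ul.
anchor_links = [a['href'] for a in soup.select('ul.drop a[href]')]

print(all_links)
print(drop_links)
print(anchor_links)
```

Here all_links includes outside_link, while the two select-based lists contain only the links under ul.drop. The stricter `ul.drop a[href]` form is one option if other elements with an href attribute could appear inside the list.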