How do I extract multiple links from within the same class in Python?

Time: 2019-06-03 11:47:46

Tags: python beautifulsoup

I want to extract all the links inside the same div class from the following markup:

<div class='page-numbers clearfix'><span class='current'>
1</span><a href='https://www.example.com/blog/author/abc/page/2/' class='inactive'>
2</a><a href='https://www.example.com/blog/author/abc/page/3/' class='inactive'>
3</a><a href='https://www.example.com/blog/author/abc/page/4/' class='inactive'>
4</a></div>

I tried:

from bs4 import BeautifulSoup

html="<div class='page-numbers clearfix'><span class='current'>1</span><a href='https://www.example.com/blog/author/abc/page/2/' class='inactive'>2</a><a href='https://www.example.com/blog/author/abc/page/3/' class='inactive'>3</a><a href='https://www.example.com/blog/author/abc/page/4/' class='inactive'>4</a></div>"

soup = BeautifulSoup(html, "html.parser")
for i in soup.find_all('div', {'class': 'page-numbers clearfix'}):
    link= i.find('a', href=True)
    print(link['href'])

But this does not seem to work: only the first link is printed. The output I need is:

https://www.example.com/blog/author/abc/page/2/
https://www.example.com/blog/author/abc/page/3/
https://www.example.com/blog/author/abc/page/4/

5 answers:

Answer 0: (score: 2)

You also have to use find_all when looking up the a tags, because find returns only the first match. The code below works.

from bs4 import BeautifulSoup as bs

stra = """
<div class='page-numbers clearfix'><span class='current'>
1</span><a href='https://www.example.com/blog/author/abc/page/2/' class='inactive'>
2</a><a href='https://www.example.com/blog/author/abc/page/3/' class='inactive'>
3</a><a href='https://www.example.com/blog/author/abc/page/4/' class='inactive'>
4</a></div>
"""
soup = bs(stra, 'html.parser')
for i in soup.find_all('div', {'class': 'page-numbers clearfix'}):
    links = i.find_all('a', href=True)
    for link in links:
        print(link['href'])

Output:

https://www.example.com/blog/author/abc/page/2/
https://www.example.com/blog/author/abc/page/3/
https://www.example.com/blog/author/abc/page/4/
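
To make the difference between the two calls concrete, here is a minimal sketch that runs both on the same div. The shortened markup and the names first_link and all_links are only illustrative and are not part of the original answer.

```python
from bs4 import BeautifulSoup

# Shortened version of the pagination markup from the question.
html = (
    "<div class='page-numbers clearfix'><span class='current'>1</span>"
    "<a href='https://www.example.com/blog/author/abc/page/2/' class='inactive'>2</a>"
    "<a href='https://www.example.com/blog/author/abc/page/3/' class='inactive'>3</a>"
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", {"class": "page-numbers clearfix"})

# find() returns a single Tag: the first <a> that has an href attribute.
first_link = div.find("a", href=True)
print(first_link["href"])

# find_all() returns every matching <a>, so all hrefs can be collected.
all_links = [a["href"] for a in div.find_all("a", href=True)]
print(all_links)
```

This is why the loop in the question prints one URL per div rather than one URL per link.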

Answer 1: (score: 2)

Here is a possible (slightly shorter) variation on all the other good answers here:

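
As a rough sketch of what such a shorter variation might look like (an illustrative stand-in, not necessarily the answerer's own snippet), the lookup of the div and its links can be chained into a single list comprehension. It assumes the div is present in the page; the variable name links is arbitrary.

```python
from bs4 import BeautifulSoup

html = (
    "<div class='page-numbers clearfix'><span class='current'>1</span>"
    "<a href='https://www.example.com/blog/author/abc/page/2/' class='inactive'>2</a>"
    "<a href='https://www.example.com/blog/author/abc/page/3/' class='inactive'>3</a>"
    "<a href='https://www.example.com/blog/author/abc/page/4/' class='inactive'>4</a>"
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")

# Passing a single class name to class_ matches any tag carrying that class,
# so 'page-numbers' also matches class='page-numbers clearfix'.
links = [
    a["href"]
    for a in soup.find("div", class_="page-numbers").find_all("a", href=True)
]
print("\n".join(links))
```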

Answer 2: (score: 1)

This will give you a list of the links:

from bs4 import BeautifulSoup

html_doc = '''<div class='page-numbers clearfix'><span class='current'>
1</span><a href='https://www.example.com/blog/author/abc/page/2/' class='inactive'>
2</a><a href='https://www.example.com/blog/author/abc/page/3/' class='inactive'>
3</a><a href='https://www.example.com/blog/author/abc/page/4/' class='inactive'>
4</a></div>'''

soup = BeautifulSoup(html_doc, "lxml")
div = soup.find('div', attrs={'class': 'page-numbers clearfix'})
containers = div.find_all('a', attrs={'class': 'inactive'})
links = [c['href'] for c in containers]

Evaluating links then returns the three href values as a list.

Answer 3: (score: 0)

Try the following code.

from bs4 import BeautifulSoup

data='''<div class='page-numbers clearfix'><span class='current'>
1</span><a href='https://www.example.com/blog/author/abc/page/2/' class='inactive'>
2</a><a href='https://www.example.com/blog/author/abc/page/3/' class='inactive'>
3</a><a href='https://www.example.com/blog/author/abc/page/4/' class='inactive'>
4</a></div>'''


soup=BeautifulSoup(data,'html.parser')

# Find the pagination div first, then iterate over every <a> inside it.
container = soup.find('div', class_="page-numbers clearfix")
for link in container.find_all('a', href=True):
    print(link['href'])

Output:

https://www.example.com/blog/author/abc/page/2/
https://www.example.com/blog/author/abc/page/3/
https://www.example.com/blog/author/abc/page/4/
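
One caveat with this pattern: soup.find returns None when no matching div exists, and calling find_all on None raises an AttributeError. A minimal sketch of a guarded version follows; the helper name extract_page_links is hypothetical and not part of the original answer.

```python
from bs4 import BeautifulSoup


def extract_page_links(html):
    """Return the href of every <a> inside the pagination div, or [] if absent."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_="page-numbers clearfix")
    if container is None:
        # No pagination block on this page, so there is nothing to extract.
        return []
    return [a["href"] for a in container.find_all("a", href=True)]


sample = (
    "<div class='page-numbers clearfix'>"
    "<a href='https://www.example.com/blog/author/abc/page/2/' class='inactive'>2</a>"
    "</div>"
)
print(extract_page_links(sample))
print(extract_page_links("<p>no pagination here</p>"))
```

Returning an empty list keeps the caller's loop simple, at the cost of not distinguishing "no pagination" from "pagination without links".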

Answer 4: (score: 0)

You can use a CSS selector:

from bs4 import BeautifulSoup

data = '''<div class='page-numbers clearfix'><span class='current'>
1</span><a href='https://www.example.com/blog/author/abc/page/2/' class='inactive'>
2</a><a href='https://www.example.com/blog/author/abc/page/3/' class='inactive'>
3</a><a href='https://www.example.com/blog/author/abc/page/4/' class='inactive'>
4</a></div>'''

soup = BeautifulSoup(data, 'lxml')

for a in soup.select('div.page-numbers.clearfix a[href]'):
    print(a['href'])

Prints:

https://www.example.com/blog/author/abc/page/2/
https://www.example.com/blog/author/abc/page/3/
https://www.example.com/blog/author/abc/page/4/
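
One practical difference from the find_all approaches is worth sketching: matching on the string 'page-numbers clearfix' compares against the exact value of the class attribute, while the selector div.page-numbers.clearfix only requires that both classes are present, in any order. The markup below is a made-up variant with the class names swapped, and html.parser is used here only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical variant of the markup with the two class names in reverse order.
html = (
    "<div class='clearfix page-numbers'>"
    "<a href='https://www.example.com/blog/author/abc/page/2/' class='inactive'>2</a>"
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")

# Exact-string match on the class attribute: the order of the names matters.
exact = soup.find_all("div", {"class": "page-numbers clearfix"})

# CSS selector: both classes must be present, but their order does not matter.
by_selector = soup.select("div.page-numbers.clearfix a[href]")

print(len(exact))
print([a["href"] for a in by_selector])
```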