I want to extract all the links inside the same div class from the following code:
<div class='page-numbers clearfix'><span class='current'>
1</span><a href='https://www.example.com/blog/author/abc/page/2/' class='inactive'>
2</a><a href='https://www.example.com/blog/author/abc/page/3/' class='inactive'>
3</a><a href='https://www.example.com/blog/author/abc/page/4/' class='inactive'>
4</a></div>
I tried:
from bs4 import BeautifulSoup
html = "<div class='page-numbers clearfix'><span class='current'>1</span><a href='https://www.example.com/blog/author/abc/page/2/' class='inactive'>2</a><a href='https://www.example.com/blog/author/abc/page/3/' class='inactive'>3</a><a href='https://www.example.com/blog/author/abc/page/4/' class='inactive'>4</a></div>"
soup = BeautifulSoup(html, "html.parser")
for i in soup.find_all('div', {'class': 'page-numbers clearfix'}):
    link = i.find('a', href=True)
    print(link['href'])
But this does not seem to work. The output I need is:
https://www.example.com/blog/author/abc/page/2/
https://www.example.com/blog/author/abc/page/3/
https://www.example.com/blog/author/abc/page/4/
Answer 0: (score: 2)
You also have to use find_all when looking up the a tags, because find returns only the first match. The code below works.
from bs4 import BeautifulSoup as bs
stra = """
<div class='page-numbers clearfix'><span class='current'>
1</span><a href='https://www.example.com/blog/author/abc/page/2/' class='inactive'>
2</a><a href='https://www.example.com/blog/author/abc/page/3/' class='inactive'>
3</a><a href='https://www.example.com/blog/author/abc/page/4/' class='inactive'>
4</a></div>
"""
soup = bs(stra, 'html.parser')
for i in soup.find_all('div', {'class': 'page-numbers clearfix'}):
    # find_all returns every matching <a>, not just the first one
    links = i.find_all('a', href=True)
    for link in links:
        print(link['href'])
Output:
https://www.example.com/blog/author/abc/page/2/
https://www.example.com/blog/author/abc/page/3/
https://www.example.com/blog/author/abc/page/4/
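To see why the attempt in the question printed only one link, it may help to run find and find_all side by side on the same div. This is a minimal sketch; the names first_link and all_links are made up for illustration, and html.parser is used so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

html = (
    "<div class='page-numbers clearfix'><span class='current'>1</span>"
    "<a href='https://www.example.com/blog/author/abc/page/2/' class='inactive'>2</a>"
    "<a href='https://www.example.com/blog/author/abc/page/3/' class='inactive'>3</a>"
    "<a href='https://www.example.com/blog/author/abc/page/4/' class='inactive'>4</a></div>"
)
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", {"class": "page-numbers clearfix"})

# find() returns a single Tag: only the first matching <a> in the div.
first_link = div.find("a", href=True)
print(first_link["href"])

# find_all() returns every matching <a> in the div.
all_links = [a["href"] for a in div.find_all("a", href=True)]
print(all_links)
```

The first print shows only the page/2 URL, while the list contains all three.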
Answer 1: (score: 2)
Here is a possible (slightly shorter) variation on all the other good answers here:
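One way such a shorter variation could look is a single list comprehension that chains find and find_all. This is only a sketch under that assumption, not necessarily the code this answer had in mind; matching on the single class page-numbers is a choice made here for brevity.

```python
from bs4 import BeautifulSoup

html = (
    "<div class='page-numbers clearfix'><span class='current'>1</span>"
    "<a href='https://www.example.com/blog/author/abc/page/2/' class='inactive'>2</a>"
    "<a href='https://www.example.com/blog/author/abc/page/3/' class='inactive'>3</a>"
    "<a href='https://www.example.com/blog/author/abc/page/4/' class='inactive'>4</a></div>"
)
soup = BeautifulSoup(html, "html.parser")

# class_="page-numbers" matches any div that has page-numbers among its classes.
links = [a["href"] for a in soup.find("div", class_="page-numbers").find_all("a", href=True)]
print("\n".join(links))
```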
Answer 2: (score: 1)
This will give you a list of links:
from bs4 import BeautifulSoup
html_doc = '''<div class='page-numbers clearfix'><span class='current'>
1</span><a href='https://www.example.com/blog/author/abc/page/2/' class='inactive'>
2</a><a href='https://www.example.com/blog/author/abc/page/3/' class='inactive'>
3</a><a href='https://www.example.com/blog/author/abc/page/4/' class='inactive'>
4</a></div>'''
soup = BeautifulSoup(html_doc, "lxml")
div = soup.find('div', attrs={'class': 'page-numbers clearfix'})
containers = div.find_all('a', attrs={'class': 'inactive'})
links = [c['href'] for c in containers]
Afterwards, links holds the three href values as a list of strings.
Answer 3: (score: 0)
Try the following code.
from bs4 import BeautifulSoup
data='''<div class='page-numbers clearfix'><span class='current'>
1</span><a href='https://www.example.com/blog/author/abc/page/2/' class='inactive'>
2</a><a href='https://www.example.com/blog/author/abc/page/3/' class='inactive'>
3</a><a href='https://www.example.com/blog/author/abc/page/4/' class='inactive'>
4</a></div>'''
soup=BeautifulSoup(data,'html.parser')
div = soup.find('div', class_="page-numbers clearfix")
# use a separate loop variable so the div is not shadowed
for a in div.find_all('a', href=True):
    print(a['href'])
Output:
https://www.example.com/blog/author/abc/page/2/
https://www.example.com/blog/author/abc/page/3/
https://www.example.com/blog/author/abc/page/4/
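One thing to watch with this pattern: soup.find returns None when no matching div exists, so calling find_all on the result would raise an AttributeError on a page without pagination. A minimal sketch of a guard follows; the helper name extract_links and its css_class parameter are hypothetical, introduced only for illustration.

```python
from bs4 import BeautifulSoup


def extract_links(html, css_class="page-numbers"):
    """Return the href of every <a> inside the first div with css_class."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", class_=css_class)
    if div is None:
        # No such div on this page: return an empty list instead of failing.
        return []
    return [a["href"] for a in div.find_all("a", href=True)]


with_pager = extract_links("<div class='page-numbers clearfix'><a href='/page/2/'>2</a></div>")
without_pager = extract_links("<p>no pagination here</p>")
print(with_pager)
print(without_pager)
```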
Answer 4: (score: 0)
You can use a CSS selector:
from bs4 import BeautifulSoup
data = '''<div class='page-numbers clearfix'><span class='current'>
1</span><a href='https://www.example.com/blog/author/abc/page/2/' class='inactive'>
2</a><a href='https://www.example.com/blog/author/abc/page/3/' class='inactive'>
3</a><a href='https://www.example.com/blog/author/abc/page/4/' class='inactive'>
4</a></div>'''
soup = BeautifulSoup(data, 'lxml')
for a in soup.select('div.page-numbers.clearfix a[href]'):
    print(a['href'])
Prints:
https://www.example.com/blog/author/abc/page/2/
https://www.example.com/blog/author/abc/page/3/
https://www.example.com/blog/author/abc/page/4/
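A possible reason to prefer the selector: passing the full string 'page-numbers clearfix' to find or find_all matches the class attribute as an exact string, so it depends on the order of the classes in the markup, while the selector div.page-numbers.clearfix does not. The sketch below uses a made-up snippet with the two classes swapped to show the difference.

```python
from bs4 import BeautifulSoup

# Same classes as in the question, but in the opposite order (illustrative markup).
html = "<div class='clearfix page-numbers'><a href='/page/2/'>2</a></div>"
soup = BeautifulSoup(html, "html.parser")

# Exact-string match on the class attribute: the swapped order is not found.
exact = soup.find("div", {"class": "page-numbers clearfix"})
print(exact)

# The CSS selector only requires that both classes are present, in any order.
selected = [a["href"] for a in soup.select("div.page-numbers.clearfix a[href]")]
print(selected)
```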