I need to get every href present in the 'a' tags and assign them to a variable. I did this, but only got the first link:
soup_level1 = BeautifulSoup(driver.page_source, 'lxml')
userName = soup_level1.find(class_='_32mo')
link1 = (userName.get('href'))
And the output I get is:
print(link1)
https://www.facebook.com/xxxxxx?ref=br_rs
But I need at least the top 3 or top 5 links. The structure of the webpage is:
<div>
<a class="_32mo" href="https://www.facebook.com/xxxxx?ref=br_rs">
</div>
<div>
<a class="_32mo" href="https://www.facebook.com/yyyyy?ref=br_rs">
</div>
<div>
<a class="_32mo" href="https://www.facebook.com/zzzzz?ref=br_rs">
</div>
I need those hrefs
Answer (score: 2)
find() returns only the first matching element; use find_all() to collect every matching 'a' tag:
from bs4 import BeautifulSoup
html = """
<div>
<a class="_32mo" href="https://www.facebook.com/xxxxx?ref=br_rs">
</div>
<div>
<a class="_32mo" href="https://www.facebook.com/yyyyy?ref=br_rs">
</div>
<div>
<a class="_32mo" href="https://www.facebook.com/zzzzz?ref=br_rs">
</div>
"""
soup = BeautifulSoup(html, 'lxml')
# find_all() returns a list of every matching tag, unlike find(), which stops at the first
my_links = soup.find_all("a", {"class": "_32mo"})
for link in my_links:
    print(link.get('href'))
Output:
https://www.facebook.com/xxxxx?ref=br_rs
https://www.facebook.com/yyyyy?ref=br_rs
https://www.facebook.com/zzzzz?ref=br_rs
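If you prefer CSS selectors, select() returns the same list of tags (a minimal alternative sketch using the same soup as above):
my_links = soup.select("a._32mo")
for link in my_links:
    print(link.get('href'))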
To get only the top n links, you can slice the list:
max_num_of_links = 2
for link in my_links[:max_num_of_links]:
    print(link.get('href'))
Output:
https://www.facebook.com/xxxxx?ref=br_rs
https://www.facebook.com/yyyyy?ref=br_rs
You can also save the top n links into a list:
link_list = []
max_num_of_links = 2
for link in my_links[:max_num_of_links]:
    link_list.append(link.get('href'))
print(link_list)
Output:
['https://www.facebook.com/xxxxx?ref=br_rs', 'https://www.facebook.com/yyyyy?ref=br_rs']
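A list comprehension builds the same list more concisely (a sketch using the same my_links and max_num_of_links as above):
link_list = [link.get('href') for link in my_links[:max_num_of_links]]
print(link_list)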
Edit:
If you need the driver to visit each link one by one:
max_num_of_links = 3
for link in my_links[:max_num_of_links]:
    driver.get(link.get('href'))
    # rest of your code ...
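Applied back to the original Selenium code from the question, that would look roughly like this (a sketch assuming the same driver; since the hrefs are plain strings parsed out of page_source, they remain usable even after driver.get() navigates to a new page):
soup_level1 = BeautifulSoup(driver.page_source, 'lxml')
hrefs = [a.get('href') for a in soup_level1.find_all('a', class_='_32mo')]
for href in hrefs[:3]:  # top 3 links
    driver.get(href)
    # process each page here ...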
If for some reason you want separate variables instead (e.g. link1, link2, etc.):
from bs4 import BeautifulSoup
html = """
<div>
<a class="_32mo" href="https://www.facebook.com/xxxxx?ref=br_rs">
</div>
<div>
<a class="_32mo" href="https://www.facebook.com/yyyyy?ref=br_rs">
</div>
<div>
<a class="_32mo" href="https://www.facebook.com/zzzzz?ref=br_rs">
</div>
"""
soup = BeautifulSoup(html, 'lxml')
my_links = soup.find_all("a", {"class": "_32mo"})
link1 = my_links[0].get('href')
link2 = my_links[1].get('href')
link3 = my_links[2].get('href')
# and so on, but be careful: accessing an index that has no link raises an IndexError
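If you want named variables without risking that IndexError, a small guard like this works (just a sketch, reusing the same my_links as above):
hrefs = [a.get('href') for a in my_links]
if len(hrefs) >= 3:
    # unpack the first three hrefs into separate names
    link1, link2, link3 = hrefs[:3]
else:
    print("found fewer than 3 links")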