How to get all hrefs (in `<a>` tags) and assign them to a variable?

Date: 2019-01-05 15:17:45

Tags: python selenium web-scraping beautifulsoup

="" I need all hrefs present in 'a' tag and assign it to a variable I did this, but only got first link

soup_level1 = BeautifulSoup(driver.page_source, 'lxml')
userName = soup_level1.find(class_='_32mo')
link1 = userName.get('href')

The output I get is:

print(link1)
https://www.facebook.com/xxxxxx?ref=br_rs

But I need at least the top 3 or top 5 links. The structure of the webpage is:

  <div>
    <a class="_32mo" href="https://www.facebook.com/xxxxx?ref=br_rs"></a>
  </div>
  <div>
    <a class="_32mo" href="https://www.facebook.com/yyyyy?ref=br_rs"></a>
  </div>
  <div>
    <a class="_32mo" href="https://www.facebook.com/zzzzz?ref=br_rs"></a>
  </div>

I need those hrefs

1 Answer:

Answer 0 (score: 2)

from bs4 import BeautifulSoup
html="""
<div>
  <a class="_32mo" href="https://www.facebook.com/xxxxx?ref=br_rs">`
  </div>
   <div>
  <a class="_32mo" href="https://www.facebook.com/yyyyy?ref=br_rs">`
  </div>
  <div>
  <a class="_32mo" href="https://www.facebook.com/zzzzz?ref=br_rs">`
  </div>
  """
soup=BeautifulSoup(html,'lxml')
my_links = soup.findAll("a", {"class": "_32mo"})
for link in my_links:
    print(link.get('href'))

Output

https://www.facebook.com/xxxxx?ref=br_rs
https://www.facebook.com/yyyyy?ref=br_rs
https://www.facebook.com/zzzzz?ref=br_rs
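As an aside (not part of the original answer): since the question asks for all hrefs "assigned to a variable", the loop above can be collapsed into a single list comprehension. A minimal sketch, using the stdlib `html.parser` backend instead of `lxml` (either works for this snippet):

```python
from bs4 import BeautifulSoup

html = """
<div><a class="_32mo" href="https://www.facebook.com/xxxxx?ref=br_rs"></a></div>
<div><a class="_32mo" href="https://www.facebook.com/yyyyy?ref=br_rs"></a></div>
<div><a class="_32mo" href="https://www.facebook.com/zzzzz?ref=br_rs"></a></div>
"""

soup = BeautifulSoup(html, 'html.parser')
# One expression: the href of every matching <a> tag, in document order
links = [a.get('href') for a in soup.find_all("a", class_="_32mo")]
print(links)
```

With the live page you would pass `driver.page_source` instead of the inline `html` string.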

To get the first n links, you can use:

max_num_of_links = 2
for link in my_links[:max_num_of_links]:
    print(link.get('href'))

Output

https://www.facebook.com/xxxxx?ref=br_rs
https://www.facebook.com/yyyyy?ref=br_rs

You can also save the first n links into a list:

link_list = []
max_num_of_links = 2
for link in my_links[:max_num_of_links]:
    link_list.append(link.get('href'))
print(link_list)

Output

['https://www.facebook.com/xxxxx?ref=br_rs', 'https://www.facebook.com/yyyyy?ref=br_rs']

Edit:

If you need the driver to visit the links one by one:

max_num_of_links = 3
for link in my_links[:max_num_of_links]:
    driver.get(link.get('href'))
    # rest of your code ...

If, for some reason, you want separate variables (link1, link2, and so on):

from bs4 import BeautifulSoup
html = """
<div>
  <a class="_32mo" href="https://www.facebook.com/xxxxx?ref=br_rs"></a>
</div>
<div>
  <a class="_32mo" href="https://www.facebook.com/yyyyy?ref=br_rs"></a>
</div>
<div>
  <a class="_32mo" href="https://www.facebook.com/zzzzz?ref=br_rs"></a>
</div>
"""
soup = BeautifulSoup(html, 'lxml')
my_links = soup.find_all("a", class_="_32mo")
link1 = my_links[0].get('href')
link2 = my_links[1].get('href')
link3 = my_links[2].get('href')
# and so on, but be careful here: you don't want to try to access a link
# which is not there, or you'll get an IndexError
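One way to sidestep that IndexError (my suggestion, not from the answer) is to pad the list before unpacking, so missing links come back as `None` instead of raising:

```python
from bs4 import BeautifulSoup

# Deliberately only two links here, to show the padding behaviour
html = """
<div><a class="_32mo" href="https://www.facebook.com/xxxxx?ref=br_rs"></a></div>
<div><a class="_32mo" href="https://www.facebook.com/yyyyy?ref=br_rs"></a></div>
"""

soup = BeautifulSoup(html, 'html.parser')
hrefs = [a.get('href') for a in soup.find_all("a", class_="_32mo")]

# Pad with None, then slice, so unpacking never raises IndexError
link1, link2, link3 = (hrefs + [None] * 3)[:3]
```

Here `link3` is simply `None`, which you can check for before using it.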