Question

I have a list of hyperlinks in the format <a href="/linkaddress"></a>.

Unfortunately, the href does not contain the full address, so I would like to prepend the beginning of the web address by concatenating two strings. I have some code that looks like this:

import requests
from bs4 import BeautifulSoup

r_2 = requests.get('http://www.website.com/linkaddress/')

soup = BeautifulSoup(r_2.text, 'html.parser')

links = soup.find_all('a')

links_list = []

for link in links:
    links_list.append(link)

link_end = links_list[9:-4]
# select information between 9th position and 4th last position 
link_start = 'http://www.website.com/'
master_links = link_start + link_end

print master_links

The problem is that I cannot slice out just the link address, because each entry is not actually a string; it is a bs4.element.Tag. Is there a way to select only the link address from each entry in the list 'links_list', or do I have to convert it into a string first?

Answer 1

Actually, you don't need to go through attrs; you can index the tag directly: link['href']

attrs is useful when you aren't sure whether a given tag has an href attribute:

if 'href' in link.attrs:
    print(link['href'])
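
As a rough sketch of the same check done another way: Tag.get() returns None when the attribute is missing, whereas link['href'] would raise a KeyError. The HTML snippet here is made up for illustration and is not the asker's page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second anchor has no href attribute.
html = '<a href="/linkaddress">one</a><a name="top">two</a>'
soup = BeautifulSoup(html, 'html.parser')

hrefs = []
for link in soup.find_all('a'):
    # get() returns None instead of raising KeyError for a missing attribute.
    href = link.get('href')
    if href is not None:
        hrefs.append(href)

print(hrefs)
```

Either form of the check works; get() just saves the separate membership test.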

Answer 2

Each node has an attribute 'attrs', which is a Python dictionary containing all the attributes defined on that node.

So, the address can be retrieved as:

link.attrs['href']
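
Applied to the goal in the question, this could look something like the sketch below. It assumes the base address http://www.website.com/ from the question and uses a made-up HTML snippet in place of the downloaded page. Note that it swaps plain string concatenation for urljoin from the standard library, which is my substitution rather than anything the question asked for; it avoids a doubled slash when the href already starts with '/'.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

link_start = 'http://www.website.com/'

# Hypothetical markup standing in for r_2.text from the question.
html = '<a href="/linkaddress">first</a><a href="/other/page">second</a>'
soup = BeautifulSoup(html, 'html.parser')

# href=True restricts the search to anchors that actually have an href,
# so the attrs lookup below cannot fail.
master_links = [
    urljoin(link_start, link.attrs['href'])
    for link in soup.find_all('a', href=True)
]

for url in master_links:
    print(url)
```

No conversion to a string is needed, since the value stored under 'href' is already a string.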

How do I choose a part of bs4.element.Tag, do I have to convert it into a string?
