import requests
from bs4 import BeautifulSoup
url = 'https://joboutlook.gov.au/A-Z'
r = requests.get(url)
c = r.content
soup = BeautifulSoup(c, 'html.parser')
urls = []
h4s = soup.find_all('h4')
for h4 in h4s:
    a = h4.find('a')
    print(a)
    href = a['href']
    print(href)
    new_url = f'https://joboutlook.gov.au/{href}'
    print(new_url)
    urls.append(new_url)
urls
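The prints all work: print(a) shows all the 'a' tags, print(href) shows all the hrefs, and print(new_url) shows all the new URLs!
But I still get TypeError: 'NoneType' object is not subscriptable
, and nothing ends up in the urls list.
If I change it to a.get('href')
, I get this instead: AttributeError: 'NoneType' object has no attribute 'get'
(It isn't actually Google, just FYI.)
This is probably something simple, but I can't figure it out.
Thanks!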
Answer 0 (score: 1)
Add a condition: only when the h4 actually contains an anchor tag, read its href
and append it.
import requests
from bs4 import BeautifulSoup
soup=BeautifulSoup(requests.get("https://joboutlook.gov.au/A-Z").text,'html.parser')
urls = []
h4s = soup.find_all('h4')
for h4 in h4s:
    a = h4.find('a')
    if a:
        href = a['href']
        #print(href)
        new_url = 'https://joboutlook.gov.au/{}'.format(href)
        #print(new_url)
        urls.append(new_url)
print(urls)
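To see why the guard matters: h4.find('a') returns None for any h4 that has no a child, so the loop prints fine for every heading before that one and then fails on a['href'] or a.get('href'). In the question's code, the last thing printed before the traceback should be None. Below is a minimal sketch of the same loop run on inline HTML instead of the live page. The markup and href values are made up for illustration and are not taken from joboutlook.gov.au.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second <h4> has no <a> child, which is the
# situation that produces the 'NoneType' errors in the question.
html = """
<h4><a href="occupation?id=1">Accountants</a></h4>
<h4>Section heading without a link</h4>
<h4><a href="occupation?id=2">Actors</a></h4>
"""

soup = BeautifulSoup(html, "html.parser")

urls = []
for h4 in soup.find_all("h4"):
    a = h4.find("a")      # None when this <h4> contains no <a>
    if a is None:
        continue          # skip headings that are not links
    href = a.get("href")  # None if the <a> has no href attribute
    if href:
        urls.append(f"https://joboutlook.gov.au/{href}")

print(urls)
```

Using a.get("href") after the None check also covers an a tag that lacks an href, which a["href"] would turn into a KeyError.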
Answer 1 (score: 1)
Switch to a CSS selector that only matches direct children of an h4
that have an href
attribute.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://joboutlook.gov.au/A-Z')
soup = bs(r.content, 'lxml')
links = [f'https://joboutlook.gov.au/{item["href"]}' for item in soup.select('h4 > [href]')]
Alternatively, you could assume that every a
tag has an href
(slightly faster and less robust, but probably fine):
links = [f'https://joboutlook.gov.au/{item["href"]}' for item in soup.select('h4 > a')]
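A side note on building the final URL: both answers glue the href onto the base with string formatting, which yields a double slash if an href starts with "/" and a broken link if it is already absolute. The format of the hrefs on the page is not shown in this thread, so the values below are hypothetical. As a rough sketch, urllib.parse.urljoin from the standard library handles all three cases:

```python
from urllib.parse import urljoin

BASE = "https://joboutlook.gov.au/"

def absolute(href):
    """Resolve an href against the site root (helper name is illustrative)."""
    return urljoin(BASE, href)

# Hypothetical href values, not taken from the real page.
print(absolute("occupation?id=1"))        # relative, no leading slash
print(absolute("/occupation?id=1"))       # relative, leading slash
print(absolute("https://example.com/x"))  # already absolute, left unchanged
```

With the selector version this would read links = [absolute(item["href"]) for item in soup.select('h4 > [href]')].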