Question

import requests
from bs4 import BeautifulSoup

url = 'https://joboutlook.gov.au/A-Z'

r = requests.get(url)
c = r.content
soup = BeautifulSoup(c, 'html.parser')

urls = []
h4s = soup.find_all('h4')
for h4 in h4s:
    a = h4.find('a')
    print(a)
    href = a['href']
    print(href)
    new_url = f'https://joboutlook.gov.au/{href}'
    print(new_url)
    urls.append(new_url)
urls

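All of the prints work: print(a) shows every 'a' tag, print(href) shows every href, and print(new_url) shows every new URL!

But I still get TypeError: 'NoneType' object is not subscriptable, and nothing is added to the urls list.

If I change it to a.get('href'), I get: AttributeError: 'NoneType' object has no attribute 'get'

(It's not actually Google, just FYI.)

This is probably something simple, but I can't figure it out.

Thanks!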

Answer 1

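Add a condition: if the h4 actually contains an anchor tag, get its href and append it. h4.find('a') returns None for any h4 without an anchor, and indexing None is what raises the error.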

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("https://joboutlook.gov.au/A-Z").text, 'html.parser')
urls = []
h4s = soup.find_all('h4')
for h4 in h4s:
    a = h4.find('a')
    # find() returns None when the <h4> has no <a>, so check before indexing
    if a:
        href = a['href']
        #print(href)
        new_url = 'https://joboutlook.gov.au/{}'.format(href)
        #print(new_url)
        urls.append(new_url)

print(urls)
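
To see why the guard matters without hitting the network, here is a minimal sketch on a made-up HTML snippet. The markup and the occupation/... hrefs are assumptions for illustration, not the real page content. The loop runs fine for headings that contain a link, which is why the prints look correct, and only fails on a heading with no anchor.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second <h4> has no <a> child.
html = """
<h4><a href="occupation/1">Accountants</a></h4>
<h4>Section heading</h4>
<h4><a href="occupation/2">Actors</a></h4>
"""

soup = BeautifulSoup(html, "html.parser")

urls = []
for h4 in soup.find_all("h4"):
    a = h4.find("a")  # None when the <h4> contains no <a>
    if a is None:
        continue  # skip headings without a link instead of crashing
    urls.append(f"https://joboutlook.gov.au/{a['href']}")

print(urls)
```

Without the `if a is None` check, the second heading would raise the same TypeError as in the question.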

Answer 2

Switch to a CSS selector that matches only the direct children of an h4 that have an href attribute.

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://joboutlook.gov.au/A-Z')
soup = bs(r.content, 'lxml')
links = [f'https://joboutlook.gov.au/{item["href"]}' for item in soup.select('h4 > [href]')]

Alternatively, you can assume that every a tag has an href (slightly faster and less robust, but probably fine):

links = [f'https://joboutlook.gov.au/{item["href"]}' for item in soup.select('h4 > a')]
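
The difference between the two selectors can be checked on a small made-up snippet. This is only a sketch: the markup is hypothetical, 'html.parser' is used so that lxml is not required, and urljoin from the standard library is my substitution for the f-string, since it also copes with hrefs that start with a slash.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE = "https://joboutlook.gov.au/"

# Hypothetical markup: one heading has no link, one has an <a> without href.
html = """
<h4><a href="occupation/1">Accountants</a></h4>
<h4>Section heading</h4>
<h4><a name="top">Anchor without href</a></h4>
<h4><a href="/occupation/2">Actors</a></h4>
"""

soup = BeautifulSoup(html, "html.parser")

with_href = soup.select("h4 > [href]")  # only children that carry an href
all_anchors = soup.select("h4 > a")     # every <a> child, href or not

# Safe: every matched element is guaranteed to have an href attribute.
links = [urljoin(BASE, item["href"]) for item in with_href]

print(len(with_href), len(all_anchors))
print(links)
```

With 'h4 > a', indexing item["href"] on the anchor that has no href would raise a KeyError, which is the robustness trade-off mentioned above.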

BeautifulSoup: "TypeError / AttributeError: 'NoneType'"
