I'm writing some Python code to scrape meeting PDFs from this website: https://www.gmcameetings.co.uk. The PDF links live inside links, which in turn live inside other links. I've got the first set of links off the page above, and now I need to scrape links from within the new URLs. When I do this, I get the following error:
AttributeError: ResultSet object has no attribute 'find_all'. You're
probably treating a list of items like a single item. Did you call
find_all() when you meant to call find()?
Here is my code so far, which checks out fine in a Jupyter notebook:
# importing libraries and defining
import requests
import urllib.request
import time
from bs4 import BeautifulSoup as bs

# set url
url = "https://www.gmcameetings.co.uk/"

# grab html
r = requests.get(url)
page = r.text
soup = bs(page, 'lxml')

# creating folder to store pdfs - if it doesn't exist, create a separate folder
folder_location = r'E:\Internship\WORK'

# getting all meeting hrefs off url
meeting_links = soup.find_all('a', href='TRUE')
for link in meeting_links:
    print(link['href'])
    if link['href'].find('/meetings/') > 1:
        print("Meeting!")
Here is the line that then gets the error:
second_links = meeting_links.find_all('a', href='TRUE')
I've tried find() as Python suggested, but that doesn't work either. I do understand, though, that it can't treat meeting_links as a single item.
So basically: how do you search for links within each bit of the new variable (meeting_links)?
Once I have the second set of URLs I already have code to fetch the PDFs, which seems to work fine, but obviously I need to get them first. Hopefully this makes sense and I've explained things well enough - I only properly started using Python on Monday, so I'm a complete beginner.
Answer 0 (score: 1)
To get all the meeting links, try:
from bs4 import BeautifulSoup as bs
import requests

# set url
url = "https://www.gmcameetings.co.uk/"

# grab html
r = requests.get(url)
page = r.text
soup = bs(page, 'lxml')

# Scrape to find all links
all_links = soup.find_all('a', href=True)

# Loop through links to find those containing '/meetings/'
meeting_links = []
for link in all_links:
    href = link['href']
    if '/meetings/' in href:
        meeting_links.append(href)
print(meeting_links)
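That gives you the first set of URLs. To then search each meeting page for its own links (the step that raised your AttributeError), loop over the list and make a new request per URL; find_all() exists on a BeautifulSoup object, not on the ResultSet it returns. A minimal sketch, assuming the hrefs may be relative (urljoin resolves them against the base URL):

from urllib.parse import urljoin

second_links = []
for meeting_url in meeting_links:
    # request each meeting page and parse it into its own soup
    meeting_page = requests.get(urljoin(url, meeting_url))
    meeting_soup = bs(meeting_page.text, 'lxml')
    # find_all works here because meeting_soup is a single BeautifulSoup object
    for a in meeting_soup.find_all('a', href=True):
        second_links.append(a['href'])
print(second_links)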
The .find() you used in your original code is a BeautifulSoup method when called on a tag object; called on a plain string it is str.find(), which returns an index rather than a boolean. To check for a substring inside a string, just use native Python: 'a' in 'abcd'.
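Applied to your loop, the condition might look like this (a sketch reusing the soup from above):

for link in soup.find_all('a', href=True):
    if '/meetings/' in link['href']:  # substring test with the in operator
        print("Meeting!")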
Hope that helps!