AttributeError: treating a list of items like a single item: how to find links within each "link" on a scraped page

Date: 2019-07-11 15:22:02

Tags: python web-scraping beautifulsoup python-requests

I am writing Python code to scrape meeting PDFs from this website: https://www.gmcameetings.co.uk. The PDF links sit inside links, which are themselves inside other links. I have the first set of links from the page above, and next I need to scrape the links inside the new URLs. When I do this, I get the following error:

AttributeError: ResultSet object has no attribute 'find_all'. You're 
probably treating a list of items like a single item. Did you call 
find_all() when you meant to call find()?

This is my code so far, which checks out fine in a Jupyter notebook:

# import libraries
import requests
import urllib.request
import time 
from bs4 import BeautifulSoup as bs

# set url
url = "https://www.gmcameetings.co.uk/" 

# grab html 
r = requests.get(url)
page = r.text
soup = bs(page,'lxml')

# folder to store PDFs (create a separate folder if it does not exist)
folder_location = r'E:\Internship\WORK'

# get all meeting hrefs from the page
meeting_links = soup.find_all('a', href=True)
for link in meeting_links:
    print(link['href'])
    if link['href'].find('/meetings/')>1:
        print("Meeting!") 

This is the line that then produces the error:

second_links = meeting_links.find_all('a', href='TRUE')

I have tried find(), as Python suggests, but that doesn't work either. I understand that it can't treat meeting_links as a single item.
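For illustration, a minimal sketch of why this error occurs (the tiny HTML snippet below is made up for the demo): find_all() returns a ResultSet, which behaves like a list of Tag objects, so find_all() must be called on each element (or on a new soup), never on the ResultSet itself.

```python
from bs4 import BeautifulSoup as bs

# 'html.parser' is used here only so the demo runs without lxml installed
soup = bs('<a href="x">1</a><a href="y">2</a>', 'html.parser')
links = soup.find_all('a')       # a ResultSet: behaves like a list of Tags
print(type(links).__name__)      # ResultSet
print(links[0]['href'])          # each element is a Tag and supports lookups
# links.find_all('a')            # AttributeError: call find_all on a Tag or soup, not the list
```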

So, basically: how do you search for links within each element of the new variable (meeting_links)?

Once I have the second set of URLs, I already have code to fetch the PDFs, which seems to work fine, but obviously I need to get those URLs first. I hope this makes sense and that I've explained it well. I only properly started using Python on Monday, so I'm a complete beginner.

1 Answer:

Answer 0 (score: 1)

To get all the meeting links, try:

from bs4 import BeautifulSoup as bs
import requests

# set url
url = "https://www.gmcameetings.co.uk/" 

# grab html 
r = requests.get(url)
page = r.text
soup = bs(page,'lxml')

# Scrape to find all links
all_links = soup.find_all('a', href=True)

# Loop through links to find those containing '/meetings/'
meeting_links = []
for link in all_links:
    href = link['href']
    if '/meetings/' in href:
        meeting_links.append(href)
print(meeting_links)
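From there, the second level of scraping is one request per meeting URL, each with its own soup. Below is a sketch of that step; the helper name extract_pdf_links and the assumption that the PDF anchors end in .pdf are mine, not from the question, and 'html.parser' is used so the demo runs without lxml.

```python
from bs4 import BeautifulSoup as bs

def extract_pdf_links(html):
    """Return hrefs of all anchors pointing at .pdf files in the given HTML."""
    soup = bs(html, 'html.parser')
    return [a['href'] for a in soup.find_all('a', href=True)
            if a['href'].lower().endswith('.pdf')]

# Demonstration on an inline snippet standing in for a fetched meeting page
sample = '<a href="agenda.pdf">Agenda</a><a href="/meetings/">More</a>'
print(extract_pdf_links(sample))  # ['agenda.pdf']

# In the real script, fetch each meeting page first, then extract:
# for meeting_url in meeting_links:
#     page = requests.get(meeting_url).text
#     pdfs = extract_pdf_links(page)
```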

The .find() function you used in your original code is specific to BeautifulSoup objects (on a plain string, str.find returns an index instead). To check for a substring within a string, just use native Python: 'a' in 'abcd'.
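A quick comparison of the two string checks (the example URL is just illustrative):

```python
href = "https://www.gmcameetings.co.uk/meetings/2019-july"

# str.find returns the index of the substring, or -1 if it is absent
print(href.find('/meetings/'))   # 30

# `in` returns a plain boolean, which reads better in a condition
print('/meetings/' in href)      # True
```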

Hope that helps!