Question

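I am scraping a lot of PDFs of committee meetings from a local government website (https://www.gmcameetings.co.uk/), so there are links within links. I can successfully scrape all the 'a' tags from the main area of the page (the ones I want), but when I try to scrape anything within them I get the error in the question title: AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()? How do I fix this?

I am completely new to coding and started an internship yesterday, for which I am expected to do this web scraping. The woman I am supposed to be working with is away for a couple of days and nobody else can help me, so please bear with me and be kind; I am a complete beginner doing this alone. I know the first part of the code is set up correctly, because I can download the whole page or download any specific link. Again, it is when I try to scrape within the links I have already (successfully) scraped that I get the error message above. I think (with my very limited knowledge) it is because of the 'output' of all_links, part of which is included at the end of this question. I have tried both find() and findAll(), and both produce the same error message.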

 #the error message
 date_links_area = all_links.find('ul',{"class":"item-list item-list--rich"})
 Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
 File "C:\Users\rache\AppData\Local\Programs\Python\Python37-32\lib\site-packages\bs4\element.py", line 1620, in __getattr__
 "ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key
 AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

#output of all_links looks like this (this is only part of it)

href="https://www.gmcameetings.co.uk/info/20180/live_meetings/199/membership_201819">Membership of the GMCA 2018/19 Constitution of the Greater Manchester Combined Authority, Meeting papers,

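Some of these links then go to a page with a list of dates, and that is the area of the page I am trying to get into. Within that area I need to get the links with the dates, and within those I need to grab the PDFs I want. Apologies if this does not make sense; I am trying my best to do this on my own with zero experience.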

Answer 1

This solution uses recursion to scrape the links on each page in turn until the PDF URLs are found:

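from bs4 import BeautifulSoup as soup
import requests

def scrape(url):
  # Yield every <a> tag that points at a download, following in-site links recursively.
  try:
    page = soup(requests.get(url).text, 'html.parser')
    # href=True skips anchors that have no href attribute
    for i in page.find('main', {'id':'content'}).find_all('a', href=True):
      if '/downloads/meeting/' in i['href'] or '/downloads/file/' in i['href']:
        yield i
      elif i['href'].startswith('https://www.gmcameetings.co.uk'):
        yield from scrape(i['href'])
  except Exception:
    # skip pages that fail to load or have no <main id="content"> element
    pass

urls = list(scrape('https://www.gmcameetings.co.uk/'))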
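
One thing the recursion above does not do is remember which pages it has already visited, and pages on a real site usually link back to each other. Here is a minimal sketch of the same idea with a visited set. It runs offline: SITE, fetch and crawl are names I made up for illustration, example.org stands in for the real site, and fetch stands in for requests.get(url).text. urljoin is used so that relative links are resolved against the current page.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical two-page site held in memory so the sketch runs without a network.
SITE = {
    "https://example.org/": (
        '<main id="content"><a href="/dates">Dates</a><a href="/">Home</a></main>'
    ),
    "https://example.org/dates": (
        '<main id="content"><a href="/">Home</a>'
        '<a href="/downloads/file/1/agenda.pdf">Agenda</a>'
        '<a href="/dates">Self</a></main>'
    ),
}

def fetch(url):
    """Stand-in for requests.get(url).text; returns '' for unknown pages."""
    return SITE.get(url, "")

def crawl(url, seen=None):
    """Yield absolute download URLs reachable from url, visiting each page once."""
    seen = set() if seen is None else seen
    if url in seen:
        return  # already visited, so a link cycle cannot recurse forever
    seen.add(url)
    main = BeautifulSoup(fetch(url), "html.parser").find("main", {"id": "content"})
    if main is None:
        return  # page missing or laid out differently
    for a in main.find_all("a", href=True):
        link = urljoin(url, a["href"])  # resolve relative hrefs
        if "/downloads/" in link:
            yield link
        elif link.startswith("https://example.org"):
            yield from crawl(link, seen)

pdfs = list(crawl("https://example.org/"))
print(pdfs)
```

Passing the same seen set down through every recursive call is what keeps the two pages from bouncing between each other.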

Answer 2

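The error is actually telling you what the problem is. all_links is a list (a ResultSet object) of the HTML elements that were found. You need to iterate over that list and call find on each item: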

sub_links = [link.find('ul', {"class": "item-list item-list--rich"}) for link in all_links]
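
To see the difference without touching the real site, here is a small self-contained sketch. The HTML and the class names in it ("panel", "item-list") are invented for the example, so adjust the selectors to whatever the real page uses. It shows that calling find() on the result of find_all() raises AttributeError, while calling it on each element works and returns None where nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example; the real site will differ.
html = """
<main id="content">
  <div class="panel"><ul class="item-list"><li><a href="/a.pdf">A</a></li></ul></div>
  <div class="panel"><p>No list here</p></div>
  <div class="panel"><ul class="item-list"><li><a href="/b.pdf">B</a></li></ul></div>
</main>
"""

page = BeautifulSoup(html, "html.parser")
panels = page.find_all("div", class_="panel")  # a ResultSet, i.e. a list of Tags

try:
    panels.find("ul")  # wrong: find() called on the whole list
    error_raised = False
except AttributeError:
    error_raised = True

# right: call find() on each Tag; it returns None when nothing matches
lists = [panel.find("ul", class_="item-list") for panel in panels]
found = [ul for ul in lists if ul is not None]
hrefs = [a["href"] for ul in found for a in ul.find_all("a", href=True)]
print(error_raised, hrefs)
```

Filtering out the None entries before going further avoids a second AttributeError on the elements that had no matching child.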
