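I have this code that crawls a site, extracts all the URLs I need from it, and formats them as required. At that point I want to add each URL to a set for further processing, but I get the following error: "AttributeError: 'NoneType' object has no attribute 'add'". The relevant part of the code is below.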
class Finder(bs4.BeautifulSoup):
    def __init__(self, m, page_url):
        super().__init__(m, 'html.parser')
        self.page_url = page_url
        self.pdf_url_links = set()

    def handle_starttag(self, name, namespace, nsprefix, attrs):
        if name == 'a':
            for (attributes, value) in attrs.items():
                if ('.pdf&') not in value: pass
                else:
                    list_of_links = search_queue_url(value)
                    print(list_of_links)
When I print the variable 'list_of_links' above, the following URLs are shown on my screen:
https://julianoliver.com/share/free-science-books/basic_math_and_algebra.pdf
https://www.math.ksu.edu/~dbski/writings/further.pdf
http://www.math.harvard.edu/~shlomo/docs/Advanced_Calculus.pdf
http://www.textbooksonline.tn.nic.in/Books/Std10/Std10-Maths-EM-1.pdf
http://www.corestandards.org/wp-content/uploads/Math_Standards.pdf
https://www.ets.org/s/gre/pdf/gre_math_review.pdf
https://www.math.ust.hk/~machas/differential-equations.pdf
However, when I try to add each of the above URLs to my set with the following code
self.pdf_url_links.add(list_of_links)
I get the following error:
AttributeError: 'NoneType' object has no attribute 'add'
Traceback (most recent call last):
  File "C:\Projects\BookScapie\BookScrapie\BookScrapie\link_finders.py", line 7, in __init__
    super().__init__(m, 'html.parser')
  File "C:\python-3.5.1.amd64\lib\site-packages\bs4\__init__.py", line 228, in __init__
    self._feed()
  File "C:\python-3.5.1.amd64\lib\site-packages\bs4\__init__.py", line 289, in _feed
    self.builder.feed(self.markup)
  File "C:\python-3.5.1.amd64\lib\site-packages\bs4\builder\_htmlparser.py", line 167, in feed
    parser.feed(markup)
  File "C:\python-3.5.1.amd64\lib\html\parser.py", line 111, in feed
    self.goahead(0)
  File "C:\python-3.5.1.amd64\lib\html\parser.py", line 171, in goahead
    k = self.parse_starttag(i)
  File "C:\python-3.5.1.amd64\lib\html\parser.py", line 345, in parse_starttag
    self.handle_starttag(tag, attrs)
  File "C:\python-3.5.1.amd64\lib\site-packages\bs4\builder\_htmlparser.py", line 65, in handle_starttag
    self.soup.handle_starttag(name, None, None, attr_dict)
  File "C:\Projects\BookScapie\BookScrapie\BookScrapie\link_finders.py", line 22, in handle_starttag
    self.pdf_url_links.add(list_of_links)
AttributeError: 'NoneType' object has no attribute 'add'
I would appreciate any help with what I am doing wrong. I am using Python 3.5.
Answer 0 (score: 3)
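Do not inherit from BeautifulSoup; instead, use aggregation and its query interface! The overridden handle_starttag is called from inside BeautifulSoup.__init__, before self.pdf_url_links has been initialized.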
The BeautifulSoup class has a __getattr__ that implements sub-tag navigation and returns None if no match is found; in this case there is no <pdf_url_links> tag in the soup, so self.pdf_url_links returns None.
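The attribute fallback described above can be reproduced without bs4. The following is a minimal sketch, assuming a hypothetical stand-in class `TagLike` (it is not bs4's real code): Python calls `__getattr__` only when normal attribute lookup fails, so an attribute that has not been assigned yet quietly comes back as None instead of raising.

```python
class TagLike:
    """Hypothetical stand-in for a soup object; this is not bs4's real code."""

    def __getattr__(self, name):
        # Python calls __getattr__ only after normal attribute lookup fails.
        if name.startswith('__'):
            raise AttributeError(name)
        return None  # mimic "no child tag with this name was found"


obj = TagLike()
before = obj.pdf_url_links   # never assigned, so __getattr__ answers with None
obj.pdf_url_links = set()    # now a real instance attribute exists
after = obj.pdf_url_links    # normal lookup succeeds; __getattr__ is skipped

print(before, after)
```

This is why the error mentions 'NoneType' rather than a missing attribute on Finder: the lookup did not fail, it returned None.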
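Instead, try something like this:
import bs4

def find_links(m):
    # Parse once, then query the tree instead of hooking into the parser.
    soup = bs4.BeautifulSoup(m, 'html.parser')
    links = set()
    for a in soup.find_all('a'):
        href = a.get('href')  # None if the tag has no href attribute
        if href and '.pdf&' in href:
            links.add(href)
    return links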
Answer 1 (score: 1)
The problem is caused by the way the BeautifulSoup constructor works.
class Finder(bs4.BeautifulSoup):
    def __init__(self, m, page_url):
        super().__init__(m, 'html.parser')
        self.page_url = page_url
        self.pdf_url_links = set()
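As soon as BeautifulSoup.__init__ is called, parsing begins, and that eventually calls the handle_starttag method.
At that point, the overridden version of handle_starttag tries to access self.pdf_url_links, which has not been initialized yet.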
The solution is to initialize everything that parsing needs before calling the super constructor:
def __init__(self, m, page_url):
    self.page_url = page_url
    self.pdf_url_links = set()
    super().__init__(m, 'html.parser')
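The ordering issue is not specific to bs4, so it can be illustrated with plain Python. This is a rough sketch under stated assumptions: `Base`, `Broken`, `Fixed`, `handle`, and `seen` are hypothetical names standing in for a class that, like the BeautifulSoup constructor, does its work inside `__init__` and calls an overridable method while doing so.

```python
class Base:
    """Hypothetical stand-in for a class that does its work inside __init__."""

    def __init__(self, items):
        for item in items:
            self.handle(item)  # dispatches to a subclass override, if any

    def handle(self, item):
        pass


class Broken(Base):
    def __init__(self, items):
        super().__init__(items)  # handle() runs here, before self.seen exists
        self.seen = set()

    def handle(self, item):
        self.seen.add(item)


class Fixed(Base):
    def __init__(self, items):
        self.seen = set()        # state is ready before the base class runs
        super().__init__(items)

    def handle(self, item):
        self.seen.add(item)


def broken_fails():
    """Return True if constructing Broken raises AttributeError."""
    try:
        Broken(['a', 'b'])
    except AttributeError:
        return True
    return False


print(broken_fails())
print(sorted(Fixed(['a', 'b']).seen))
```

With this plain base class the failure is an ordinary AttributeError about a missing attribute; with BeautifulSoup the __getattr__ fallback described in the other answer turns the same mistake into a None value, which then fails on .add.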