Question

I get the error

'NoneType' object has no attribute 'encode'

when I run this code:

url = soup.find('div',attrs={"class":"entry-content"}).findAll('div', attrs={"class":None})

fobj = open('D:\Scrapping\parveen_urls.txt', 'w')

for getting in url:
    fobj.write(getting.string.encode('utf8'))

But when I use find instead of findAll, I get a single URL. How can I get all the URLs from the result of findAll?

Answer 1

'NoneType' object has no attribute 'encode'

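You are using .string. If a tag has more than one child, .string is None (docs):

If a tag's only child is another tag, and that tag has a .string, then the parent tag is considered to have the same .string as its child.

Use .get_text() instead.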
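
The difference can be seen in a minimal sketch, assuming Beautiful Soup 4 (bs4) and the built-in html.parser. The sample markup and variable names are made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second inner div has two children
# (a <b> tag and a trailing text node), so its .string is None.
html = (
    '<div class="entry-content">'
    '<div>http://teste.com</div>'
    '<div><b>http://teste2.com</b> (mirror)</div>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")
outer = soup.find("div", attrs={"class": "entry-content"})
inner_divs = outer.find_all("div")

# .string is only defined when a tag has exactly one child
strings = [tag.string for tag in inner_divs]
# .get_text() concatenates all descendant text and never returns None
texts = [tag.get_text() for tag in inner_divs]

print(strings)
print(texts)
```

Calling .encode('utf8') on the second entry of strings would raise the reported AttributeError, while every entry of texts is a plain string.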

Answer 2

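Below I provide two examples and one possible solution:

Example 1 shows a working sample.
Example 2 shows a non-working sample that raises an error like the one you reported.
The solution shows one possible fix.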

Example 1: the HTML has the expected div

from bs4 import BeautifulSoup  # assumes Beautiful Soup 4

doc = ['<html><head><title>Page title</title></head>',
       '<body><div class="entry-content"><div>http://teste.com</div>',
       '<div>http://teste2.com</div></div></body>',
       '</html>']
soup = BeautifulSoup(''.join(doc), 'html.parser')
url = soup.find('div', attrs={"class": "entry-content"}).findAll('div', attrs={"class": None})
# Open in binary mode because .encode() returns bytes
with open(r'.\parveen_urls.txt', 'wb') as fobj:
    for getting in url:
        fobj.write(getting.string.encode('utf8'))

Example 2: the HTML does not have the expected div

from bs4 import BeautifulSoup  # assumes Beautiful Soup 4

doc = ['<html><head><title>Page title</title></head>',
       '<body><div class="entry"><div>http://teste.com</div>',
       '<div>http://teste2.com</div></div></body>',
       '</html>']
soup = BeautifulSoup(''.join(doc), 'html.parser')

# The error is raised on the next line: the first find() matches nothing
# and returns None, so calling findAll on that result raises
# AttributeError: 'NoneType' object has no attribute 'findAll'
url = soup.find('div', attrs={"class": "entry-content"}).findAll('div', attrs={"class": None})
with open(r'.\parveen_urls2.txt', 'wb') as fobj:
    for getting in url:
        fobj.write(getting.string.encode('utf8'))

Possible solution:

from bs4 import BeautifulSoup  # assumes Beautiful Soup 4

doc = ['<html><head><title>Page title</title></head>',
       '<body><div class="entry"><div>http://teste.com</div>',
       '<div>http://teste2.com</div></div></body>',
       '</html>']
soup = BeautifulSoup(''.join(doc), 'html.parser')
url = soup.find('div', attrs={"class": "entry-content"})

# Deal with documents that do not have the expected HTML structure:
# find() returns None when the container div is missing.
if url is not None:
    url = url.findAll('div', attrs={"class": None})
    with open(r'.\parveen_urls2.txt', 'wb') as fobj:
        for getting in url:
            fobj.write(getting.string.encode('utf8'))
else:
    print("The html source does not comply with expected structure")
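
On Python 3, the same guard can be combined with .get_text() and a file opened in text mode with an explicit encoding, so no manual .encode() call is needed. This is only a sketch, assuming Beautiful Soup 4. The function name extract_urls, the sample markup, and the output location are hypothetical:

```python
import os
import tempfile

from bs4 import BeautifulSoup


def extract_urls(html, container_class="entry-content"):
    """Return the text of each class-less div inside the container.

    Returns an empty list when the container div is missing.
    """
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", attrs={"class": container_class})
    if container is None:
        return []
    return [
        div.get_text(strip=True)
        for div in container.find_all("div")
        if not div.has_attr("class")
    ]


good = ('<body><div class="entry-content"><div>http://teste.com</div>'
        '<div>http://teste2.com</div></div></body>')
bad = '<body><div class="entry"><div>http://teste.com</div></div></body>'

urls = extract_urls(good)     # container present
missing = extract_urls(bad)   # container absent, so nothing to write

# Text mode with an explicit encoding replaces the manual .encode('utf8')
out_path = os.path.join(tempfile.gettempdir(), "parveen_urls.txt")
with open(out_path, "w", encoding="utf-8") as fobj:
    for u in urls:
        fobj.write(u + "\n")
```

Returning an empty list for a missing container keeps the calling code free of None checks; whether that or an explicit error message is preferable depends on how the scraper should react to unexpected pages.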

Answer 3

I found that the problem was caused by null data.

I fixed it by filtering out the null data.
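
The answer gives no code. One way to read "filtering out the null data" is to skip tags whose .string is None before encoding. A sketch under that assumption, using Beautiful Soup 4 and made-up markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the middle div has two children, so its .string is None
html = (
    '<div class="entry-content">'
    '<div>http://teste.com</div>'
    '<div><span>a</span><span>b</span></div>'
    '<div>http://teste2.com</div>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", attrs={"class": "entry-content"})

# Keep only the tags that actually have a .string, then encode
encoded = [
    tag.string.encode("utf8")
    for tag in container.find_all("div")
    if tag.string is not None
]
print(encoded)
```

Note that this silently drops the tags with no single string; .get_text() keeps their text instead.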

'NoneType' object has no attribute 'encode' (web scraping)
