I'm confused about why I'm getting this error:
'NoneType' object has no attribute 'a'
This is the HTML structure I'm scraping:
<section class ="videos"
<section class="box">
<a href="/videos/video.php?v=wshhH0xVL2LP4hFb0liu" class="video-box">
<img src="http://hw-static.exampl.net/.jpg" width="222" height="125" alt="">
</a>
<strong class="title"><a href="/videos/video.php?v=wshhH0xVL2LP4hFb0liu">Teen "Allegedly" </a></strong>
<div>
<span class="views">11,323</span>
<span class="comments"><a href="http://www.example.net/v" data-disqus-identifier="94137">44</a></span>
</div>
In my Django app, if I do this:
html = requests.get(vlad_url)
soup = BeautifulSoup(html.text, 'html.parser')
divs = soup.find('section', 'videos')
img = divs.find('img').get('src')
text = divs.strong.a.text
link = divs.a.get('href')
context = {
    "ref": link,
    "src": img,
    "txt": text,
}
That is in my view. In my template I have:
{{ref}}
{{src}}
{{txt}}
I get a single result. But when I try to loop over them like this:
def get_vlad(url):
    html = requests.get(url, headers=headers)
    soup = BeautifulSoup(html.text, 'html.parser')
    divs = soup.findAll('section', 'box')
    entries = [{'text': div.strong.a.text,
                'link': div.a.get('href'),
                'img': div.find('img').get('src')
                } for div in divs]
    return entries
I get the NoneType error, which is strange because the element does exist. It is also strange because I have another, similar loop that works:
def get_data(uri):
    html = requests.get(uri, headers=headers)
    soup = BeautifulSoup(html.text, "html.parser")
    divs = soup.findAll('div', 'thumbnail')
    entries = [{'text': div.text,
                'href': div.find('a').get('href'),
                'src': div.find('img').get('src')
                } for div in divs][:6]
    return entries
This is the HTML structure it works on:
<div class="col-xs-12 col-md-4" id="split">
<div class="thumbnail thumb">
<h6 id="date">May 6, 2016</h6>
<img src="http://www.paraguayhits.com/wp-content/uploads/2015/11/Almighty-Ft.-N%CC%83engo-Flow-Por-Si-Roncan-660x330.jpg" class="img-responsive post">
<div style="border-bottom: thin solid lightslategray; padding-bottom: 15px;"></div>
<div class="caption" id="cap">
<a href="/blog/almighty-por-si-roncan-ft-nengo-flow-official-video/">
<h5 class="post-title" id="title">Almighty - Por Si Roncan (ft. Ñengo Flow) [Official Video]</h5>
</a>
<p>
<a href="/blog/76/delete/" class="btn" role="button">delete</a>
<a href="/blog/almighty-por-si-roncan-ft-nengo-flow-official-video/edit/" class="btn" role="button">edit</a>
</p>
</div>
</div>
What is the difference between the two? How can I loop over my results?
Answer 0 (score: 1)
The HTML is broken and some of the tags are mangled. I was able to parse the badly broken HTML with bs4 by using html5lib as the parser:
In [21]: h = """<section class="videos"
....: <section class="box">
....: <a href="/videos/video.php?v=wshhH0xVL2LP4hFb0liu" class="video-box">
....: <img src="http://hw-static.exampl.net/.jpg" width="222" height="125" alt="">
....: </a>
....: <strong class="title"><a href="/videos/video.php?v=wshhH0xVL2LP4hFb0liu">Teen "Allegedly" </a></strong>
....: <div>
....: <span class="views">11,323</span>
....: <span class="comments"><a href="http://www.example.net/v" data-disqus-identifier="94137">44</a></span>
....: </div>"""
In [22]: from bs4 import BeautifulSoup
In [23]: soup = BeautifulSoup(h, 'html5lib')
In [24]: divs = soup.select_one('section.videos')
In [25]: img = divs.find('img').get('src')
In [26]: text = divs.strong.a.text
In [27]: link = divs.a.get('href')
In [28]: img
Out[28]: u'http://hw-static.exampl.net/.jpg'
In [29]: text
Out[29]: u'Teen "Allegedly" '
In [30]: link
Out[30]: u'/videos/video.php?v=wshhH0xVL2LP4hFb0liu'
The correct HTML should look something like:
<section class ="videos">
<section class="box">
<a href="/videos/video.php?v=wshhH0xVL2LP4hFb0liu" class="video-box">
<img src="http://hw-static.exampl.net/.jpg" width="222" height="125" alt="">
</a>
<strong class="title"><a href="/videos/video.php?v=wshhH0xVL2LP4hFb0liu">Teen "Allegedly" </a></strong>
</section>
<div>
<span class="views">11,323</span>
<span class="comments"><a href="http://www.example.net/v" data-disqus-identifier="94137">44</a></span>
</div>
</section>
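Building on the html5lib approach, here is a minimal sketch of the loop version. The function name `extract_entries` and its `parser` argument are hypothetical and not part of the original code, and the sketch assumes each video sits in its own `<section class="box">`, as in the corrected HTML. It skips any box that lacks a title link or an image, which is one way to avoid the `'NoneType' object has no attribute 'a'` error:

```python
from bs4 import BeautifulSoup


def extract_entries(html_text, parser="html5lib"):
    """Collect title text, link and image URL from each <section class="box">.

    Hypothetical helper: boxes missing a <strong><a> title or an <img>
    are skipped instead of raising AttributeError on None.
    """
    soup = BeautifulSoup(html_text, parser)
    entries = []
    for box in soup.find_all("section", class_="box"):
        strong = box.find("strong")
        title_link = strong.find("a") if strong is not None else None
        img = box.find("img")
        if title_link is None or img is None:
            continue  # incomplete box: nothing to extract here
        entries.append({
            "text": title_link.get_text(strip=True),
            "link": title_link.get("href"),
            "img": img.get("src"),
        })
    return entries
```

Inside `get_vlad`, this could be called as `extract_entries(html.text)`. Passing `parser="html.parser"` falls back to the built-in parser when html5lib is not installed, although that parser may build a different tree from badly broken markup.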