I'm using beautifulsoup4 in my Django application to scrape data. I can get the data from this HTML structure:
<div class="thumbnail thumb">
<h6 id="date">May 9, 2016</h6>
<img src="http://assets.system.jpg" class="img-responsive post">
<div style="border-bottom: thin solid lightslategray; padding-bottom: 15px;"></div>
<div class="caption" id="cap">
<a href="/blog/homeland-security-attack/">
<h5 class="post-title" id="title">Homeland Security </h5>
</a>
<p>
<a href="/blog/88/delete/" class="btn" role="button">delete</a>
<a href="/blog/homeland-" class="btn" role="button">edit</a>
</p>
</div>
</div>
using this in my view:
import requests
from bs4 import BeautifulSoup

url = 'http://www.hispanicheights.com/'
google = requests.get(url)
bs = BeautifulSoup(google.content, 'html.parser')
divs = bs.findAll('div', 'thumbnail')
entries = [{'text': div.text,
            'href': div.find('a').get('href'),
            'src': div.find('img').get('src')
            } for div in divs][:6]
But when I try to scrape this HTML structure:
<div class="entry entry-pos-1" id="entry-217985">
<a href="/article/murder" data-page="1">
<p class="entry-comments">6</p>
<img data-original="/images17985.jpg" alt="Chicago Rapper & OTF Aff Murder" width="320" height="179" class="image-load" src="/images/size_mb/video-217985.jpg" style="display: block;">
</a>
<p class="entry-title">
<a href="/article/-murder" data-page="1">Chicago Rapper & OT Murder</a>
</p>
<p class="entry-meta">97 views</p>
<p class="entry-date">
<span class="entry-recent">11 Mins Ago</span>
</p>
</div>
using the same approach:
ad_url = 'http://www.ad.com/'
ad_get = requests.get(ad_url, headers=headers)
ad_soup = BeautifulSoup(ad_get.content, 'html.parser')
ad_div = ad_soup.findAll('div', 'entry')
ad_entry = [{'text': div.text,
             'href': div.find('a').get('href'),
             'src': div.find('img').get('src')
             } for div in ad_div]
I get the error: 'NoneType' object has no attribute 'get'
What is the correct syntax to get the href and src?
Answer 0 (score: 0)
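If you call div.find('a') on a div that does not contain an anchor, it returns None, and calling .get() on None raises the error you are seeing. The same applies to div.find('img'). Your code has to handle this case. For example, you could do: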
entries = []
for div in ad_div:
    a = div.find('a')
    img = div.find('img')
    # Skip any div that lacks an anchor or an image
    if a is not None and img is not None:
        entry = {
            'text': div.text,
            'href': a.get('href'),
            'src': img.get('src'),
        }
        entries.append(entry)
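If you would rather keep every matching div and simply record missing values as None, a small helper can wrap the None check. The following is only a sketch of that variation: the sample HTML (including the `entry-ad` div with no anchor or image) and the helper names `attr_or_none` and `extract_entries` are made up for illustration and do not come from the site being scraped.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the second "entry" div has no <a> or <img>,
# which is the situation that makes div.find(...) return None.
SAMPLE_HTML = """
<div class="entry entry-pos-1" id="entry-217985">
  <a href="/article/murder" data-page="1">
    <img class="image-load" src="/images/size_mb/video-217985.jpg">
  </a>
  <p class="entry-title"><a href="/article/-murder">Chicago Rapper</a></p>
</div>
<div class="entry entry-ad">Sponsored</div>
"""


def attr_or_none(parent, tag_name, attr):
    """Return the attribute of the first matching descendant tag,
    or None if the tag (or the attribute) is missing."""
    tag = parent.find(tag_name)
    return tag.get(attr) if tag is not None else None


def extract_entries(html):
    """Build one dict per div with class "entry", using None for missing fields."""
    soup = BeautifulSoup(html, 'html.parser')
    entries = []
    for div in soup.find_all('div', class_='entry'):
        entries.append({
            'text': div.get_text(strip=True),
            'href': attr_or_none(div, 'a', 'href'),
            'src': attr_or_none(div, 'img', 'src'),
        })
    return entries


entries = extract_entries(SAMPLE_HTML)
```

Whether to skip incomplete divs, as in the loop above, or keep them with None values depends on what your template expects. Keeping them makes it easier to see which divs matched the class but lacked a link or an image.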