I am extracting the text of the span with class st using the following code:
address = "http://www.google.com/search?q=%s&num=50&hl=en&start=0" % (urllib.quote_plus(query))
request = urllib2.Request(address, None, {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11'})
urlfile = urllib2.urlopen(request)
page = urlfile.read()
soup = BeautifulSoup(page)
divg=soup.findAll('div',attrs={'class':'g'})
for li in divg:
    try:
        print "\n\n"
        print "Link :"
        print li.find('h3').find('a')['href']
        print "Title "
        title=(li.find('h3',attrs={'class':'r'}))
        print title.getText()
        print "Body"
        body=(li.find('span',attrs={'class':'st'}))
        print body.getText()
    except:
        continue
print len(divg)
The respective div is as follows:
<div class="g">
<!--m-->
<div class="rc" data-hveid="53">
<h3 class="r">
<a
href="/url?sa=t&rct=j&q=&esrc=s&source=web&cd=45&cad=rja&uact=8&ved=0ahUKEwim0sfq6KnMAhVTCY4KHTEuD7w4KBAWCDYwBA&url=http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC2658273%2F&usg=AFQjCNGPmCR8qk2Zu2W0Yx4tgZV2vcLTSQ&sig2=bF_cyrQY1qA5G3c-ZY8Cyg&bvm=bv.119745492,d.c2E"
onmousedown="return rwt(this,'','','','45','AFQjCNGPmCR8qk2Zu2W0Yx4tgZV2vcLTSQ','bF_cyrQY1qA5G3c-ZY8Cyg','0ahUKEwim0sfq6KnMAhVTCY4KHTEuD7w4KBAWCDYwBA','','',event)"
data-href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2658273/">The
“4‐hour target”: emergency nurses' views - NCBI</a>
</h3>
<div class="s">
<div>
<div class="f kv _SWb" style="white-space: nowrap">
<cite class="_Rm bc">www.ncbi.nlm.nih.gov › NCBI ›
Literature › PubMed Central (PMC)</cite>
<div class="action-menu ab_ctl">
<a class="_Fmb ab_button" href="#" id="am-b44"
aria-label="Result details" aria-expanded="false"
aria-haspopup="true" role="button"
jsaction="m.tdd;keydown:m.hbke;keypress:m.mskpe"
data-ved="0ahUKEwim0sfq6KnMAhVTCY4KHTEuD7w4KBDsHQg4MAQ"><span
class="mn-dwn-arw"></span></a>
<div class="action-menu-panel ab_dropdown" role="menu"
tabindex="-1"
jsaction="keydown:m.hdke;mouseover:m.hdhne;mouseout:m.hdhue"
data-ved="0ahUKEwim0sfq6KnMAhVTCY4KHTEuD7w4KBCpHwg5MAQ">
<ol>
<li class="action-menu-item ab_dropdownitem" role="menuitem"><a
class="fl"
href="/search?biw=1024&bih=738&q=related:www.ncbi.nlm.nih.gov/pmc/articles/PMC2658273/+target+breach+2005&tbo=1&sa=X&ved=0ahUKEwim0sfq6KnMAhVTCY4KHTEuD7w4KBAfCDowBA">Similar</a></li>
</ol>
</div>
</div>
</div>
<div class="f slp">
by A Mortimore - ‎2007 - ‎<a class="fl"
href="https://scholar.google.co.in/scholar?biw=1024&bih=738&bav=on.2,or.r_cp.&bvm=bv.119745492,d.c2E&um=1&ie=UTF-8&lr&cites=3213296797661648681"
onmousedown="return rwt(this,'','','','45','AFQjCNHE8YfvgTyRDBVn4TU3jtu4KUs-nQ','6JIlCuL7509JtYLCnAkMcA','0ahUKEwim0sfq6KnMAhVTCY4KHTEuD7w4KBDOAgg8MAQ','','',event)">Cited
by 49</a> - ‎<a class="fl"
href="https://scholar.google.co.in/scholar?biw=1024&bih=738&bav=on.2,or.r_cp.&bvm=bv.119745492,d.c2E&um=1&ie=UTF-8&lr&q=related:KdeFpmnslyxTsM:scholar.google.com/"
onmousedown="return rwt(this,'','','','45','AFQjCNHZJGbmVUicvi92tJNi69S5XgOGwQ','Zd8ZJ2OBi7nF6vwhTNg2jg','0ahUKEwim0sfq6KnMAhVTCY4KHTEuD7w4KBDPAgg9MAQ','','',event)">Related
articles</a>
</div>
<span class="st">Prior to the <em>target</em>, the emergency
department (ED) included in this study had ..... be <em>breached</em>
(letter) BMJ <em>2005</em>,
http://www.bmj.com/cgi/eletters/330/7501/1188# ...
</span>
<div class="_Tib">You visited this page on 25/4/16.</div>
</div>
</div>
</div>
<!--n-->
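But I am getting blank results. In most cases the code works fine, but in some cases it produces blank output.
Answer 0 (score: 0)
In most cases the code works fine, but in some cases it produces blank output.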
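This is because, besides regular search results, elements with the g class can also represent image thumbnails. To limit the search to regular search results, I would look for them inside the div element with class="srg":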
divg = soup.select('div.srg div.g')
for li in divg:
    # ...
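To see the difference between the two lookups, here is a minimal sketch. The markup in SAMPLE_HTML is a hypothetical, heavily simplified stand-in for a results page (one thumbnail-style div.g outside div.srg and one regular result inside it), not Google's actual HTML, and it assumes BeautifulSoup 4 with the built-in html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: one thumbnail-style div.g outside
# div.srg and one regular result inside it.
SAMPLE_HTML = """
<div id="res">
  <div class="g"><a href="/images?q=x"><img src="thumb.png"></a></div>
  <div class="srg">
    <div class="g">
      <h3 class="r"><a href="/url?q=http://example.com/">Example title</a></h3>
      <span class="st">Example <em>snippet</em> text</span>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Matches every div whose class list contains "g", thumbnails included.
all_g = soup.find_all("div", attrs={"class": "g"})

# Matches only div.g elements that are descendants of div.srg.
regular = soup.select("div.srg div.g")

# Safe here because every element in `regular` has both children.
titles = [g.find("h3", attrs={"class": "r"}).get_text() for g in regular]
snippets = [g.find("span", attrs={"class": "st"}).get_text() for g in regular]

print(len(all_g))
print(len(regular))
```

In this sample the thumbnail block has no h3 or span.st, so find() returns None for it. In the question's loop that raises an AttributeError after the blank lines and "Link :" have already been printed, and the bare except hides it, which would explain the blank output.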
Note that I am assuming you are using BeautifulSoup version 4 and that your import is:
from bs4 import BeautifulSoup
You may also need to encode the printed text as utf-8:
print body.text.encode("utf-8")
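Putting the pieces together, one possible way to write the loop is to check for missing elements explicitly rather than relying on a bare except. This is only a sketch: extract_results and SAMPLE_PAGE are hypothetical names, the markup is invented for illustration, and select_one requires a reasonably recent BeautifulSoup 4 release:

```python
from bs4 import BeautifulSoup


def extract_results(html):
    """Return (link, title, snippet) tuples for results that have all parts."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for g in soup.select("div.srg div.g"):
        link = g.select_one("h3.r a")
        snippet = g.find("span", attrs={"class": "st"})
        # Skip incomplete results explicitly instead of swallowing exceptions.
        if link is None or snippet is None or not link.has_attr("href"):
            continue
        results.append((link["href"], link.get_text(), snippet.get_text()))
    return results


# Hypothetical markup: the second result has no span.st and is skipped.
SAMPLE_PAGE = """
<div class="srg">
  <div class="g">
    <h3 class="r"><a href="/url?q=http://example.com/">Example title</a></h3>
    <span class="st">Example snippet</span>
  </div>
  <div class="g">
    <h3 class="r"><a href="/url?q=http://example.org/">No snippet here</a></h3>
  </div>
</div>
"""

results = extract_results(SAMPLE_PAGE)
```

On Python 2, each string in the returned tuples is unicode, so it can be encoded with .encode("utf-8") before printing, as shown above.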