BeautifulSoup tag.getText() returns a blank value

Time: 2016-04-25 09:51:01

Tags: python-2.7 beautifulsoup

I am extracting the text of the span with class st, using the following code:

import urllib
import urllib2

# `query` is the search string; it is defined elsewhere in the script.
address = "http://www.google.com/search?q=%s&num=50&hl=en&start=0" % (urllib.quote_plus(query))
request = urllib2.Request(address, None, {'User-Agent':'Mosilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11'})
urlfile = urllib2.urlopen(request)
page = urlfile.read()
soup = BeautifulSoup(page)



divg=soup.findAll('div',attrs={'class':'g'})

for li in divg:
    try:
        print "\n\n"

        print "Link :"
        print li.find('h3').find('a')['href']

        print "Title "
        title=(li.find('h3',attrs={'class':'r'}))

        print title.getText()

        print "Body"
        body=(li.find('span',attrs={'class':'st'}))

        print body.getText()



    except:
        continue

print len(divg)

The respective div looks like this:

<div class="g">
<!--m-->
<div class="rc" data-hveid="53">
    <h3 class="r">
        <a
            href="/url?sa=t&amp;rct=j&amp;q=&amp;esrc=s&amp;source=web&amp;cd=45&amp;cad=rja&amp;uact=8&amp;ved=0ahUKEwim0sfq6KnMAhVTCY4KHTEuD7w4KBAWCDYwBA&amp;url=http%3A%2F%2Fwww.ncbi.nlm.nih.gov%2Fpmc%2Farticles%2FPMC2658273%2F&amp;usg=AFQjCNGPmCR8qk2Zu2W0Yx4tgZV2vcLTSQ&amp;sig2=bF_cyrQY1qA5G3c-ZY8Cyg&amp;bvm=bv.119745492,d.c2E"
            onmousedown="return rwt(this,'','','','45','AFQjCNGPmCR8qk2Zu2W0Yx4tgZV2vcLTSQ','bF_cyrQY1qA5G3c-ZY8Cyg','0ahUKEwim0sfq6KnMAhVTCY4KHTEuD7w4KBAWCDYwBA','','',event)"
            data-href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2658273/">The
            “4‐hour target”: emergency nurses' views - NCBI</a>
    </h3>
    <div class="s">
        <div>
            <div class="f kv _SWb" style="white-space: nowrap">
                <cite class="_Rm bc">www.ncbi.nlm.nih.gov › NCBI ›
                    Literature › PubMed Central (PMC)</cite>
                <div class="action-menu ab_ctl">
                    <a class="_Fmb ab_button" href="#" id="am-b44"
                        aria-label="Result details" aria-expanded="false"
                        aria-haspopup="true" role="button"
                        jsaction="m.tdd;keydown:m.hbke;keypress:m.mskpe"
                        data-ved="0ahUKEwim0sfq6KnMAhVTCY4KHTEuD7w4KBDsHQg4MAQ"><span
                        class="mn-dwn-arw"></span></a>
                    <div class="action-menu-panel ab_dropdown" role="menu"
                        tabindex="-1"
                        jsaction="keydown:m.hdke;mouseover:m.hdhne;mouseout:m.hdhue"
                        data-ved="0ahUKEwim0sfq6KnMAhVTCY4KHTEuD7w4KBCpHwg5MAQ">
                        <ol>
                            <li class="action-menu-item ab_dropdownitem" role="menuitem"><a
                                class="fl"
                                href="/search?biw=1024&amp;bih=738&amp;q=related:www.ncbi.nlm.nih.gov/pmc/articles/PMC2658273/+target+breach+2005&amp;tbo=1&amp;sa=X&amp;ved=0ahUKEwim0sfq6KnMAhVTCY4KHTEuD7w4KBAfCDowBA">Similar</a></li>
                        </ol>
                    </div>
                </div>
            </div>
            <div class="f slp">
                by A Mortimore - &lrm;2007 - &lrm;<a class="fl"
                    href="https://scholar.google.co.in/scholar?biw=1024&amp;bih=738&amp;bav=on.2,or.r_cp.&amp;bvm=bv.119745492,d.c2E&amp;um=1&amp;ie=UTF-8&amp;lr&amp;cites=3213296797661648681"
                    onmousedown="return rwt(this,'','','','45','AFQjCNHE8YfvgTyRDBVn4TU3jtu4KUs-nQ','6JIlCuL7509JtYLCnAkMcA','0ahUKEwim0sfq6KnMAhVTCY4KHTEuD7w4KBDOAgg8MAQ','','',event)">Cited
                    by 49</a> - &lrm;<a class="fl"
                    href="https://scholar.google.co.in/scholar?biw=1024&amp;bih=738&amp;bav=on.2,or.r_cp.&amp;bvm=bv.119745492,d.c2E&amp;um=1&amp;ie=UTF-8&amp;lr&amp;q=related:KdeFpmnslyxTsM:scholar.google.com/"
                    onmousedown="return rwt(this,'','','','45','AFQjCNHZJGbmVUicvi92tJNi69S5XgOGwQ','Zd8ZJ2OBi7nF6vwhTNg2jg','0ahUKEwim0sfq6KnMAhVTCY4KHTEuD7w4KBDPAgg9MAQ','','',event)">Related
                    articles</a>
            </div>
            <span class="st">Prior to the <em>target</em>, the emergency
                department (ED) included in this study had ..... be <em>breached</em>
                (letter) BMJ <em>2005</em>,
                http://www.bmj.com/cgi/eletters/330/7501/1188#&nbsp;...
            </span>
            <div class="_Tib">You visited this page on 25/4/16.</div>
        </div>
    </div>
</div>
<!--n-->

But I get blank results. In most cases the code works fine, but in some cases it produces blank output.

1 answer:

Answer 0 (score: 0)

  

"In most cases the code works fine, but in some cases it produces blank output."

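That is because, besides regular search results, elements with class g can also represent image thumbnails, which have no h3 or span.st inside them. To restrict the search to regular results, I would look for them inside the div element with class="srg":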

divg = soup.select('div.srg div.g')
for li in divg:
    # ...

Note that I am assuming you are using BeautifulSoup version 4 and that your import is:

from bs4 import BeautifulSoup
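
To make the difference between the two lookups concrete, here is a minimal sketch run against a hand-written, heavily simplified page. The markup in SAMPLE_HTML and the extract helper are assumptions made for illustration only; they are not Google's real markup or part of the original code. The sketch also checks for missing elements explicitly rather than relying on a bare except, so a block without a title or snippet shows up as None instead of being skipped silently.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: one image-thumbnail block and one
# regular result. Both use class "g", but only the second sits in div.srg.
SAMPLE_HTML = """
<div id="ires">
  <div class="g"><a href="/images?q=x"><img src="thumb.jpg"></a></div>
  <div class="srg">
    <div class="g">
      <h3 class="r"><a href="/url?q=http://example.com/">Example title</a></h3>
      <span class="st">Example <em>snippet</em> text</span>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The question's lookup: every div with class "g", thumbnails included.
all_g = soup.find_all("div", attrs={"class": "g"})

# The narrower lookup: only div.g elements nested inside div.srg.
regular = soup.select("div.srg div.g")


def extract(result):
    """Return (link, title, body) for one block; missing parts become None."""
    anchor = result.select_one("h3.r a")
    body = result.find("span", attrs={"class": "st"})
    link = anchor["href"] if anchor is not None else None
    title = anchor.get_text(strip=True) if anchor is not None else None
    text = body.get_text(" ", strip=True) if body is not None else None
    return link, title, text


rows = [extract(r) for r in regular]
print(len(all_g), len(regular))
print(rows)
```

Calling extract on the thumbnail block (all_g[0]) returns three None values, which is the case the original loop hides behind its except clause.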

You may also need to encode the printed text as utf-8:

print body.text.encode("utf-8")
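
As a rough illustration of why the encoding step can matter: the title in the sample result contains curly quotes and a non-ASCII hyphen. Under Python 2, printing such a unicode string to an output stream that falls back to ASCII raises UnicodeEncodeError, and the bare except in the question would then skip the result without any message. Whether that is what happened in the asker's environment is an assumption. The sketch below only shows the encoding behaviour itself; try_encode is a hypothetical helper, and the hyphen is assumed to be U+2010.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Title from the question's sample result, written with escapes so the
# non-ASCII characters are explicit (the hyphen is assumed to be U+2010).
title = u"The \u201c4\u2010hour target\u201d: emergency nurses' views - NCBI"


def try_encode(text, codec):
    """Return the encoded bytes, or None if the codec cannot represent text."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None


ascii_bytes = try_encode(title, "ascii")   # fails: curly quotes are not ASCII
utf8_bytes = try_encode(title, "utf-8")    # succeeds: UTF-8 covers all of Unicode

print(ascii_bytes is None)
print(utf8_bytes.decode("utf-8") == title)
```

Narrowing the except clause in the original loop, or at least printing the caught exception, would reveal whether an encoding error is behind a given blank result.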