How to scrape an HTML tag's attribute with BeautifulSoup (bs4) (I don't want the text)

Date: 2015-05-27 11:55:46

Tags: python beautifulsoup web-crawler

<div class="tioTrivia lightblue bottomRight show sticky" data-login-url="http://www.ntvspor.net/uyelik/giris?returnUrl=/haber/futbol/131009/uniteda-yeni-arjantinli?utm_source=ntvspor%26utm_medium=oyun%26utm_campaign=iste_oyun" data-article-url="/haber/futbol/131009/uniteda-yeni-arjantinli?utm_source=ntvspor&utm_medium=oyun&utm_campaign=iste_oyun&ref=isteoyun" data-profile-url="http://www.ntvspor.net/uyelik/profil" data-content-class="trivia-widget-position" data-start-place="bottom-right" data-show-points="true" data-article-id="Tivibu,Manşet,Futbol,Futbol,Spor Toto Süper Lig,Beşiktaş,Gençlerbirliği" style="transition: opacity 0.5s ease-in-out 0s, right 0.5s ease 0s; top: 832px;">

This HTML is my target. I want to grab this attribute:

data-article-id="Tivibu,Manşet,Futbol,Futbol,Spor Toto Süper Lig,Beşiktaş,Gençlerbirliği"

Specifically, I need this value:

"Tivibu,Manşet,Futbol,Futbol,Spor Toto Süper Lig,Beşiktaş,Gençlerbirliği"

I wrote this function, but it returns None:

def read_tags(self, news_url):
    try:
        self.checkRequests(news_url)
        tag = self.soup.find("div", {'class': 'tioTrivia lightblue bottomRight show sticky'})
        if tag:
            tag = tag.get_text().encode(encoding='utf-8')
            return tag.lower()
        return
    except Exception as e:
        self.insertErrorLog('ntvspor.read_title', news_url, e)

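2 answers:

Answer 0: (score: 0)

With your code and the sample HTML, tag.get_text() returns an empty string, because the div tag has no inner text. The value you want is stored in an attribute, not in the tag's text.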
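To make the difference between text and attributes concrete, here is a minimal sketch. It assumes Python 3 and uses a shortened, hypothetical version of the div from the question (most attributes dropped, a closing tag added), parsed with the built-in html.parser:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the div in the question (illustrative only).
html = '<div class="tioTrivia lightblue" data-article-id="Tivibu,Futbol"></div>'

soup = BeautifulSoup(html, "html.parser")
tag = soup.find("div", class_="tioTrivia")

# get_text() only sees what sits between <div> and </div>, which is nothing here.
inner_text = tag.get_text()

# The data is carried by the tag's attributes instead.
attribute_names = sorted(tag.attrs)

print(repr(inner_text))
print(attribute_names)
```

So a lookup that succeeds can still give you nothing useful if you then ask for the text instead of the attribute.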

Why not read the value of the data-article-id attribute directly from the matched tag?

from bs4 import BeautifulSoup

soup = BeautifulSoup('''<div class="tioTrivia lightblue bottomRight show sticky" data-login-url="http://www.ntvspor.net/uyelik/giris?returnUrl=/haber/futbol/131009/uniteda-yeni-arjantinli?utm_source=ntvspor%26utm_medium=oyun%26utm_campaign=iste_oyun" data-article-url="/haber/futbol/131009/uniteda-yeni-arjantinli?utm_source=ntvspor&utm_medium=oyun&utm_campaign=iste_oyun&ref=isteoyun" data-profile-url="http://www.ntvspor.net/uyelik/profil" data-content-class="trivia-widget-position" data-start-place="bottom-right" data-show-points="true" data-article-id="Tivibu,Manşet,Futbol,Futbol,Spor Toto Süper Lig,Beşiktaş,Gençlerbirliği" style="transition: opacity 0.5s ease-in-out 0s, right 0.5s ease 0s; top: 832px;">''')
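# Match on the single class tioTrivia; .get() falls back to '' if the attribute is missing
data = soup.find('div', class_='tioTrivia').get('data-article-id', '')
# Encode to UTF-8 bytes (Python 2), as the code in the question does
data = data.encode('utf8')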

>>> data
'Tivibu,Man\xc5\x9fet,Futbol,Futbol,Spor Toto S\xc3\xbcper Lig,Be\xc5\x9fikta\xc5\x9f,Gen\xc3\xa7lerbirli\xc4\x9fi'
>>> print data
Tivibu,Manşet,Futbol,Futbol,Spor Toto Süper Lig,Beşiktaş,Gençlerbirliği

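Also, you don't need to specify every value of the class attribute. In this case tioTrivia should be enough, because the other classes (lightblue bottomRight show sticky) are presentational rather than semantic.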
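Matching on one class is also more robust. When you pass a multi-class string, BeautifulSoup compares it against the whole class attribute value, so the lookup fails as soon as the page's class list differs from the string. The sketch below assumes Python 3 and a hypothetical version of the div that carries only three of the five classes (for instance, if some classes are added later by scripts in the browser, which the question does not confirm):

```python
from bs4 import BeautifulSoup

# Assumption for illustration: the fetched HTML lacks the "show" and "sticky" classes.
html = '<div class="tioTrivia lightblue bottomRight" data-article-id="Tivibu,Futbol"></div>'
soup = BeautifulSoup(html, "html.parser")

# The full five-class string matches neither a single class nor the whole attribute value.
full_match = soup.find("div", {"class": "tioTrivia lightblue bottomRight show sticky"})

# A single class name matches any tag whose class list contains it.
single_match = soup.find("div", class_="tioTrivia")

print(full_match)
print(single_match.get("data-article-id"))
```

If the real page behaves like this stand-in, that would also explain a None return from a find() call that uses the full class string.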

Answer 1: (score: 0)

It is as simple as this:

for t in soup.select('.tioTrivia'):
    print t.get('data-article-id')
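For completeness, the same CSS-selector idea can be wrapped in a small helper. This is only a sketch under assumptions: it targets Python 3, extract_article_tags is a hypothetical name, and splitting the value on commas is my reading of how the attribute is structured, not something the page documents:

```python
from bs4 import BeautifulSoup


def extract_article_tags(html):
    """Return the lowercased, comma-separated values of data-article-id, or []."""
    soup = BeautifulSoup(html, "html.parser")
    # Require both the class and the attribute, so a bare .tioTrivia div is skipped.
    tag = soup.select_one("div.tioTrivia[data-article-id]")
    if tag is None:
        return []
    return [part.strip().lower() for part in tag["data-article-id"].split(",")]


# Shortened stand-in for the div in the question.
sample = '<div class="tioTrivia lightblue" data-article-id="Tivibu,Manşet,Futbol"></div>'
tags = extract_article_tags(sample)
missing = extract_article_tags("<p>no widget here</p>")

print(tags)
print(missing)
```

Returning an empty list rather than None keeps the caller's code uniform. Note that str.lower() is not locale-aware, so a Turkish capital I would become i rather than ı.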