<div class="tioTrivia lightblue bottomRight show sticky" data-login-url="http://www.ntvspor.net/uyelik/giris?returnUrl=/haber/futbol/131009/uniteda-yeni-arjantinli?utm_source=ntvspor%26utm_medium=oyun%26utm_campaign=iste_oyun" data-article-url="/haber/futbol/131009/uniteda-yeni-arjantinli?utm_source=ntvspor&utm_medium=oyun&utm_campaign=iste_oyun&ref=isteoyun" data-profile-url="http://www.ntvspor.net/uyelik/profil" data-content-class="trivia-widget-position" data-start-place="bottom-right" data-show-points="true" data-article-id="Tivibu,Manşet,Futbol,Futbol,Spor Toto Süper Lig,Beşiktaş,Gençlerbirliği" style="transition: opacity 0.5s ease-in-out 0s, right 0.5s ease 0s; top: 832px;">
This HTML is my target. I want to grab this attribute:
data-article-id="Tivibu,Manşet,Futbol,Futbol,Spor Toto Süper Lig,Beşiktaş,Gençlerbirliği"
Specifically, I need its value:
"Tivibu,Manşet,Futbol,Futbol,Spor Toto Süper Lig,Beşiktaş,Gençlerbirliği"
I wrote this function, but it returns None:
def read_tags(self, news_url):
    try:
        self.checkRequests(news_url)
        tag = self.soup.find("div", {'class':'tioTrivia lightblue bottomRight show sticky'})
        if tag:
            tag = tag.get_text().encode(encoding='utf-8')
            return tag.lower()
        return
    except Exception, e:
        self.insertErrorLog('ntvspor.read_title', news_url, e)
Answer 0 (score: 0)
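With your code and the sample HTML, tag.get_text() returns an empty string, because the div tag contains no inner text.
Why not read the value of the data-article-id attribute directly from the matched tag?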
from bs4 import BeautifulSoup
soup = BeautifulSoup('''<div class="tioTrivia lightblue bottomRight show sticky" data-login-url="http://www.ntvspor.net/uyelik/giris?returnUrl=/haber/futbol/131009/uniteda-yeni-arjantinli?utm_source=ntvspor%26utm_medium=oyun%26utm_campaign=iste_oyun" data-article-url="/haber/futbol/131009/uniteda-yeni-arjantinli?utm_source=ntvspor&utm_medium=oyun&utm_campaign=iste_oyun&ref=isteoyun" data-profile-url="http://www.ntvspor.net/uyelik/profil" data-content-class="trivia-widget-position" data-start-place="bottom-right" data-show-points="true" data-article-id="Tivibu,Manşet,Futbol,Futbol,Spor Toto Süper Lig,Beşiktaş,Gençlerbirliği" style="transition: opacity 0.5s ease-in-out 0s, right 0.5s ease 0s; top: 832px;">''')
data = soup.find('div', class_='tioTrivia').get('data-article-id', '')
data = data.encode('utf8')
>>> data
'Tivibu,Man\xc5\x9fet,Futbol,Futbol,Spor Toto S\xc3\xbcper Lig,Be\xc5\x9fikta\xc5\x9f,Gen\xc3\xa7lerbirli\xc4\x9fi'
>>> print data
Tivibu,Manşet,Futbol,Futbol,Spor Toto Süper Lig,Beşiktaş,Gençlerbirliği
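Also, you don't need to specify every value of the class attribute. In this case tioTrivia should be enough, because the others (lightblue bottomRight show sticky) are presentational rather than semantic.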
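A minimal sketch of how class matching behaves, under a few assumptions: it is written for Python 3 (the snippets in this thread are Python 2), it passes "html.parser" explicitly, and the div is a shortened stand-in with a made-up data-article-id value. It compares matching on a single class, on the exact class string, on a reordered class string, and with a CSS selector.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the div in the question; "a,b" is a placeholder value.
html = ('<div class="tioTrivia lightblue bottomRight show sticky" '
        'data-article-id="a,b"></div>')
soup = BeautifulSoup(html, "html.parser")

# One class is enough: class_ matches against each value of a multi-valued class.
by_one_class = soup.find("div", class_="tioTrivia")

# The exact attribute string also matches, but only in the original order.
by_exact_string = soup.find(
    "div", class_="tioTrivia lightblue bottomRight show sticky")
by_reordered = soup.find(
    "div", class_="sticky show bottomRight lightblue tioTrivia")

# CSS selectors do not depend on the order of the classes.
by_css = soup.select_one("div.sticky.tioTrivia")

print(by_one_class is not None)     # True
print(by_exact_string is not None)  # True
print(by_reordered is None)         # True
print(by_css is not None)           # True
```

Matching on the full string is brittle: if the site adds, removes, or reorders one of the presentational classes, the lookup silently returns None.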
Answer 1 (score: 0)
It's as simple as:
for t in soup.select('.tioTrivia'):
    print t.get('data-article-id')
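For completeness, here is one way the function from the question could look once it reads the attribute instead of the tag text. It is only a sketch, with these assumptions: it targets Python 3, it takes the HTML as a string rather than using the asker's self.checkRequests / self.soup plumbing, SAMPLE_HTML is a trimmed copy of the div, and returning a list of lower-cased tags is a guess at what the caller wants.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the div from the question; only the attributes used here are kept.
SAMPLE_HTML = (
    '<div class="tioTrivia lightblue bottomRight show sticky" '
    'data-article-id="Tivibu,Manşet,Futbol,Futbol,Spor Toto Süper Lig,'
    'Beşiktaş,Gençlerbirliği"></div>'
)


def read_tags(html):
    """Return the lower-cased tags from data-article-id, or [] if absent."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", class_="tioTrivia")
    if div is None:
        # No widget on the page: return an empty list instead of None.
        return []
    raw = div.get("data-article-id", "")
    # Note: str.lower() is not Turkish-locale-aware (it maps "I" to "i").
    return [part.strip().lower() for part in raw.split(",") if part.strip()]


tags = read_tags(SAMPLE_HTML)
print(tags)
```

Returning an empty list when the div or the attribute is missing keeps callers from having to special-case None.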