How do I get the text that follows a strong tag?

Time: 2013-11-25 04:29:47

Tags: python python-2.7 beautifulsoup

I have this code:

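import urllib
from bs4 import BeautifulSoup  # or: from BeautifulSoup import BeautifulSoup (version 3)

url = 'http://www.topsoftzone.com/program/12721/Windows_Phone_7.html'
pageurl = urllib.urlopen(url)
soup = BeautifulSoup(pageurl)  # parse the fetched page; the bare class name alone is not a parsed document
print soup.find('table',{'class':'download_tab'}).find('td',{'width':'55%'}).find('strong').text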

I expect this output: 09/29/2011 (Submitted: 09/08/2011)

But the code prints: Updated:

2 answers:

Answer 0: (score: 2)

I guess you missed the table row (tr) between table and td.

Anyway, consider using lxml with XPath:

from lxml import etree
tree = etree.parse(url, etree.HTMLParser())
l = tree.xpath('//table[@class="download_tab"]/tr/td[@width="55%"]/text()')
print l[1]

09/29/2011 (Submitted: 09/08/2011)

Edit: without lxml, as requested

soup = BeautifulSoup(pageurl)
l = soup.find('table',{'class':'download_tab'}).find('tr').find('td',{'width':'55%'}).findAll(text=True)
print l[2]

09/29/2011 (Submitted: 09/08/2011)
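
For what it's worth, here is a small sketch of why the question's code prints only "Updated:" and why the two snippets above index into a list of text nodes instead. It assumes (I have not checked the live page) that the date is plain text sitting after the closing strong tag inside the same td. The fragment below is made up to match that assumption, and it uses the standard library's xml.etree.ElementTree as a stand-in for lxml, whose element API follows the same text/tail model.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the relevant part of the page.
# The real markup may differ; this only illustrates the text/tail split.
FRAGMENT = """
<table class="download_tab">
  <tr>
    <td width="55%">
      <strong>Updated:</strong> 09/29/2011 (Submitted: 09/08/2011)
    </td>
  </tr>
</table>
"""

table = ET.fromstring(FRAGMENT)
td = table.find("./tr/td[@width='55%']")
strong = td.find("strong")

label = strong.text          # text inside <strong>...</strong>
value = strong.tail.strip()  # text after </strong>, still inside the <td>

print(label)
print(value)
```

So the label lives inside the strong element and the date belongs to the enclosing td, which is why asking the strong tag for its text can never return the date.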

Answer 1: (score: 1)

You need more error checking, but this works:

import lxml.html
import urllib

link = "http://www.topsoftzone.com/program/12721/Windows_Phone_7.html"

page = urllib.urlopen(link).read()

doc = lxml.html.document_fromstring(page)
doc.make_links_absolute(link)

found_text = doc.xpath(u".//table[@class='download_tab']/tr/td[@width='55%']/text()")
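try:
    # index 1 is the text node that follows the <strong> label
    print found_text[1].strip()
except IndexError:
    # the XPath matched fewer than two text nodes
    print "Not found"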
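
As one possible way to add the extra error checking mentioned above, the sketch below picks the first text node that actually looks like a date instead of trusting a fixed index. The helper name first_dated_text, the MM/DD/YYYY pattern (taken from the expected output in the question), and the sample list are all assumptions for illustration; in real use the list would come from the XPath query.

```python
import re

# Assumed date shape, based on the expected output in the question.
DATE_RE = re.compile(r"\d{2}/\d{2}/\d{4}")


def first_dated_text(text_nodes, default=None):
    """Return the first stripped text node containing an MM/DD/YYYY date.

    Falls back to `default` when no node matches, so the caller does not
    need to catch IndexError.
    """
    for node in text_nodes:
        cleaned = node.strip()
        if DATE_RE.search(cleaned):
            return cleaned
    return default


# Made-up stand-in for what a td/text() query might return.
sample_nodes = ["\n  ", " 09/29/2011 (Submitted: 09/08/2011)\n", "\n"]

print(first_dated_text(sample_nodes))
print(first_dated_text([], default="Not found"))
```

This keeps working if the page gains or loses a whitespace-only text node, which would shift a hard-coded index.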