How do I get the text inside an HTML tag without getting the text of tags nested further inside it?

Asked: 2011-12-29 10:30:05

Tags: python beautifulsoup

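I'm trying to use BeautifulSoup to parse a list of addresses on a page. When I find a tag that contains both text and nested tags, how do I get the text from that tag without also getting the text of any other (lower-level) nested tags?

I'm using pTag to move from one location to another in the .html page. Here is the code I'm dealing with:

At the Python prompt, I type: >>> pTag.address

and receive the following part of the page source: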

<address>
                Some street address<br />City, State and ZIP<br />
<div class="phone">
                    (123) 456-7890
                </div>
</address>

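So to grab the phone number, I type pTag.address.div.text and get it easily. What I want is the address text that is not nested inside another tag. I could handle it with a regular expression, with an edge case for when there is no phone information, but I was hoping for something more elegant.

Basically this is what I want, even better if it keeps the br tags: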

Some street address<br />City, State and ZIP<br />

5 answers:

Answer 0: (score: 1)

You can remove elements with the extract method:

>>> from BeautifulSoup import BeautifulSoup
>>> s = '<html><address>Some street address<br />City, State and ZIP<br /><div class="phone">(123) 456-7890</div></address></html>'
>>> soup = BeautifulSoup(s)
>>> soup.address.div.extract()
<div class="phone">(123) 456-7890</div>
>>> [e.extract() for e in soup.address.findAll('br')]
[<br />, <br />]
>>> soup.address.text
u'Some street addressCity, State and ZIP'

Answer 1: (score: 1)

This feels like it should be easier, but the best I could come up with is:

>>> from BeautifulSoup import BeautifulSoup, NavigableString
>>> html = """
... <html><head></head><body>
... <address>
...                 Some street address<br />City, State and ZIP<br />
... <div class="phone">
...                     (123) 456-7890
...                 </div>
... </address>
... </body></html>
... """
>>> soup = BeautifulSoup(html)
>>> tag = soup.find('address')
>>> ' '.join(item for item in tag.contents
...          if isinstance(item, NavigableString)).strip()
u'Some street address City, State and ZIP'
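For readers who cannot install BeautifulSoup, here is a minimal sketch of the same "direct text only" idea using the standard library's html.parser on Python 3. It is a different technique (it tracks nesting depth while parsing instead of filtering a parsed tree), and the names DirectText, direct_text and VOID_TAGS are hypothetical helpers, not part of any library.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; an incomplete, illustrative list.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class DirectText(HTMLParser):
    """Collect text that sits directly inside `target`, skipping nested tags."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.depth = 0   # 0 = outside target, 1 = directly inside it
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return       # <br> and friends do not change the nesting depth
        if self.depth or tag == self.target:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Keep only text whose immediate parent is the target tag.
        if self.depth == 1 and data.strip():
            self.parts.append(data.strip())


def direct_text(html, target):
    parser = DirectText(target)
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


html = ('<html><address>Some street address<br />City, State and ZIP<br />'
        '<div class="phone">(123) 456-7890</div></address></html>')
print(direct_text(html, "address"))
```

The sketch assumes reasonably well-nested markup and a single target element; unbalanced tags would throw the depth counter off.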

Edit

Here is an alternative solution using lxml:

>>> from lxml import etree
>>> tree = etree.HTML(html)
>>> tag = tree.xpath('//address')[0]
>>> ' '.join(tag.xpath('./text()')).strip()
'Some street address City, State and ZIP'
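What `./text()` selects is the element's own leading text plus the "tail" text that follows each child element. As a rough illustration of that model, here is a sketch using the standard library's xml.etree.ElementTree on Python 3. It assumes the fragment is well-formed XML (which this snippet happens to be); own_text is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

snippet = (
    '<address>Some street address<br />City, State and ZIP<br />'
    '<div class="phone">(123) 456-7890</div></address>'
)


def own_text(element):
    """Return the element's own text plus the tails of its direct children."""
    pieces = [element.text or ""]
    # Text that follows a child (e.g. after <br />) is stored as child.tail.
    pieces.extend(child.tail or "" for child in element)
    return " ".join(p.strip() for p in pieces if p.strip())


address = ET.fromstring(snippet)
print(own_text(address))
```

Real-world HTML is often not valid XML, so for scraped pages a tolerant parser such as lxml's HTML parser is the safer choice; the text/tail model is the same.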

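Answer 2: (score: 0)

I don't think BeautifulSoup can do this directly, but you can remove the unwanted text (the contents of the div class="phone" tag) from the whole text. It can easily be done like this:

from BeautifulSoup import BeautifulSoup

s = '<html><address>Some street address<br />City, State and ZIP<br /><div class="phone">(123) 456-7890</div></address></html>'
soup = BeautifulSoup(s)
s1 = soup.address.text                 # whole text
s2 = soup.address.div.text             # unwanted text
pos = s1.find(s2)                      # position of the unwanted text in the whole text
s1 = s1[:pos]                          # keep everything before the unwanted text
print s1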
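Note that the slice `s1[:pos]` also drops anything that comes after the unwanted text, and if find returns -1 (text not found) it silently chops off the last character. A small sketch of a more defensive variant, in plain Python 3 with no parsing library; remove_fragment is a hypothetical helper name:

```python
def remove_fragment(whole, fragment):
    """Return `whole` with the first occurrence of `fragment` cut out."""
    pos = whole.find(fragment)
    if pos == -1:
        # Fragment absent (e.g. no phone number): leave the text untouched.
        return whole
    # Keep the text on both sides of the fragment.
    return whole[:pos] + whole[pos + len(fragment):]


whole = "Some street addressCity, State and ZIP(123) 456-7890"
print(remove_fragment(whole, "(123) 456-7890"))
print(remove_fragment(whole, "no such text"))
```

This still matches on raw text, so it can cut the wrong place if the same string also appears in the address itself.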

Answer 3: (score: 0)

If you are interested in trying PyQuery, here is another way to do it:

from pyquery import PyQuery
s = '<html><address>Some street address<br />City, State and ZIP<br /><div class="phone">(123) 456-7890</div></address></html>'
d = PyQuery(s)  # PyQuery was imported directly, so there is no pyquery name to qualify it with
print d('address').text()
# 'Some street address City, State and ZIP (123) 456-7890'
print d('address').remove('*').text()
# 'Some street address City, State and ZIP'

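This removes all child elements from the address before extracting the text content.

Answer 4: (score: 0)

Use a regular expression; it is fast and makes it easy to do the whole extraction in one pass: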

import re

ss = '''a line
another line
<address>
    Some street address<br />City, State and ZIP<br />
    <div class="phone">
        (123) 456-7890
    </div>
    <glomo>
        Hello glomo
    </glomo>
</address>
end of text'''


def analyze(ss,tag,regx = re.compile('<([^ ]+)([^>]*)>(.*?)</\\1>',re.DOTALL)):
    extract = re.search('<(%s)[^>]*>(.*?)</\\1>' % tag,ss,re.DOTALL).group(2)
    li = []
    def trt(m):
        li.append((m.group(1),m.group(2),m.group(3).strip(' \t\r\n')))
    li.append(('','',regx.sub(trt,extract).strip('\r\n\t ')))
    return li

resu = analyze(ss,'address')
for el in  resu:
    print el

print
print resu[-1][2]

Result

('div', ' class="phone"', '(123) 456-7890')
('glomo', '', 'Hello glomo')
('', '', 'Some street address<br />City, State and ZIP<br />')

Some street address<br />City, State and ZIP<br />

Or put the results in a dictionary:

def analyze(ss,tag,regx = re.compile('<([^ ]+)([^>]*)>(.*?)</\\1>',re.DOTALL)):
    extract = re.search('<(%s)[^>]*>(.*?)</\\1>' % tag,ss,re.DOTALL).group(2)
    di = {}
    def trt(m):
        di[m.group(1)] = (m.group(2),m.group(3).strip(' \t\r\n'))
    di[''] = ('',regx.sub(trt,extract).strip('\r\n\t '))
    return di

disu = analyze(ss,'address')
print "disu[''] ==",disu['']
print "disu['div'] ==",disu['div']
print (disu[x][1] for x in disu if 'phone' in disu[x][0]).next()

Result

disu[''] == ('', 'Some street address<br />City, State and ZIP<br />')
disu['div'] == (' class="phone"', '(123) 456-7890')
(123) 456-7890

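The function trt returns nothing (or rather, it returns None), so regx.sub(trt,extract) replaces each matched tag in extract with "", leaving only the text that belongs directly to the tag being examined.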
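To make that mechanism easier to see in isolation, here is a stripped-down Python 3 sketch of the same trick: the callback passed to sub records each nested tag as a side effect and hands back an empty replacement. NESTED and split_nested are hypothetical names, the pattern is a simplified version of the one above (it uses `\w+` for the tag name), and the callback returns "" explicitly rather than relying on None.

```python
import re

# Matches <tag ...>content</tag>; self-closing tags such as <br /> do not match
# because they have no closing counterpart.
NESTED = re.compile(r"<(\w+)([^>]*)>(.*?)</\1>", re.DOTALL)


def split_nested(inner):
    """Return (own_text, nested), where nested maps tag name -> stripped text."""
    nested = {}

    def collect(match):
        # Side effect: remember the nested tag's content.
        nested[match.group(1)] = match.group(3).strip()
        # Replacement: remove the nested tag from the surrounding text.
        return ""

    own = NESTED.sub(collect, inner).strip()
    return own, nested


inner = ('Some street address<br />City, State and ZIP<br />'
         '<div class="phone">(123) 456-7890</div>')
own, nested = split_nested(inner)
print(own)
print(nested)
```

Like any regex-based approach to HTML, this breaks on same-named tags nested inside each other, so it is only reasonable for small, predictable fragments.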