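I'm trying to use BeautifulSoup to parse a list of addresses on a page. When I find a tag that contains both text and embedded tags, how do I get only the text that belongs directly to that tag, without the text from any nested (lower-level) tags?
I use pTag to move from place to place in the .html page. This is the code I'm working with:
At the Python command line I type: >>> pTag.address
and get back the following part of the page's code: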
<address>
Some street address<br />City, State and ZIP<br />
<div class="phone">
(123) 456-7890
</div>
</address>
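So to grab the phone number I type pTag.address.div.text and that works easily. What I want is the address text that is not nested inside another tag. I could work around it with a compiled regex (re) and an edge case for when there is no phone information, but I'm hoping for something more elegant.
Basically this is what I want, ideally with the br tags included: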
Some street address<br />City, State and ZIP<br />
Answer 0 (score: 1)
You can use the extract method to remove elements:
>>> from BeautifulSoup import BeautifulSoup
>>> s = '<html><address>Some street address<br />City, State and ZIP<br /><div class="phone">(123) 456-7890</div></address></html>'
>>> soup = BeautifulSoup(s)
>>> soup.address.div.extract()
<div class="phone">(123) 456-7890</div>
>>> [e.extract() for e in soup.address.findAll('br')]
[<br />, <br />]
>>> soup.address.text
u'Some street addressCity, State and ZIP'
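For readers on Python 3, here is a minimal sketch of the same extract idea. It assumes the bs4 package (Beautiful Soup 4) is installed and uses the built-in "html.parser" backend, neither of which appears in the answer above. Passing a separator to get_text avoids the run-together "addressCity" seen in the output above.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4, not the BS3 import used above

s = ('<html><address>Some street address<br />City, State and ZIP<br />'
     '<div class="phone">(123) 456-7890</div></address></html>')
soup = BeautifulSoup(s, "html.parser")

# extract() removes the <div> from the tree and returns it,
# so the phone number is still available afterwards.
phone = soup.address.div.extract()
phone_text = phone.get_text(strip=True)

# With the <div> gone, only the address strings remain; join them with a space.
address_text = soup.address.get_text(separator=" ", strip=True)

print(phone_text)
print(address_text)
```

Note that extract modifies the parsed tree, so parse the document again (or work on a copy) if the original structure is needed later.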
Answer 1 (score: 1)
This feels like it should be easier, but the best I could come up with is:
>>> from BeautifulSoup import BeautifulSoup, NavigableString
>>> html = """
... <html><head></head><body>
... <address>
... Some street address<br />City, State and ZIP<br />
... <div class="phone">
... (123) 456-7890
... </div>
... </address>
... </body></html>
... """
>>> soup = BeautifulSoup(html)
>>> tag = soup.find('address')
>>> ' '.join(item for item in tag.contents
... if isinstance(item, NavigableString)).strip()
u'Some street address City, State and ZIP'
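The same filter over tag.contents can be sketched for Python 3. This assumes the bs4 package (Beautiful Soup 4) and the built-in "html.parser" backend, which are not what the snippet above imports. The second expression is an illustrative extension that also keeps the br tags, as the question asks; how the br tags are serialized is up to bs4.

```python
from bs4 import BeautifulSoup, NavigableString  # assumption: Beautiful Soup 4

html = """
<address>
Some street address<br />City, State and ZIP<br />
<div class="phone">
(123) 456-7890
</div>
</address>
"""

soup = BeautifulSoup(html, "html.parser")
tag = soup.find("address")

# Direct children only: strings are kept, nested tags (and their text) are skipped.
direct_text = " ".join(
    child.strip()
    for child in tag.contents
    if isinstance(child, NavigableString) and child.strip()
)
print(direct_text)

# Variant that keeps <br> tags between the direct strings.
with_breaks = "".join(
    str(child)
    for child in tag.contents
    if isinstance(child, NavigableString) or child.name == "br"
).strip()
print(with_breaks)
```

Unlike the extract approach, this leaves the parsed tree untouched.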
Edit
Here is an alternative solution using lxml:
>>> from lxml import etree
>>> tree = etree.HTML(html)
>>> tag = tree.xpath('//address')[0]
>>> ' '.join(tag.xpath('./text()')).strip()
'Some street address City, State and ZIP'
Answer 2 (score: 0)
I don't think this is possible with BeautifulSoup alone, but you can remove the unwanted text (the contents of the div class="phone" tag) from the full text. That is easy to do:
from BeautifulSoup import BeautifulSoup
s = '<html><address>Some street address<br />City, State and ZIP<br /><div class="phone">(123) 456-7890</div></address></html>'
soup = BeautifulSoup(s)
s1 = soup.address.text      # whole text
s2 = soup.address.div.text  # unwanted text
pos = s1.find(s2)
s1 = s1[:pos]               # remove the unwanted text
print s1
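One caveat with slicing at the position returned by find: when the unwanted text is absent (for example, an address with no phone div), find returns -1 and the slice silently drops the last character. Below is a minimal Python 3 sketch of a guarded version. The helper name text_before is hypothetical, and the input string is a hard-coded stand-in for what soup.address.text would return.

```python
def text_before(whole_text, unwanted_text):
    """Return the part of whole_text that precedes unwanted_text.

    If unwanted_text is empty or does not occur, whole_text is returned unchanged.
    """
    if not unwanted_text:
        return whole_text
    pos = whole_text.find(unwanted_text)
    if pos == -1:  # not found: slicing with -1 would cut off the last character
        return whole_text
    return whole_text[:pos]


# Stand-in for soup.address.text (assumption: text of all descendants, concatenated).
whole = "Some street addressCity, State and ZIP(123) 456-7890"

print(text_before(whole, "(123) 456-7890"))
print(text_before(whole, "no such text"))
```

This still assumes the unwanted text comes last inside the tag, as it does in the example address.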
Answer 3 (score: 0)
If you are interested in trying PyQuery, here is another way to do it:
from pyquery import PyQuery
s = '<html><address>Some street address<br />City, State and ZIP<br /><div class="phone">(123) 456-7890</div></address></html>'
d = PyQuery(s)
print d('address').text()
# 'Some street address City, State and ZIP (123) 456-7890'
print d('address').remove('*').text()
# 'Some street address City, State and ZIP'
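This removes all child elements from the address before extracting its text content.
Answer 4 (score: 0)
Use a regular expression; it is fast and can easily do the whole extraction in one pass: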
import re
ss = '''a line
another line
<address>
Some street address<br />City, State and ZIP<br />
<div class="phone">
(123) 456-7890
</div>
<glomo>
Hello glomo
</glomo>
</address>
end of text'''
def analyze(ss,tag,regx = re.compile('<([^ ]+)([^>]*)>(.*?)</\\1>',re.DOTALL)):
    extract = re.search('<(%s)[^>]*>(.*?)</\\1>' % tag,ss,re.DOTALL).group(2)
    li = []
    def trt(m):
        li.append((m.group(1),m.group(2),m.group(3).strip(' \t\r\n')))
    li.append(('','',regx.sub(trt,extract).strip('\r\n\t ')))
    return li
resu = analyze(ss,'address')
for el in resu:
    print el
print
print resu[-1][2]
Result
('div', ' class="phone"', '(123) 456-7890')
('glomo', '', 'Hello glomo')
('', '', 'Some street address<br />City, State and ZIP<br />')
Some street address<br />City, State and ZIP<br />
Or put the results in a dictionary:
def analyze(ss,tag,regx = re.compile('<([^ ]+)([^>]*)>(.*?)</\\1>',re.DOTALL)):
    extract = re.search('<(%s)[^>]*>(.*?)</\\1>' % tag,ss,re.DOTALL).group(2)
    di = {}
    def trt(m):
        di[m.group(1)] = (m.group(2),m.group(3).strip(' \t\r\n'))
    di[''] = ('',regx.sub(trt,extract).strip('\r\n\t '))
    return di
disu = analyze(ss,'address')
print "disu[''] ==",disu['']
print "disu['div'] ==",disu['div']
print (disu[x][1] for x in disu if 'phone' in disu[x][0]).next()
Result
disu[''] == ('', 'Some street address<br />City, State and ZIP<br />')
disu['div'] == (' class="phone"', '(123) 456-7890')
(123) 456-7890
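The trt function returns nothing (or, more exactly, it returns None), so regx.sub(trt,extract) replaces the nested tags in the extract with "" and leaves only the text that sits directly inside the examined tag.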
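To make the callback mechanism easier to follow, here is a stripped-down Python 3 sketch of the same idea. The names PAIRED_TAG, split_direct_text and collect are hypothetical, and the callback returns "" explicitly rather than relying on None, since the re documentation describes the callback as returning the replacement string. Like the code above, it is a regex heuristic and will not cope with nested tags of the same name.

```python
import re

TEXT = """<address>
Some street address<br />City, State and ZIP<br />
<div class="phone">
(123) 456-7890
</div>
</address>"""

# A paired tag: name, attributes, then the body up to the matching close tag.
PAIRED_TAG = re.compile(r"<([^ >]+)([^>]*)>(.*?)</\1>", re.DOTALL)


def split_direct_text(fragment):
    """Return (direct_text, nested), where nested lists (tag name, inner text) pairs."""
    nested = []

    def collect(match):
        # Record the nested element, then replace it with nothing.
        nested.append((match.group(1), match.group(3).strip()))
        return ""

    direct = PAIRED_TAG.sub(collect, fragment).strip()
    return direct, nested


# Take the inside of <address>, then split off every paired child tag.
inner = re.search(r"<address[^>]*>(.*?)</address>", TEXT, re.DOTALL).group(1)
direct, nested = split_direct_text(inner)
print(direct)
print(nested)
```

The br tags survive because they have no closing tag, so the paired-tag pattern never matches them.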