Beautiful Soup: find and regex-replace text that is not inside <a></a>

Time: 2011-07-20 06:56:09

Tags: python html find beautifulsoup

I am using Beautiful Soup to parse HTML and find all text that is not contained inside any anchor element.

I came up with this code, which finds all the links that have an href, but it does not do the reverse.

How can I modify this code so that Beautiful Soup returns only the plain text, letting me find and replace within it and modify the soup?

for a in soup.findAll('a',href=True):
    print a['href']

Edit

Example:

<html><body>
 <div> <a href="www.test1.com/identify">test1</a> </div>
 <div><br></div>
 <div><a href="www.test2.com/identify">test2</a></div>
 <div><br></div><div><br></div>
 <div>
   This should be identified 

   Identify me 1 

   Identify me 2 
   <p id="firstpara" align="center"> This paragraph should be<b> identified </b>.</p>
 </div>
</body></html>

Output:

This should be identified 
Identify me 1 
Identify me 2
This paragraph should be identified.

I am doing this to find the text that is not inside <a></a>, then find "Identify" in it and replace it with "Replaced".

So the final output would look like this:

<html><body>
 <div> <a href="www.test1.com/identify">test1</a> </div>
 <div><br></div>
 <div><a href="www.test2.com/identify">test2</a></div>
 <div><br></div><div><br></div>
 <div>
   This should be identified 

   Replaced me 1 

   Replaced me 2 
   <p id="firstpara" align="center"> This paragraph should be<b> identified </b>.</p>
 </div>
</body></html>

Thanks for your time!

1 answer:

Answer 0 (score: 3)

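If I understand you correctly, you want to get the text inside an a element that has an href attribute. To get the text of an element, you can use the .text attribute.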

>>> soup = BeautifulSoup.BeautifulSoup()
>>> soup.feed('<a href="http://something.com">this is some text</a>')
>>> soup.findAll('a', href=True)[0]['href']
u'http://something.com'
>>> soup.findAll('a', href=True)[0].text
u'this is some text'

Edit

This finds all the text elements that contain "identified":

>>> soup = BeautifulSoup.BeautifulSoup()
>>> soup.feed(yourhtml)
>>> [txt for txt in soup.findAll(text=True) if 'identified' in txt.lower()]
[u'\n   This should be identified \n\n   Identify me 1 \n\n   Identify me 2 \n   ', u' identified ']

The returned objects are of type BeautifulSoup.NavigableString. If you want to check whether the parent is an a element, you can test txt.parent.name == 'a'.
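For readers on the newer bs4 package rather than the BeautifulSoup 3 module used above, the same parent check might look like the sketch below. It assumes bs4 is installed and uses the built-in "html.parser" backend; the helper name text_outside_anchors and the shortened HTML are made up for illustration. It uses find_parent("a") instead of comparing txt.parent.name, so text nested deeper inside a link (for example inside a <b> within an <a>) is skipped as well.

```python
from bs4 import BeautifulSoup

# Shortened version of the question's HTML (illustrative only).
html = """
<div><a href="www.test1.com/identify">test1</a></div>
<div>
  Identify me 1
  <p>This paragraph should be<b> identified </b>.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def text_outside_anchors(soup):
    """Return stripped, non-empty text nodes that have no <a> ancestor."""
    return [
        node.strip()
        for node in soup.find_all(string=True)
        # Skip whitespace-only nodes and anything nested inside a link.
        if node.strip() and node.find_parent("a") is None
    ]


texts = text_outside_anchors(soup)
print(texts)
```

The link text "test1" is left out of the result, while the text in the second div, the paragraph, and the bold element is kept.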

Another edit:

Here is another example that uses a regular expression and a replacement.

import BeautifulSoup
import re

soup = BeautifulSoup.BeautifulSoup()
html = '''
<html><body>
 <div> <a href="www.test1.com/identify">test1</a> </div>
 <div><br></div>
 <div><a href="www.test2.com/identify">test2</a></div>
 <div><br></div><div><br></div>
 <div>
   This should be identified 

   Identify me 1 

   Identify me 2 
   <p id="firstpara" align="center"> This paragraph should be<b> identified </b>.</p>
 </div>
</body></html>
'''
soup.feed(html)
for txt in soup.findAll(text=True):
    if re.search('identi',txt,re.I) and txt.parent.name != 'a':
        newtext = re.sub(r'identi(\w+)', r'replace\1', txt.lower())
        txt.replaceWith(newtext)
print(soup)


<html><body>
<div> <a href="www.test1.com/identify">test1</a> </div>
<div><br /></div>
<div><a href="www.test2.com/identify">test2</a></div>
<div><br /></div><div><br /></div>
<div>
   this should be replacefied 

   replacefy me 1 

   replacefy me 2 
   <p id="firstpara" align="center"> This paragraph should be<b> replacefied </b>.</p>
</div>
</body></html>
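
Note that the code above lowercases every matching text node and rewrites any word starting with "identi", which is why the output reads "replacefy" and "replacefied". A variant closer to the output requested in the question could look like the following sketch. It is written against the newer bs4 package (an assumption, since the answer uses BeautifulSoup 3), and the one-line HTML and the whole-word, case-sensitive pattern are illustrative choices rather than part of the original answer.

```python
import re
from bs4 import BeautifulSoup

# Illustrative input: one link and some text outside of it.
html = (
    '<div><a href="www.test1.com/identify">Identify link</a>'
    " Identify me 1 <b> identified </b></div>"
)
soup = BeautifulSoup(html, "html.parser")

# Case-sensitive, whole word: matches "Identify" but not "identified".
pattern = re.compile(r"\bIdentify\b")

# find_all returns a list, so replacing nodes while looping over it is safe.
for node in soup.find_all(string=pattern):
    # Leave text alone if it sits anywhere inside an <a> element.
    if node.find_parent("a") is None:
        node.replace_with(pattern.sub("Replaced", node))

result = str(soup)
print(result)
```

Here the text inside the link and the href value are left untouched, "Identify me 1" becomes "Replaced me 1", and the original capitalization of all other text is preserved because nothing is lowercased.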