我使用Beautiful Soup来解析html以查找
的所有文本1.不包含任何锚元素
我想出了这个代码,它找到了href中的所有链接,但没有反过来。
如何修改此代码以仅使用Beautiful Soup获取纯文本,以便我可以查找并替换和修改汤?
for a in soup.findAll('a',href=True):
print a['href']
修改
示例:
<html><body>
<div> <a href="www.test1.com/identify">test1</a> </div>
<div><br></div>
<div><a href="www.test2.com/identify">test2</a></div>
<div><br></div><div><br></div>
<div>
This should be identified
Identify me 1
Identify me 2
<p id="firstpara" align="center"> This paragraph should be<b> identified </b>.</p>
</div>
</body></html>
输出:
This should be identified
Identify me 1
Identify me 2
This paragraph should be identified.
我正在执行此操作以查找不在<a></a>
内的文本:然后找到“识别”并用“替换”替换操作
所以最终输出将是这样的:
<html><body>
<div> <a href="www.test1.com/identify">test1</a> </div>
<div><br></div>
<div><a href="www.test2.com/identify">test2</a></div>
<div><br></div><div><br></div>
<div>
This should be identified
Repalced me 1
Replaced me 2
<p id="firstpara" align="center"> This paragraph should be<b> identified </b>.</p>
</div>
</body></html>
谢谢你的时间!
答案 0 :(得分:3)
如果我理解你是正确的,你想得到一个包含href属性的元素内的文本。如果要获取元素的文本,可以使用.text
属性。
>>> soup = BeautifulSoup.BeautifulSoup()
>>> soup.feed('<a href="http://something.com">this is some text</a>')
>>> soup.findAll('a', href=True)[0]['href']
u'http://something.com'
>>> soup.findAll('a', href=True)[0].text
u'this is some text'
修改
这将找到所有文本元素,并在其中标识:
>>> soup = BeautifulSoup.BeautifulSoup()
>>> soup.feed(yourhtml)
>>> [txt for txt in soup.findAll(text=True) if 'identified' in txt.lower()]
[u'\n This should be identified \n\n Identify me 1 \n\n Identify me 2 \n ', u' identified ']
返回的对象属于BeautifulSoup.NavigableString
类型。如果您想检查父级是否为a
元素,您可以执行txt.parent.name == 'a'
。
另一个编辑:
这是另一个使用正则表达式和替换的示例。
import BeautifulSoup
import re
soup = BeautifulSoup.BeautifulSoup()
html = '''
<html><body>
<div> <a href="www.test1.com/identify">test1</a> </div>
<div><br></div>
<div><a href="www.test2.com/identify">test2</a></div>
<div><br></div><div><br></div>
<div>
This should be identified
Identify me 1
Identify me 2
<p id="firstpara" align="center"> This paragraph should be<b> identified </b>.</p>
</div>
</body></html>
'''
soup.feed(html)
for txt in soup.findAll(text=True):
if re.search('identi',txt,re.I) and txt.parent.name != 'a':
newtext = re.sub(r'identi(\w+)', r'replace\1', txt.lower())
txt.replaceWith(newtext)
print(soup)
<html><body>
<div> <a href="www.test1.com/identify">test1</a> </div>
<div><br /></div>
<div><a href="www.test2.com/identify">test2</a></div>
<div><br /></div><div><br /></div>
<div>
this should be replacefied
replacefy me 1
replacefy me 2
<p id="firstpara" align="center"> This paragraph should be<b> replacefied </b>.</p>
</div>
</body></html>