Is there a way to strip all HTML tags from a string, but keep some of the links and change how they are represented? For example:
description: <p>Animation params. For other animations, see <a href="#myA.animation">myA.animation</a> and the animation parameter under the API methods. The following properties are supported:</p>
<dl>
<dt>duration</dt>
<dd>The duration of the animation in milliseconds.</dd>
<dt>easing</dt>
<dd>A string reference to an easing function set on the <code>Math</code> object. See <a href="http://example.com">demo</a>.</dd>
</dl>
<p>
I want to replace
<a href="#myA.animation">myA.animation</a>
with just 'myA.animation', but
<a href="http://example.com">demo</a>
with 'demo:http://example.com'
Edit: this now seems to work:
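from bs4 import BeautifulSoup

def cleanComment(comment):
    soup = BeautifulSoup(comment, 'html.parser')
    for m in soup.find_all('a'):
        # Only rewrite anchors whose serialized form appears verbatim in the input.
        if str(m) in comment:
            # Fragment links ("#...") are left alone; get_text() below keeps only their text.
            if not m.get('href', '').startswith("#"):
                # Note: this emits the href first, then the link text.
                comment = comment.replace(str(m), m['href'] + " : " + m.get_text())
    # Re-parse the rewritten string and drop every remaining tag.
    soup = BeautifulSoup(comment, 'html.parser')
    comment = soup.get_text()
    return comment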
Answer 0 (score: 0)
This regular expression should work for you: (?=href="http)(?=(?=.*?">(.*?)<)(?=.*?"(https?:\/\/.*?)"))|"#(.*?)"
You can try it over here.
In Python:
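import re

# Read the HTML to scan from a file.
with open('textfile', 'r') as file:
    text = file.read()

# Groups 1 and 2 capture the text and URL of http(s) links;
# group 3 captures the target of "#..." links.
matches = re.findall(r'(?=href="http)(?=(?=.*?">(.*?)<)(?=.*?"(https?:\/\/.*?)"))|"#(.*?)"', text)
strings = []
for m in matches:
    # Drop the empty groups, then join what is left with ': '.
    m = filter(bool, m)
    strings.append(': '.join(m))
print(strings)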
This gives the following result: ['myA.animation', 'demo: http://example.com']
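Note that the regex only extracts the links; it does not return the surrounding text with the tags removed. If the full cleaned string is needed, one possible approach is to walk the markup with a parser instead. The following is a minimal sketch that assumes the standard library's html.parser is acceptable in place of BeautifulSoup; the names LinkTextExtractor and strip_html_keep_links are made up for this illustration, and the 'text: href' output format is assumed from the result shown above.

```python
from html.parser import HTMLParser


class LinkTextExtractor(HTMLParser):
    """Collect plain text, rewriting external <a> links as 'text: href'."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []        # output text fragments, in document order
        self.href = None       # href of the <a> being read, if it is external
        self.link_text = None  # buffered text of the current <a>, else None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            self.link_text = []
            # Fragment links ("#...") keep only their text.
            self.href = None if href.startswith("#") else href

    def handle_data(self, data):
        if self.link_text is not None:
            self.link_text.append(data)
        else:
            self.parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.link_text is not None:
            text = "".join(self.link_text)
            self.parts.append(f"{text}: {self.href}" if self.href else text)
            self.link_text = None
            self.href = None


def strip_html_keep_links(html):
    """Return html with all tags removed and external links rewritten."""
    parser = LinkTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = ('<p>See <a href="#myA.animation">myA.animation</a> and '
          '<a href="http://example.com">demo</a>.</p>')
print(strip_html_keep_links(sample))
```

All other tags (p, dl, dt, dd, code) are simply dropped and their text is kept, so whitespace between block elements is preserved exactly as it appears in the input.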