Python - Strip HTML tags from a string, keeping links but changing their form

Time: 2017-02-23 09:53:44

Tags: python html

Is there a way to remove all HTML tags from a string, but keep some of the links and change how they are represented? For example:

    description: <p>Animation params. For other animations, see <a href="#myA.animation">myA.animation</a> and the animation parameter under the API methods.       The following properties are supported:</p>
<dl>
  <dt>duration</dt>
  <dd>The duration of the animation in milliseconds.</dd>
<dt>easing</dt>
<dd>A string reference to an easing function set on the <code>Math</code> object. See <a href="http://example.com">demo</a>.</dd>
</dl>
<p>

I want to replace

<a href="#myA.animation">myA.animation</a> 

with just 'myA.animation', but replace

<a href="http://example.com">demo</a>

with 'demo:http://example.com'

Edit: this now seems to work:

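from bs4 import BeautifulSoup

def cleanComment(comment):
    soup = BeautifulSoup(comment, 'html.parser')
    for m in soup.find_all('a'):
        href = m.get('href', '')
        # Rewrite only links that have a non-fragment href and whose serialized
        # form still appears in the text. Fragment links ("#...") are left alone
        # and reduced to their text by get_text() below.
        if href and not href.startswith("#") and str(m) in comment:
            comment = comment.replace(str(m), href + " : " + m.get_text())
    # Re-parse the modified string and drop all remaining tags.
    soup = BeautifulSoup(comment, 'html.parser')
    return soup.get_text()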
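
The same idea can also be sketched without string replacement on the serialized tag, which depends on `str(m)` matching the original markup exactly. The version below is only an illustration under some assumptions: it uses the standard library's `html.parser.HTMLParser` instead of BeautifulSoup, the names `LinkTextExtractor` and `strip_tags_keep_links` are made up for this example, and it writes external links as 'text: href'.

```python
from html.parser import HTMLParser


class LinkTextExtractor(HTMLParser):
    """Collect text only, appending ': href' after non-fragment links."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []    # output text fragments, in document order
        self.href = None   # href of the currently open external <a>, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            # Fragment links ("#...") keep only their text.
            self.href = href if href and not href.startswith("#") else None

    def handle_endtag(self, tag):
        if tag == "a" and self.href:
            self.parts.append(": " + self.href)
            self.href = None

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags_keep_links(html):
    parser = LinkTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = ('<p>See <a href="#myA.animation">myA.animation</a> and '
          '<a href="http://example.com">demo</a>.</p>')
print(strip_tags_keep_links(sample))
```

Because the parser works on events rather than on the serialized markup, it does not matter how the original `<a>` tag was written (attribute order, quoting), as long as the parser can read it.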

1 answer:

Answer 0 (score: 0)

This regex should work for you: (?=href="http)(?=(?=.*?">(.*?)<)(?=.*?"(https?:\/\/.*?)"))|"#(.*?)"

You can try it over here

In Python:

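import re

# Read the HTML to scan from the file named 'textfile'.
with open('textfile', 'r') as f:
    text = f.read()

# Groups 1 and 2 capture the link text and URL of http(s) links;
# group 3 captures the fragment after '#' in a "#..." attribute value.
# A raw string keeps the backslashes from being read as string escapes.
pattern = r'(?=href="http)(?=(?=.*?">(.*?)<)(?=.*?"(https?:\/\/.*?)"))|"#(.*?)"'
matches = re.findall(pattern, text)

strings = []
for m in matches:
    # Drop the empty groups and join what remains with ': '.
    strings.append(': '.join(filter(bool, m)))

print(strings)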

The result is: ['myA.animation', 'demo: http://example.com']
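
The snippet above only lists the links. To get the whole cleaned string the question asks for, a substitution-based variant could look like the sketch below. It is a rough illustration, not a robust HTML parser: it assumes double-quoted `href` attributes and no nested tags inside the link text, and the names `ANCHOR_RE`, `TAG_RE`, `rewrite_anchor` and `clean_with_regex` are invented for this example.

```python
import re
from html import unescape

# Matches <a ... href="...">text</a>. Assumes a double-quoted href and
# link text without nested tags.
ANCHOR_RE = re.compile(r'<a\s[^>]*?href="([^"]*)"[^>]*>(.*?)</a>',
                       re.IGNORECASE | re.DOTALL)
# Matches any remaining tag.
TAG_RE = re.compile(r'<[^>]+>')


def rewrite_anchor(match):
    href, label = match.group(1), match.group(2)
    if href.startswith('#'):
        return label                      # fragment link: keep the text only
    return '{}: {}'.format(label, href)   # external link: 'text: url'


def clean_with_regex(html):
    text = ANCHOR_RE.sub(rewrite_anchor, html)  # rewrite links first
    text = TAG_RE.sub('', text)                 # then drop every other tag
    return unescape(text)                       # decode entities such as &amp;


snippet = ('<dd>See <a href="#myA.animation">myA.animation</a> and '
           '<a href="http://example.com">demo</a>.</dd>')
print(clean_with_regex(snippet))
```

Regular expressions can misread unusual markup (single-quoted attributes, `>` inside attribute values, comments), so a real parser is the safer choice for input you do not control.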