Question

Is there a way to strip all HTML tags from a string, but keep some of the links and change how they are represented? For example:

    description: <p>Animation params. For other animations, see <a href="#myA.animation">myA.animation</a> and the animation parameter under the API methods.       The following properties are supported:</p>
    <dl>
      <dt>duration</dt>
      <dd>The duration of the animation in milliseconds.</dd>
    <dt>easing</dt>
    <dd>A string reference to an easing function set on the <code>Math</code> object. See <a href="http://example.com">demo</a>.</dd>
    </dl>
    <p>

I want to replace

<a href="#myA.animation">myA.animation</a>

with just 'myA.animation', but replace

<a href="http://example.com">demo</a>

with 'demo: http://example.com'.

Edit: this now seems to work:

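from bs4 import BeautifulSoup

def cleanComment(comment):
    soup = BeautifulSoup(comment, 'html.parser')
    for m in soup.find_all('a'):
        href = m.get('href', '')
        # In-page anchors ("#...") are left alone; get_text() below reduces them to their text.
        if href and not href.startswith("#"):
            # Replace the tag in the parse tree rather than string-replacing str(m),
            # which fails when the parser serializes the tag differently from the input.
            m.replace_with(href + " : " + m.get_text())
    return soup.get_text()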
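
The form requested in the question puts the link text first ('demo: http://example.com'). For comparison, here is a minimal sketch of that transformation using only the standard library's html.parser instead of BeautifulSoup. The names LinkKeepingStripper and strip_tags_keep_links are hypothetical, and the sketch assumes that links do not nest.

```python
from html.parser import HTMLParser


class LinkKeepingStripper(HTMLParser):
    """Collect text content, appending the URL after non-anchor links."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []    # pieces of the output text, in document order
        self.href = None   # href of the <a> element currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.href = dict(attrs).get("href") or ""

    def handle_endtag(self, tag):
        if tag == "a":
            # In-page anchors ("#...") keep only their text; other links get ": <url>".
            if self.href and not self.href.startswith("#"):
                self.parts.append(": " + self.href)
            self.href = None

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags_keep_links(html):
    parser = LinkKeepingStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = ('<p>See <a href="#myA.animation">myA.animation</a> and '
          '<a href="http://example.com">demo</a>.</p>')
print(strip_tags_keep_links(sample))
# prints: See myA.animation and demo: http://example.com.
```

Because the parser only reacts to <a> tags and text, every other tag is dropped without needing a second pass.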

Answer 1

This regular expression should work for you: (?=href="http)(?=(?=.*?">(.*?)<)(?=.*?"(https?:\/\/.*?)"))|"#(.*?)"

You can try it over here

In Python:

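import re

# Read the HTML text to search.
with open('textfile', 'r') as file:
    text = file.read()

# Raw string, so the backslashes reach the regex engine unchanged.
pattern = r'(?=href="http)(?=(?=.*?">(.*?)<)(?=.*?"(https?:\/\/.*?)"))|"#(.*?)"'
matches = re.findall(pattern, text)

strings = []
for m in matches:
    # Each match is a (link text, URL, anchor name) tuple; drop the empty groups.
    strings.append(': '.join(filter(bool, m)))

print(strings)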

This gives the following result: ['myA.animation', 'demo: http://example.com']
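
The findall call extracts the link strings but leaves the surrounding text untouched. If the goal is to rewrite the links in place and then drop the remaining tags, a rough sketch with re.sub and a replacement function could look like this. The names LINK_RE, TAG_RE, rewrite_link and rewrite_links_and_strip are hypothetical. The sketch assumes double-quoted href attributes and no nested tags inside the link text, and it does not decode HTML entities. Regex-based HTML handling is fragile, so treat it as an illustration rather than a robust parser.

```python
import re

# Assumption: href values are double-quoted and link text contains no nested tags.
LINK_RE = re.compile(r'<a\s+[^>]*?href="([^"]*)"[^>]*>(.*?)</a>',
                     re.IGNORECASE | re.DOTALL)
# Naive pattern for any remaining tag.
TAG_RE = re.compile(r'<[^>]+>')


def rewrite_link(match):
    href, text = match.group(1), match.group(2)
    # In-page anchors keep only their text; other links become "text: url".
    if href.startswith('#'):
        return text
    return f'{text}: {href}'


def rewrite_links_and_strip(html):
    # First rewrite the links, then remove whatever tags are left.
    return TAG_RE.sub('', LINK_RE.sub(rewrite_link, html))


html = ('<p>See <a href="#myA.animation">myA.animation</a> and '
        '<a href="http://example.com">demo</a>.</p>')
print(rewrite_links_and_strip(html))
# prints: See myA.animation and demo: http://example.com.
```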

Python - Strip HTML tags from a string, keeping links but changing their form
