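I want to wrap the text in the following string in link tags. I'm using re.sub. It works, but I also need each of the two link tags to have a different ID. How can I do that?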
input = "<span>Replace this</span> and <span>this</span>"
result = re.compile(r'>(.*?)<', re.I).sub(r'><a id="[WHAT TO PUT HERE?]" class="my_class">\1</a><', input)
The link tags in the output should have different IDs:
"<span><a id="id1" class="my_class">Replace this</a></span> and <span><a id="id2" class="my_class">this</a></span>"
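Answer 0 (score: 1)
As the link from Christian König says, parsing HTML with regular expressions is generally not a wise idea. If you are very careful, and the HTML is relatively simple and stable, you can sometimes get away with it, but your code may break if the format of the page you're parsing changes. But anyway...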
The pattern given above doesn't work: it will also perform the substitution on "> and <".
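To make that failure concrete, here is a small sketch comparing what the question's pattern captures with what a pattern anchored on the span tags captures. The variable names (`text`, `loose`, `strict`) are made up for this illustration, and `text` is used instead of `input` to avoid shadowing the builtin.

```python
import re

text = "<span>Replace this</span> and <span>this</span>"

# The question's pattern: anything between a '>' and the next '<'.
loose = re.compile(r'>(.*?)<', re.I)
# A stricter pattern that only matches text enclosed by <span>...</span>.
strict = re.compile(r'<span>(.*?)</span>', re.I)

loose_matches = loose.findall(text)
strict_matches = strict.findall(text)

print(loose_matches)   # ['Replace this', ' and ', 'this']
print(strict_matches)  # ['Replace this', 'this']
```

The loose pattern also captures ' and ', the text between the closing tag of the first span and the opening tag of the second, so that text would be wrapped in a link as well.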
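Here's one way to do what you want. We pass a function as the repl arg to re.sub, and we give that function a counter (as a function attribute) so it knows which id number to use. The counter is incremented on every replacement, but you can set it to any value you like before calling re.sub.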
import re

pat = re.compile(r'<span>(.*?)</span>', re.I)

def repl(m):
    # Wrap the captured span text in a link that carries the current id number.
    fmt = '<span><a id="id{}" class="my_class">{}</a></span>'
    result = fmt.format(repl.count, m.group(1))
    repl.count += 1
    return result

# Start numbering at 1.
repl.count = 1

data = (
    "<span>Replace this</span> and <span>that</span>",
    "<span>Another</span> test <span>string</span> of <span>tags</span>",
)

for s in data:
    print('In : {!r}\nOut: {!r}\n'.format(s, pat.sub(repl, s)))

# Restart numbering at 10.
repl.count = 10
for s in data:
    print('In : {!r}\nOut: {!r}\n'.format(s, pat.sub(repl, s)))
Output
In : '<span>Replace this</span> and <span>that</span>'
Out: '<span><a id="id1" class="my_class">Replace this</a></span> and <span><a id="id2" class="my_class">that</a></span>'
In : '<span>Another</span> test <span>string</span> of <span>tags</span>'
Out: '<span><a id="id3" class="my_class">Another</a></span> test <span><a id="id4" class="my_class">string</a></span> of <span><a id="id5" class="my_class">tags</a></span>'
In : '<span>Replace this</span> and <span>that</span>'
Out: '<span><a id="id10" class="my_class">Replace this</a></span> and <span><a id="id11" class="my_class">that</a></span>'
In : '<span>Another</span> test <span>string</span> of <span>tags</span>'
Out: '<span><a id="id12" class="my_class">Another</a></span> test <span><a id="id13" class="my_class">string</a></span> of <span><a id="id14" class="my_class">tags</a></span>'
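If you'd rather not keep the counter as shared state on the function, one possible variation (not part of the code above) is to build the replacement function in a factory so that each one gets its own private itertools.count. This is only a sketch: the names `make_repl`, `PAT`, `out_default` and `out_from_10` are made up for the example.

```python
import re
from itertools import count

PAT = re.compile(r'<span>(.*?)</span>', re.I)

def make_repl(start=1):
    """Return a replacement function with its own private id counter."""
    ids = count(start)

    def repl(m):
        # next(ids) yields start, start + 1, ... across successive matches.
        fmt = '<span><a id="id{}" class="my_class">{}</a></span>'
        return fmt.format(next(ids), m.group(1))

    return repl

s = "<span>Replace this</span> and <span>that</span>"
out_default = PAT.sub(make_repl(), s)
out_from_10 = PAT.sub(make_repl(10), s)
print(out_default)
print(out_from_10)
```

Each call to make_repl starts a fresh sequence, so numbering never carries over from one re.sub call to the next unless you deliberately reuse the same replacement function.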