Question

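I need to replace & with its named entity (&amp;) in an input string, but the string may already contain other named entities and decimal entities that start with &, and those must be left untouched.

Code: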

import re
text = ' At&T, " < I am > , At&T  so  &#60; &lt; &  & '

# Find all named entities and decimal entities.
replace_tmp = re.findall(r"&#\d+;|&[a-z]+;", text)

# Swap each entity for a temporary placeholder.
tmp_dict = {}
count = 1
for i in replace_tmp:
    text = text.replace(i, "$%d$" % count)
    tmp_dict["$%d$" % count] = i
    count += 1

# Replace the remaining & with &amp;
text = text.replace("&", "&amp;")

# Restore the original entities from the placeholders.
for i in tmp_dict:
    text = text.replace(i, tmp_dict[i])

print(text)

Final output: At&amp;T, " < I am > , At&amp;T so &#60; &lt; &amp; &amp;

But is there a single regular expression that does the above directly?

Last line in the .py file:

value = re.sub(r'&(?!(#[0-9]+;|[a-zA-Z]+;))', '&amp;', value).replace("<", "&lt;").replace(">", "&gt;").replace('"', "&quot;")
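The one-liner appears to combine an ampersand pass with replacements for <, > and ". A minimal sketch of that idea follows, assuming the intended replacement strings are &amp;, &lt;, &gt; and &quot;. The names `BARE_AMP` and `escape_value` are hypothetical and not part of the original post.

```python
import re

# & that is NOT the start of a decimal entity (&#60;) or a named entity (&lt;).
BARE_AMP = re.compile(r'&(?!(#[0-9]+;|[a-zA-Z]+;))')


def escape_value(value):
    """Escape bare &, then <, > and ", leaving existing entities alone (illustrative helper)."""
    # Handle & first so the entities written by the later steps are never touched again.
    value = BARE_AMP.sub('&amp;', value)
    return value.replace('<', '&lt;').replace('>', '&gt;').replace('"', '&quot;')


print(escape_value('a & "b" <c> &lt; &#60;'))
```

Running the ampersand pass before the plain replacements keeps the steps independent: the later `replace` calls only add entities, which the lookahead would skip anyway.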

Answer 1

Use re.sub with a negative lookahead.

import re
text =' At&T, " < I am > , At&T  so  &#60; &lt; &  & '

text = re.sub(r'&(?![\w\d#]+?;)',"&amp;",text)
print(text)

Answer 2

>>> import re
>>> re.sub(r'&(?!(#[0-9]+;|\w+;))', '&amp;', ' At&T, " < I am > , At&T  so  &#60; &lt; &  & ')
' At&amp;T, " < I am > , At&amp;T  so  &#60; &lt; &amp;  &amp; '

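You can use a negative lookahead assertion for \w+; (which covers named entities such as &lt;) and #[0-9]+; (which covers decimal entities such as &#60;).

So the regular expression is:

&(?!(#[0-9]+;|\w+;)) The negative lookahead asserts that the & is followed by neither #[0-9]+; nor \w+;

You can also use [a-zA-Z]+; instead of \w+;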
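The difference between the two character classes can be seen in a small sketch. The helper name `escape_bare_ampersands`, the constant names, and the sample string are made up for illustration; only `re` and the two patterns come from the answer. Note that \w also matches digits and underscores, so it accepts names that [a-zA-Z] rejects.

```python
import re

# Pattern from the answer: & not followed by a decimal entity or a \w+ name.
WORD_PATTERN = re.compile(r'&(?!(#[0-9]+;|\w+;))')
# Stricter variant: entity names may only contain ASCII letters.
ALPHA_PATTERN = re.compile(r'&(?!(#[0-9]+;|[a-zA-Z]+;))')


def escape_bare_ampersands(text, pattern=ALPHA_PATTERN):
    """Replace every & that does not start an entity with &amp; (illustrative helper)."""
    return pattern.sub('&amp;', text)


sample = 'At&T &lt; &#60; &x_1; &'
# \w+ accepts "x_1" as an entity name, so "&x_1;" is left alone.
print(escape_bare_ampersands(sample, WORD_PATTERN))
# [a-zA-Z]+ stops at the underscore, so "&x_1;" is treated as a bare &.
print(escape_bare_ampersands(sample, ALPHA_PATTERN))
```

Which variant fits depends on how strictly the input should be checked; neither pattern verifies that a name such as `lt` is a real entity.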
