Question

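I am using a regular expression to replace a substring inside matching strings in a df Series. I read the documentation carefully (e.g. HERE) and found a solution that captures the specific kind of string I want to match. However, during the replacement it does not replace just the substring.

I have cases like the following: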

data
initthe problem
nationthe airline
radicthe groups
professionthe experience
the cat in the hat

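In this particular case, I want to replace "the" with "al" wherever "the" is not a standalone string (that is, where it is attached to the end of another word and followed by whitespace).

I tried the following solution: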

patt = re.compile(r'(?:[a-z])(the)')
df['data'].str.replace(patt, r'al')

However, it also replaces the non-whitespace character that precedes "the".

Any suggestions on how to handle this specific substring case?

Answer 1

Try a lookbehind. It checks (asserts) the character before "the" but does not actually consume anything:

import re

text = "data\ninitthe problem\nnationthe airline\nradicthe groups\nprofessionthe experience\nthe cat in the hat"

# (?<=[a-z]) requires a lowercase letter before "the" without making it part of the match
output = re.sub(r'(?<=[a-z])the', 'al', text)
print(output)

data
inital problem
national airline
radical groups
professional experience
the cat in the hat
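
The same lookbehind can probably be passed straight to pandas. The following is a minimal sketch, assuming the column is named data as in the question and that pandas is installed. The names consumed and fixed are only illustrative. It contrasts the non-capturing group from the question, which still consumes the preceding letter, with the lookbehind, which does not:

```python
import pandas as pd

# Sample frame rebuilt from the question's data (column name "data" is assumed).
df = pd.DataFrame({"data": [
    "initthe problem",
    "nationthe airline",
    "radicthe groups",
    "professionthe experience",
    "the cat in the hat",
]})

# Non-capturing group: the letter before "the" is still part of the match,
# so it gets replaced along with "the".
consumed = df["data"].str.replace(r"(?:[a-z])the", "al", regex=True)

# Lookbehind: the letter is only asserted, not matched, so it is kept.
fixed = df["data"].str.replace(r"(?<=[a-z])the", "al", regex=True)

print(consumed.tolist())
print(fixed.tolist())
```

If "the" should only be replaced at the very end of a word, a lookahead such as (?=\s|$) could be appended to the pattern. That goes beyond the answer above and is only a suggestion.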
