Python regex replaces a string that does not match

Date: 2017-04-22 15:05:20

Tags: python regex

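Update: This problem was caused by a bug in the regex module, which the developers fixed in commit be893e9.

If you run into a similar problem, update the regex module.
You need version 2017.04.23 or later.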

See here for more information.
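To check whether an installed copy of regex is new enough, a minimal sketch follows. It assumes Python 3.8+ (for importlib.metadata) and that release strings are purely numeric and dot-separated, as 2017.04.23 is. The names REQUIRED, version_tuple and is_fixed are hypothetical helpers for illustration, not part of the regex module.

```python
from importlib import metadata

REQUIRED = "2017.04.23"  # first fixed release, as stated in the update above


def version_tuple(version):
    """Turn a dotted release string such as '2017.04.23' into a tuple of ints.

    Assumption: every component is numeric.
    """
    return tuple(int(part) for part in version.split("."))


def is_fixed(installed, required=REQUIRED):
    """Return True if the installed release is at least the required one."""
    return version_tuple(installed) >= version_tuple(required)


if __name__ == "__main__":
    try:
        # Release string as recorded in the installed package metadata.
        installed = metadata.version("regex")
    except metadata.PackageNotFoundError:
        print("regex is not installed")
    else:
        status = "OK" if is_fixed(installed) else "too old, upgrade needed"
        print(installed, status)
```

If the check reports an old release, `pip install --upgrade regex` is the usual way to update it.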

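Background: In a third-party Text2Speech engine I use a collection of regular expressions (english.lex) to normalize the input text before it is spoken.

For debugging purposes I wrote the script below to see what effect my collection of regular expressions actually has on the input text.

My problem is that it performs a replacement for a regular expression that simply does not match.

I have 3 files: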

regex_preview.py

#!/usr/bin/env python
import codecs
import regex as re

input="Text2Speach Regex Test.txt"
dictionary="english.lex"

with codecs.open(dictionary, "r", "utf16") as f:
    reg_exen = f.readlines()
    with codecs.open(input, "r+", "utf16") as g:
        content = g.read().replace(r'\\\\\"','"')

        # apply all regular expressions to content
        for line in reg_exen:
            line=line.strip()

            # skip comments
            if line == "" or line[0] == "#":
                pass
            else:
                # remove " from lines and split them into pattern and substitute
                pattern=re.sub('" "(.*[^\\\\])?"$','', line)[1:].replace('\\"','"')
                substitute=re.sub('\\\\"', '"', re.sub('^".*[^\\\\]" "', '', line)[:-1]).replace('\\"','"')

                print("\n'%s' ==> '%s'" % (pattern, substitute))

                print(content.strip())
                content = re.sub(pattern, substitute, content)
                print(content.strip())
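The two nested re.sub calls that split a rule line into pattern and substitute are hard to read. Below is a minimal sketch of an alternative. It assumes every rule line has exactly the form "pattern" "substitute", with inner quotes escaped as \". RULE_LINE and parse_rule are hypothetical names, not part of the script or of the Text2Speech engine.

```python
import re

# Two double-quoted fields separated by one space. Inside a field a backslash
# escapes the following character, so \" does not terminate the field.
RULE_LINE = re.compile(r'^"((?:[^"\\]|\\.)*)" "((?:[^"\\]|\\.)*)"$')


def parse_rule(line):
    """Return (pattern, substitute) for a rule line, or None for blanks and comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = RULE_LINE.match(line)
    if match is None:
        raise ValueError("unrecognised rule line: %r" % line)
    # Only \" is unescaped here; every other backslash sequence is left
    # untouched for the regex engine to interpret.
    pattern, substitute = (field.replace('\\"', '"') for field in match.groups())
    return pattern, substitute
```

A line that is neither a comment nor a well-formed rule raises ValueError, which makes malformed lexicon entries visible instead of silently producing a wrong pattern.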

english.lex - UTF-16 encoded

# punctuation normalization
"(《|》|⟪|⟫|<|>|«|»|”|“|″|‴)+" "\""
"(…|—)" "..."

# stammered words: more general version accepting all words like ab... abcde (stammered words with vocal in stammered part)
"(?i)(?<=\b)(?:(\w{1,3})(?:-|\.{2,10})[\t\f ]?)+(\1\w{2,})" "\1-\2"
# this should not match, but somehow it does o.O
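For reference, this is a small sketch of what the stammer rule is meant to do, using the "ab... abcde" example from its comment. It uses the standard-library re module, which is an assumption made for illustration: the Text2Speech engine's own regex flavour may behave differently. STAMMER and collapse_stammer are hypothetical names.

```python
import re  # standard library; the TTS engine's regex flavour may differ

# The stammer rule exactly as written in english.lex.
STAMMER = r"(?i)(?<=\b)(?:(\w{1,3})(?:-|\.{2,10})[\t\f ]?)+(\1\w{2,})"


def collapse_stammer(text):
    """Apply the stammer rule with the same substitute as in english.lex."""
    return re.sub(STAMMER, r"\1-\2", text)


# Stammered word: the short prefix "ab" is repeated at the start of "abcde".
print(collapse_stammer("ab... abcde"))
# Not a stammer: "yes" does not start with "Erm", so nothing should change.
print(collapse_stammer("Erm....yes."))
```

The second call is the case from this question: the final group requires the captured prefix to reappear at the start of the following word, which "yes" does not satisfy.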

Text2Speach Regex Test.txt - UTF-16 encoded

“Erm….yes. Thank you for that.”

Running the script produces this output, in which the last regular expression matches the content:

'(《|》|⟪|⟫|<|>|«|»|”|“|″|‴)+' ==> '"'
“Erm….yes. Thank you for that.”
"Erm….yes. Thank you for that."

'(…|—)' ==> '...'
"Erm….yes. Thank you for that."
"Erm....yes. Thank you for that."

'(?i)(?<=\b)(?:(\w{1,3})(?:-|\.{2,10})[\t ]?)+(\1\w{2,})' ==> '\1-\2'
"Erm....yes. Thank you for that."
"-yes. Thank you for that."

What I have tried so far:

I created this snippet to reproduce the problem:

#!/usr/bin/env python

import re
import codecs

content = u'"Erm....yes. Thank you for that."\n'
pattern = r"(?i)(?<=\b)(?:(\w{1,3})(?:-|\.{2,10})[\t ]?)+(\1\w{2,})"
substitute = r"\1-\2"
content = re.sub(pattern, substitute, content)

print(content)

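But this actually behaves as it should, so I am at a loss as to what is going on here.

I hope someone can point me in the right direction for further investigation...

1 Answer:

Answer 0: (score: 2)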

The original script uses the alternative regex module instead of the standard library re module.

import regex as re

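In this case there is evidently some difference between the two. My guess is that it has to do with nested groups: the expression contains a capturing group inside a repeated non-capturing group, which is too much magic for me.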
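To make the nested-group construct less magical, here is a reduced sketch with the standard-library re module. It only illustrates how re treats a capturing group inside a repeated group followed by a backreference, as I understand it. It says nothing about the internals of the regex module. last_iteration_groups is a hypothetical helper name.

```python
import re  # standard library only


def last_iteration_groups(text):
    """Match a reduced version of the stammer pattern and return its groups.

    (\\w) sits inside the repeated non-capturing group, so after the loop it
    holds the capture from the last successful iteration, and \\1 refers to
    that value.
    """
    match = re.match(r"(?:(\w)-)+(\1\w+)", text)
    return match.groups() if match else None


# The loop runs for "a-" and "b-"; the last capture "b" must start the next word.
print(last_iteration_groups("a-b-bcd"))
# Here the following word starts with "a", so no split of the input satisfies \1.
print(last_iteration_groups("a-b-acd"))
```

A capture made during a loop iteration that ultimately fails should not leak into the backreference. The comparison of both modules on the original input follows.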

import re     # standard library
import regex  # completely different implementation

content = '"Erm....yes. Thank you for that."'
pattern = r"(?i)(?<=\b)(?:(\w{1,3})(?:-|\.{2,10})[\t ]?)+(\1\w{2,})"
substitute = r"\1-\2"

print(re.sub(pattern, substitute, content))
print(regex.sub(pattern, substitute, content))

Output:

"Erm....yes. Thank you for that."
"-yes. Thank you for that."