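Update: This problem was caused by a bug in the regex module, which the developer fixed in commit be893e9.
If you run into a similar problem, update the regex module.
You need version 2017.04.23 or later.
See here for more information.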
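To check whether an installed copy already includes the fix, the installed distribution version can be compared against 2017.04.23. The following is a minimal sketch using only the standard library (Python 3.8+ for `importlib.metadata`). `parse_version` and `has_fix` are hypothetical helper names, and the sketch assumes the version string consists of dot-separated numbers.

```python
from importlib import metadata

MIN_FIXED = (2017, 4, 23)  # version named in the update above


def parse_version(text):
    """Turn '2017.04.23' into (2017, 4, 23); non-numeric parts are skipped."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def has_fix(version_text, minimum=MIN_FIXED):
    """True if version_text is at least the first fixed release."""
    return parse_version(version_text) >= minimum


if __name__ == "__main__":
    try:
        installed = metadata.version("regex")
    except metadata.PackageNotFoundError:
        print("regex is not installed")
    else:
        status = "includes the fix" if has_fix(installed) else "too old, upgrade"
        print(installed, status)
```

The distribution version is read through `importlib.metadata` because it is the number shown by the package index, which is what the version above refers to.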
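Background: I use a collection of regular expressions (english.lex) in a third-party Text2Speech engine to normalize the input text before it is spoken.
For debugging purposes I wrote the script below to see what effect my collection of regular expressions actually has on the input text.
My problem is that it performs a replacement for a regex that simply does not match.
regex_preview.py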
#!/usr/bin/env python
import codecs
import regex as re

input="Text2Speach Regex Test.txt"
dictionary="english.lex"

with codecs.open(dictionary, "r", "utf16") as f:
    reg_exen = f.readlines()

with codecs.open(input, "r+", "utf16") as g:
    content = g.read().replace(r'\\\\\"','"')

# apply all regular expressions to content
for line in reg_exen:
    line=line.strip()
    # skip comments
    if line == "" or line[0] == "#":
        pass
    else:
        # remove " from lines and split them into pattern and substitute
        pattern=re.sub('" "(.*[^\\\\])?"$','', line)[1:].replace('\\"','"')
        substitute=re.sub('\\\\"', '"', re.sub('^".*[^\\\\]" "', '', line)[:-1]).replace('\\"','"')
        print("\n'%s' ==> '%s'" % (pattern, substitute))
        print(content.strip())
        content = re.sub(pattern, substitute, content)
        print(content.strip())
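The slicing and nested `re.sub` calls that split each line into pattern and substitute can be hard to follow. Below is a minimal sketch of an alternative, assuming every rule has the form `"pattern" "substitute"` and that `\"` is the only escape that needs undoing. `RULE_RE` and `parse_lex_line` are hypothetical names and not part of the Text2Speech engine.

```python
import re

# One rule per line: "pattern" "substitute". Inside a field, a backslash
# escapes the next character, so \" does not end the field.
RULE_RE = re.compile(r'^"((?:[^"\\]|\\.)*)"\s+"((?:[^"\\]|\\.)*)"$')


def parse_lex_line(line):
    """Return (pattern, substitute), or None for blank and comment lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = RULE_RE.match(line)
    if match is None:
        raise ValueError("not a rule line: %r" % line)
    # Only \" is unescaped; other backslashes belong to the regex itself.
    pattern, substitute = (field.replace('\\"', '"') for field in match.groups())
    return pattern, substitute


print(parse_lex_line(r'"(a|b)+" "\""'))
```

Raising on malformed lines, rather than silently producing an odd pattern, makes it easier to tell a broken rule file apart from a misbehaving regex engine.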
english.lex - UTF-16 encoded
# punctuation normalization
"(《|》|⟪|⟫|<|>|«|»|”|“|″|‴)+" "\""
"(…|—)" "..."
# stammered words: more general version accepting all words like ab... abcde (stammered words with vocal in stammered part)
"(?i)(?<=\b)(?:(\w{1,3})(?:-|\.{2,10})[\t\f ]?)+(\1\w{2,})" "\1-\2"
# this should not match, but somehow it does o.O
Text2Speach Regex Test.txt - UTF-16 encoded
“Erm….yes. Thank you for that.”
Running the script produces this output, in which the last regex matches the content even though it should not:
'(《|》|⟪|⟫|<|>|«|»|”|“|″|‴)+' ==> '"'
“Erm….yes. Thank you for that.”
"Erm….yes. Thank you for that."
'(…|—)' ==> '...'
"Erm….yes. Thank you for that."
"Erm....yes. Thank you for that."
'(?i)(?<=\b)(?:(\w{1,3})(?:-|\.{2,10})[\t ]?)+(\1\w{2,})' ==> '\1-\2'
"Erm....yes. Thank you for that."
"-yes. Thank you for that."
I created this snippet to reproduce the problem:
#!/usr/bin/env python
import re
import codecs
content = u'"Erm....yes. Thank you for that."\n'
pattern = r"(?i)(?<=\b)(?:(\w{1,3})(?:-|\.{2,10})[\t ]?)+(\1\w{2,})"
substitute = r"\1-\2"
content = re.sub(pattern, substitute, content)
print(content)
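But this one actually behaves as it should, so I am at a loss as to what is going on here.
I hope someone can point me in the right direction for further investigation...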
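When a substitution fires unexpectedly, it can help to look at the match spans and captured groups instead of only the replaced text. The sketch below uses the standard library `re` module; `find_matches` is a hypothetical helper name, and the stammered sample string is an illustrative input, not taken from the test file.

```python
import re


def find_matches(pattern, text):
    """List (start, end, groups) for every match of pattern in text."""
    return [(m.start(), m.end(), m.groups()) for m in re.finditer(pattern, text)]


pattern = r"(?i)(?<=\b)(?:(\w{1,3})(?:-|\.{2,10})[\t ]?)+(\1\w{2,})"

# The sentence from the test file.
print(find_matches(pattern, '"Erm....yes. Thank you for that."'))
# A genuinely stammered word, for comparison.
print(find_matches(pattern, "st-st-stammer"))
```

With the standard library the first call should report no matches, which is consistent with the snippet above leaving the sentence unchanged. Running the same helper against another regex implementation shows exactly which span and groups it believes it matched.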
Answer 0 (score: 2)
The original script uses the alternative regex module rather than the standard library re module.
import regex as re
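In this case there is evidently some difference between the two. My guess is that it has to do with nested groups. This expression contains a capturing group inside a non-capturing group, which is too much magic for me.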
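For reference, this is how the construct behaves in the standard library `re` module: a capturing group inside a repeated group keeps only the text from its last repetition, and a later backreference is compared against that text. The sketch uses a simplified pattern and made-up sample strings, neither of which comes from english.lex.

```python
import re

# Simplified form of the stammer rule: group 1 sits inside a repeated
# non-capturing group, and \1 must then reappear at the start of group 2.
pattern = r"(?:(\w{1,3})-)+(\1\w{2,})"

m = re.search(pattern, "a-b-bcd")
# Group 1 was matched twice ("a", then "b"); only the last value is kept.
print(m.groups())

# With this pattern the substitution collapses the stammer to one prefix.
print(re.sub(pattern, r"\1-\2", "b-b-but"))

# No match when the word does not start with the last captured prefix.
print(re.search(pattern, "a-b-cde"))
```

An implementation that compares the backreference against a stale or empty value of group 1 would behave differently here, so a reduced pattern like this is a reasonable starting point when comparing the two modules.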
import re # standard library
import regex # completely different implementation
content = '"Erm....yes. Thank you for that."'
pattern = r"(?i)(?<=\b)(?:(\w{1,3})(?:-|\.{2,10})[\t ]?)+(\1\w{2,})"
substitute = r"\1-\2"
print(re.sub(pattern, substitute, content))
print(regex.sub(pattern, substitute, content))
Output:
"Erm....yes. Thank you for that."
"-yes. Thank you for that."