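The goal is to handle the tokenization task in NLP and to port the script from a Perl script to this Python script.
The main problem is the spurious backslashes that show up when we run the Python port of the tokenizer.
In Perl, we might need to escape the single quote and the ampersand: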
my($text) = @_; # Reading a text from stdin
$text =~ s=n't = n't =g; # Puts a space before the "n't" substring to tokenize english contractions like "don't" -> "do n't".
$text =~ s/\'/\&apos;/g; # Escape the single quote so that it suits XML.
Porting the regexes literally to Python:
>>> import re
>>> from six import text_type
>>> sent = text_type("this ain't funny")
>>> escape_singquote = r"\'", r"\&apos;" # escape the left quote for XML
>>> contraction = r"n't", r" n't" # pad a space on the left when "n't" pattern is seen
>>> text = sent
>>> for regexp, substitution in [contraction, escape_singquote]:
...     text = re.sub(regexp, substitution, text)
...     print(text)
...
this ai n't funny
this ai n\&apos;t funny
Escaping the ampersand somehow leaves a literal backslash in the output =(
To work around this, I can do:
>>> escape_singquote = r"\'", r"&apos;" # escape the left quote for XML
>>> text = sent
>>> for regexp, substitution in [contraction, escape_singquote]:
...     text = re.sub(regexp, substitution, text)
...     print(text)
...
this ai n't funny
this ai n&apos;t funny
But it seems that without escaping the single quote in Python, we get the desired result too:
>>> import re
>>> from six import text_type
>>> sent = text_type("this ain't funny")
>>> escape_singquote = r"\'", r"\&apos;" # escape the left quote for XML
>>> contraction = r"n't", r" n't" # pad a space on the left when "n't" pattern is seen
>>> escape_singquote = r"'", r"&apos;" # escape the left quote for XML
>>> text = sent
>>> for regexp, substitution in [contraction, escape_singquote]:
...     text = re.sub(regexp, substitution, text)
...     print(text)
...
this ai n't funny
this ai n&apos;t funny
Now this is puzzling...
Given the context above, the question is: which characters do we need to escape in Python, and which in Perl? Aren't regexes in Perl and Python equivalent?
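Answer 0 (score: 3)
In both Perl and Python, you have to escape the following regex metacharacters if you want to match them literally outside of a character class [1]: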
{}[]()^$.|*+?\
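As a quick check of that list, here is a minimal Python 3 sketch. It assumes `re.escape` is an acceptable way to build a literal pattern; the variable names are only for illustration.

```python
import re

# The metacharacters listed above, as a plain string.
metachars = "{}[]()^$.|*+?\\"

# re.escape() backslash-escapes (at least) every regex metacharacter,
# so the resulting pattern matches the string literally.
literal_pattern = re.escape(metachars)
matches_literally = re.fullmatch(literal_pattern, metachars) is not None

# Without escaping, "." is a wildcard and matches any character.
unescaped_dot_matches_x = re.fullmatch(r"a.c", "axc") is not None
# Escaped, "\." only matches a literal dot.
escaped_dot_matches_x = re.fullmatch(r"a\.c", "axc") is not None
```

Here `matches_literally` and `unescaped_dot_matches_x` should be true, while `escaped_dot_matches_x` should be false.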
Inside a character class, you have to escape metacharacters according to these rules [2]:
     Perl                          Python
-------------------------------------------------------------
-    unless at beginning or end    unless at beginning or end
]    always                        unless at beginning
\    always                        always
^    only if at beginning          only if at beginning
$    always                        never
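The Python column of that table can be spot-checked with a few one-liners. This is only a sketch for Python 3; the sample strings and variable names are made up for the example, and the Perl column is not exercised here.

```python
import re

# "-" is literal at the beginning or end of a class.
dash = re.findall(r"[a-]", "a-b")

# "]" is literal when it is the first character of the class.
bracket = re.findall(r"[]a]", "a]b")

# "\" always has to be escaped.
backslash = re.findall(r"[\\]", "a\\b")

# "^" only negates at the beginning; elsewhere it is literal.
caret = re.findall(r"[a^]", "a^b")

# "$" is never special inside a class in Python.
dollar = re.findall(r"[$]", "a$b")
```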
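Note that neither the single quote ' nor the ampersand & needs to be escaped, whether inside or outside a character class.
However, both Perl and Python ignore the backslash if you use it to escape a punctuation character that is not a metacharacter (e.g. \' is equivalent to ' inside a regex).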
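A small Python 3 sketch of that equivalence on the pattern side (the sample text is invented for the example):

```python
import re

text = "don't & won't"

# "\'" and "'" are the same pattern: the regex engine ignores the
# backslash before punctuation that is not a metacharacter.
quotes_escaped = re.findall(r"\'", text)
quotes_plain = re.findall(r"'", text)

# The same holds for the ampersand.
amps_escaped = re.findall(r"\&", text)
amps_plain = re.findall(r"&", text)
```

Both quote patterns find the same two matches, and both ampersand patterns find the same single match.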
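You seem to be getting tripped up by Python's raw strings:
"When an 'r' or 'R' prefix is present, a character following a backslash is included in the string without change, and all backslashes are left in the string."
r"\'" is the string \' (literal backslash, literal single quote), while r'\&apos;' is the string \&apos; (literal backslash, literal ampersand, and so on).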
So this:
re.sub(r"\'", r'\&apos;', text)
replaces every single quote with the literal text \&apos;.
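To make the asymmetry concrete, here is a minimal Python 3 sketch. In the pattern the backslash is dropped by the regex engine, but in the replacement `\&` is not a recognized escape, so `re.sub` leaves it alone and the backslash ends up in the output. The variable names are illustrative only.

```python
import re

pattern = r"\'"           # two characters: backslash, single quote
replacement = r"\&apos;"  # seven characters, starting with a backslash

# The pattern matches a plain single quote, but the replacement
# keeps its leading backslash.
with_backslash = re.sub(pattern, replacement, "ain't")

# Dropping the backslash from the replacement gives clean XML.
without_backslash = re.sub(r"'", r"&apos;", "ain't")
```

`with_backslash` comes out as `ain\&apos;t`, while `without_backslash` comes out as `ain&apos;t`.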
In summary, your Perl substitution is better written as:
$text =~ s/'/&apos;/g;
and your Python substitution is better written as:
re.sub(r"'", r'&apos;', text)
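Putting it together, the loop from the question could look roughly like this. It is a sketch, not the actual tokenizer: `tokenize` is a hypothetical helper name, and only the two substitutions discussed in this thread are included.

```python
import re

# (pattern, replacement) pairs, applied in order.
CONTRACTION = (r"n't", r" n't")       # pad a space before "n't"
ESCAPE_SINGQUOTE = (r"'", r"&apos;")  # XML-escape the single quote

def tokenize(text):
    """Hypothetical helper: apply each substitution in turn."""
    for regexp, substitution in (CONTRACTION, ESCAPE_SINGQUOTE):
        text = re.sub(regexp, substitution, text)
    return text

result = tokenize("this ain't funny")
print(result)  # this ai n&apos;t funny
```

No backslashes are needed on either side, since neither ' nor & is special in a Python pattern or replacement string.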