Question

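I'm using a form to check whether a postal code in the Japanese format has been entered. Today I realized that some input gets through the regex match test even though it should not.

Here is the regular expression: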

".*([0-9０-９]{3}[-ー]{1}[0-9０-９]{4}).*"

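It accepts both ordinary digits and Japanese full-width digits (the same goes for "-": the Japanese "ー" may be entered as well), and the format should be: 123-4567.

It works fine when the input contains only Latin letters and digits. But some Japanese characters that should not match at all come back as a match:

(Note: a match returns a result; no match returns nothing.)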

>>> import re
>>> regstr = ".*([0-9０-９]{3}[-ー]{1}[0-9０-９]{4}).*"

>>> re.match( regstr, "this is obviously not going to work")
>>> re.match( regstr, "this is going to work 123-4567")
<_sre.SRE_Match object at 0x7fced8b485d0>
>>> re.match( regstr, "this is going to work too １２３ー４５６７")
<_sre.SRE_Match object at 0x7fced8b48648>

>>> re.match( regstr, "This will not work, as it should not :  1234-567")
>>> re.match( regstr, "This should not work, but it does :  １２３４ー５６７")
<_sre.SRE_Match object at 0x7fced8b48648>
>>> re.match( regstr, "Now just seems crazy ....... 京都府")
<_sre.SRE_Match object at 0x7fced8b485d0>
>>> re.match( regstr, "京都府")
<_sre.SRE_Match object at 0x7fced8b48648>

>>> "京都府"
'\xe4\xba\xac\xe9\x83\xbd\xe5\xba\x9c'
>>> re.match( regstr, "\xe4\xba\xac\xe9\x83\xbd\xe5\xba\x9c")
<_sre.SRE_Match object at 0x7fced8b48648>

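I tried entering other kanji, and the couple I tried did not match.

So anyone living in Kyoto Prefecture can "bypass" the regex, because "京都府" alone is enough to make the whole string valid. Any two of those three characters on their own do not match.

I also tried the byte codes of the three characters, and that matches too. (I wondered whether the codes were used in place of the characters themselves when parsing the string, and wanted to be sure they did not contain something that could actually fit '000-0000'. They do not, yet the string still matches the regex.)

People living in Tokyo, "東京府", will have less luck: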

>>> re.match( regstr, "東京府")
>>> "東京府"
'\xe6\x9d\xb1\xe4\xba\xac\xe5\xba\x9c'

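I checked on https://regex101.com/, and there those 3 characters do not match.

So... I'm lost here. Using the simpler ".*([0-9]{3}[-]{1}[0-9]{4}).*" as the regex seems to work fine, but I really don't want to restrict users to [0-9-] only, since many people will type the wider Japanese versions ０１２３４５６７８９ー. In case it matters: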

# 'Japanese numbers' code
>>> "０１２３４５６７８９ー"
'\xef\xbc\x90\xef\xbc\x91\xef\xbc\x92\xef\xbc\x93\xef\xbc\x94\xef\xbc\x95\xef\xbc\x96\xef\xbc\x97\xef\xbc\x98\xef\xbc\x99\xe3\x83\xbc'

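For now I will just convert the Japanese ０１２３４５６７８９ー to 0123456789- and apply a regex that contains no Japanese characters at all, but... I would really like to know what is going on between regexes and Japanese characters.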
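
The conversion workaround described above could look roughly like the sketch below. It is written for Python 3, where string literals are already Unicode, and the names FULLWIDTH_TO_ASCII, ASCII_ZIP and find_zip are made up for illustration only.

```python
import re

# Map the full-width digits and the long-vowel mark to their ASCII counterparts.
FULLWIDTH_TO_ASCII = str.maketrans("０１２３４５６７８９ー", "0123456789-")

# ASCII-only version of the pattern from the question.
ASCII_ZIP = re.compile(r".*([0-9]{3}-[0-9]{4}).*")


def find_zip(text):
    """Return the postal code found in text (in ASCII form), or None."""
    match = ASCII_ZIP.match(text.translate(FULLWIDTH_TO_ASCII))
    return match.group(1) if match else None


print(find_zip("this is going to work too １２３ー４５６７"))
print(find_zip("京都府"))
```

The translated copy is only used for the check, so the user's original text is left untouched.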

If anyone has a clue, it would be much appreciated.

Cheers

Edit: Python 2.7

Answer 1

regstr = ".*([0-9０-９]{3}[-ー]{1}[0-9０-９]{4}).*"

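In Python 3, regstr is a Unicode string containing some non-ASCII characters. In Python 2, it is a byte string in some encoding, which depends on what you declared at the top of the module (see PEP 263) and on the encoding actually used to save the file. To avoid this kind of problem, I recommend not putting literal non-ASCII characters in a regular expression. It is too hard to debug. Escape them.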
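
To see how this plays out in Python 2, it may help to look at the pattern as the byte string the regex engine receives there. The sketch below is a reconstruction, assuming both the source file and the input are UTF-8 (which is consistent with the byte dumps in the question). It runs on Python 3 and uses an explicit bytes pattern to imitate the Python 2 str behaviour; byte_pattern, kyoto and tokyo are illustrative names.

```python
import re

regstr = ".*([0-9０-９]{3}[-ー]{1}[0-9０-９]{4}).*"

# What a Python 2 str literal would hold, assuming a UTF-8 source file.
byte_pattern = regstr.encode("utf-8")

# "０-９" is now the bytes EF BC 90 2D EF BC 99, so inside [...] the engine sees
# the single bytes EF and BC plus the byte range 90-EF, not a range of digits.
# Likewise "ー" (E3 83 BC) turns [-ー] into "one of 2D, E3, 83, BC".
print(byte_pattern)

kyoto = "京都府".encode("utf-8")  # E4 BA AC E9 83 BD E5 BA 9C
tokyo = "東京府".encode("utf-8")  # E6 9D B1 E4 BA AC E5 BA 9C

# 京都府 contains the byte 83, preceded by three bytes and followed by four
# bytes that all fall in 90-EF, so the byte-level pattern matches.
print(re.match(byte_pattern, kyoto) is not None)  # True

# 東京府 contains none of the "separator" bytes, so it does not match.
print(re.match(byte_pattern, tokyo) is not None)  # False

# Matching Unicode text against the Unicode pattern behaves as intended.
print(re.match(regstr, "京都府") is not None)  # False
```

Under those assumptions, this also accounts for why only two of the three characters do not match: the accidental match needs the full nine-byte sequence.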

The characters ０１２３４５６７８９ are the Unicode characters '\uff10' through '\uff19', so I suggest you write them that way.
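
If in doubt, the code points can be checked directly. This is a small Python 3 sketch using only the standard library; it simply prints the code point and official Unicode name of the characters involved.

```python
import unicodedata

# First and last full-width digit, the Japanese long-vowel mark, and ASCII "-".
for ch in "０９ー-":
    print("U+%04X" % ord(ch), unicodedata.name(ch))
```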

Also, if you are using a Unicode regular expression, you should define it with the u prefix for unicode strings:

regstr = u".*([0-9\uff10-\uff19]{3}[-\u30fc]{1}[0-9\uff10-\uff19]{4}).*"

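Later, when you match this regex against some string, that other string should also be a unicode string rather than a plain str. For that, you must know how the input is encoded. For example, if the input is utf-8, use: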

input_string_as_unicode = unicode(input_string_as_utf8, 'utf-8')
re.match(regstr, input_string_as_unicode)

Note that if some framework behind the scenes handles the input for you, you may already be receiving it as unicode. If unsure, check type(input_string).
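
For readers on Python 3, where unicode() no longer exists, a rough equivalent is to decode the bytes before matching. This is only a sketch, assuming the input arrives as UTF-8 bytes; matches_zip is a hypothetical helper name.

```python
import re

# Same pattern as above; in Python 3 a plain string literal is already Unicode.
regstr = ".*([0-9\uff10-\uff19]{3}[-\u30fc]{1}[0-9\uff10-\uff19]{4}).*"


def matches_zip(input_bytes, encoding="utf-8"):
    """Decode raw input bytes, then match against the Unicode pattern."""
    text = input_bytes.decode(encoding)
    return re.match(regstr, text) is not None


print(matches_zip("１２３ー４５６７".encode("utf-8")))
print(matches_zip("京都府".encode("utf-8")))
```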

Answer 2

I just ran your tests on Python 3.6.6, and they work as expected. The only thing I did differently was to use re.compile. See:

Python 3.6.6 (default, Jul 19 2018, 14:25:17) 
[GCC 8.1.1 20180712 (Red Hat 8.1.1-5)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> zipcode = re.compile(r'.*([0-9０-９]{3}[-ー]{1}[0-9０-９]{4}).*')
>>> zipcode.match("this is obviously not going to work")
>>> zipcode.match("this is going to work 123-4567")
<_sre.SRE_Match object; span=(0, 30), match='this is going to work 123-4567'>
>>> zipcode.match("this is going to work 123-4567").group(0)
'this is going to work 123-4567'
>>> zipcode.match("this is going to work 123-4567").group(1)
'123-4567'
>>> zipcode.match("this is going to work too １２３ー４５６７").group(1)
'１２３ー４５６７'
>>> zipcode.match("This should not work, but it does :  １２３４ー５６７").group(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
>>> zipcode.match("This should not work, but it does :  １２３４ー５６７")
>>> zipcode.match("Now just seems crazy ....... 京都府")
>>> zipcode.match("京都府")
>>>

Edit

Here is what I have so far:

$ cat ziptest.py 
# -*- coding: utf-8 -*-
import re
zipcode = re.compile(r'.*([0-9０１２３４５６７８９]{3}[-ー]{1}[0-9０１２３４５６７８９]{4}).*')
tests = (
    "this is obviously not going to work",
    "this is going to work 123-4567",
    "this is going to work too １２３ー４５６７",
    "This will not work, as it should not :  1234-567",
    "This should not work, but it does :  １２３４ー５６７",
    "Now just seems crazy ....... 京都府",
    "京都府",
    "\xe4\xba\xac\xe9\x83\xbd\xe5\xba\x9c"
)

for test in tests:
    print('%s: %s' % (test, "Match" if zipcode.match(test) else "No match"))
$

The results are as follows:

$ python2.7 ziptest.py 
this is obviously not going to work: No match
this is going to work 123-4567: Match
this is going to work too １２３ー４５６７: Match
This will not work, as it should not :  1234-567: No match
This should not work, but it does :  １２３４ー５６７: Match
Now just seems crazy ....... 京都府: No match
京都府: No match
京都府: No match

$ python3.6 ziptest.py 
this is obviously not going to work: No match
this is going to work 123-4567: Match
this is going to work too １２３ー４５６７: Match
This will not work, as it should not :  1234-567: No match
This should not work, but it does :  １２３４ー５６７: No match
Now just seems crazy ....... 京都府: No match
京都府: No match
äº¬é½åº: No match

Hope this helps.

Regex returns true with incorrect Japanese characters
