无效的标识符验证

Question

Python 3有一个名为str.isidentifier

的字符串方法

如何在Python 2.6中获得类似的功能，而不是重写我自己的正则表达式等？

Answer 1

tokenize模块定义名为Name

的正则表达式

import re, tokenize, keyword
re.match(tokenize.Name + '$', somestr) and not keyword.iskeyword(somestr)

Answer 2

re.match(r'[a-z_]\w*$', s, re.I)

应该做得很好。据我所知，没有任何内置方法。

Answer 3

在Python＆lt; 3.0这很容易，因为你不能在标识符中使用unicode字符。那应该做的工作：

import re
import keyword

def isidentifier(s):
    if s in keyword.kwlist:
        return False
    return re.match(r'^[a-z_][a-z0-9_]*$', s, re.I) is not None

Answer 4

到目前为止答案很好。我会这样写的。

import keyword
import re

def isidentifier(candidate):
    "Is the candidate string an identifier in Python 2.x"
    is_not_keyword = candidate not in keyword.kwlist
    pattern = re.compile(r'^[a-z_][a-z0-9_]*$', re.I)
    matches_pattern = bool(pattern.match(candidate))
    return is_not_keyword and matches_pattern

Answer 5

我已经决定对此采取另一个措施，因为有几个很好的建议。我会尝试巩固它们。以下内容可以保存为Python模块，并直接从命令行运行。如果运行，它会测试函数，因此可证明是正确的（至少在文档证明了这一功能的程度上）。

import keyword
import re
import tokenize

def isidentifier(candidate):
    """
    Is the candidate string an identifier in Python 2.x
    Return true if candidate is an identifier.
    Return false if candidate is a string, but not an identifier.
    Raises TypeError when candidate is not a string.

    >>> isidentifier('foo')
    True

    >>> isidentifier('print')
    False

    >>> isidentifier('Print')
    True

    >>> isidentifier(u'Unicode_type_ok')
    True

    # unicode symbols are not allowed, though.
    >>> isidentifier(u'Unicode_content_\u00a9')
    False

    >>> isidentifier('not')
    False

    >>> isidentifier('re')
    True

    >>> isidentifier(object)
    Traceback (most recent call last):
    ...
    TypeError: expected string or buffer
    """
    # test if candidate is a keyword
    is_not_keyword = candidate not in keyword.kwlist
    # create a pattern based on tokenize.Name
    pattern_text = '^{tokenize.Name}$'.format(**globals())
    # compile the pattern
    pattern = re.compile(pattern_text)
    # test whether the pattern matches
    matches_pattern = bool(pattern.match(candidate))
    # return true only if the candidate is not a keyword and the pattern matches
    return is_not_keyword and matches_pattern

def test():
    import unittest
    import doctest
    suite = unittest.TestSuite()
    suite.addTest(doctest.DocTestSuite())
    runner = unittest.TextTestRunner()
    runner.run(suite)

if __name__ == '__main__':
    test()

Answer 6

无效的标识符验证

此线程中的所有答案似乎都在验证中重复一个错误，该错误允许将不是有效标识符的字符串进行匹配。

其他答案中建议的正则表达式模式是从tokenize.Name构建的，它包含以下正则表达式模式[a-zA-Z_]\w*（运行python 2.7.15）和'$'正则表达式锚。

请参考official python 3 description of the identifiers and keywords（其中也包含与python 2相关的段落）。

在ASCII范围（U + 0001..U + 007F）中，标识符的有效字符与Python 2.x中的相同：大写和小写字母A至Z，下划线_和，除了第一个字符，数字0到9。

因此'foo \ n'不应被视为有效的标识符。

虽然有人可能会认为此代码是有效的：

>>>  class Foo():
>>>     pass
>>> f = Foo()
>>> setattr(f, 'foo\n', 'bar')
>>> dir(f)
['__doc__', '__module__', 'foo\n']
>>> print getattr(f, 'foo\n')
bar

由于换行符确实是有效的ASCII字符，因此不将其视为字母。此外，显然没有实际应用以换行符结尾的标识符

>>> f.foo\n
SyntaxError: unexpected character after line continuation character

str.isidentifier函数还确认这是一个无效的标识符：

python3解释器：

>>> print('foo\n'.isidentifier())
False

`$`锚与`\Z`锚

引用official python2 Regular Expression syntax：

$

匹配字符串的结尾或在字符串结尾处的换行符之前，并且在MULTILINE模式下也匹配换行符之前。 foo同时匹配“ foo”和“ foobar”，而正则表达式foo $仅匹配“ foo”。更有趣的是，在'foo1 \ nfoo2 \ n'中搜索foo。$通常会匹配'foo2'，但在MULTILINE模式下会匹配'foo1'；在'foo \ n'中搜索单个$将发现两个（空）匹配项：一个在换行符之前，一个在字符串末尾。

这将导致以换行符结尾的字符串作为有效标识符进行匹配：

>>> import tokenize
>>> import re
>>> re.match(tokenize.Name + '$', 'foo\n')
<_sre.SRE_Match at 0x3eac8e0>
>>> print m.group()
'foo'

正则表达式模式不应使用$锚，而应使用\Z作为锚。再次报价：

\Z

仅匹配字符串的末尾。

现在正则表达式是有效的：

>>> re.match(tokenize.Name + r'\Z', 'foo\n') is None
True

危险含义

请参见Luke's answer的另一个示例，这种弱的正则表达式匹配在其他情况下如何可能带来更危险的影响。

进一步阅读

Python 3添加了对非ASCII标识符的支持，请参见PEP-3131。

Answer 7

我在用什么：

def is_valid_keyword_arg(k):
    """
    Return True if the string k can be used as the name of a valid
    Python keyword argument, otherwise return False.
    """
    # Don't allow python reserved words as arg names
    if k in keyword.kwlist:
        return False
    return re.match('^' + tokenize.Name + '$', k) is not None

Answer 8

到目前为止提出的所有解决方案都不支持Unicode，或者如果在Python 3 上运行，则允许在第一个char 中使用数字。

编辑：建议的解决方案应仅用于Python 2，并且应使用Python3 isidentifier。这是一个应该适用于任何地方的解决方案：

re.match(r'^\w+$', name, re.UNICODE) and not name[0].isdigit()

基本上，它测试某些东西是否包含（至少1个）字符（包括数字），然后检查第一个字符是否不是数字。

如何在Python 2.6中获得Python isidentifer（）功能？

8 个答案:

无效的标识符验证

`$`锚与`\Z`锚

危险含义

进一步阅读

如何在Python 2.6中获得Python isidentifer（）功能？

8 个答案:

无效的标识符验证

$锚与\Z锚

危险含义

进一步阅读

`$`锚与`\Z`锚