Question

操作系统：Windows 7. Jython 2.7.0＆＃34;最终版本＆＃34;。

for token in sorted_cased.keys():
    freq = sorted_cased[ token ]
    if freq > 1:
        print( 'token |%s| unicode? %s' % ( token, isinstance( token, unicode ), ) )
        if re.search( ur'\p{L}+', token ):
            print( '  # cased token |%s| freq %d' % ( token, freq, ))

sorted_cased是一个显示令牌发生频率的字典。在这里，我试图清除以频率＆gt;出现的单词（仅限unicode字符）。 1.（注意我使用re.match而不是search，但search应该检测事件1，例如token中的\ p {L}

示例输出：

token |Management| unicode? True
token |n| unicode? True
token |identifiés| unicode? True
token |décrites| unicode? True
token |agissant| unicode? True
token |tout| unicode? True
token |sociétés| unicode? True

没有人认识到它中只有一个[p {L}]。我尝试了各种排列：双引号，添加flags=re.UNICODE等等。

后我被要求解释为什么这不能被归类为How to implement \p{L} in python regex的副本。它可以，但......其他问题的答案并未引起人们对使用 REGEX MODULE （旧版本？非常新版本？NB它们不同）的需求的注意，而不是< strong> RE MODULE 。为了挽救毛囊和未来遇到这种毛囊的人的理智，我要求允许保留现在的段落，尽管问题是“欺骗”＃34;

此外我尝试安装Pypi正则表达式模块 FAILED UNDER JYTHON （使用pip）。可能更好地使用java.util.regex。

Answer 1

如果您有权访问Java java.util.regex，最好的选择是使用内置的\p{L}类。

Python（包括Jython方言）不支持\p{L}和其他Unicode类别类。也不是POSIX角色类。

另一种方法是限制\w类(?![\d_])\w并使用UNICODE标记。 If UNICODE is set, this \w will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database.。这个替代方案有一个缺陷：它不能在字符类中使用。

另一个想法是使用[^\W\d_]（带re.U标志）来匹配任何不是非单词（\W），数字（\d）的字符和_ char。它将有效地匹配任何Unicode 字母。

为什么＆＃34; \ p {L}＆＃34;不在这个正则表达式工作？

1 个答案: