Question

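Given a name string, I want to validate some basic conditions:
- The characters belong to a recognized script/alphabet (Latin, Chinese, Arabic, etc.) and aren't, say, emojis.
- The string doesn't contain digits and has length < 40.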

I know the latter can be done with a regular expression, but is there a Unicode-aware way to do the former? Which text-processing libraries can I use?

Answer 1

You should be able to check this using Unicode character classes in a regular expression.

[\p{P}\s\w]{40,}

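The most important part here is the \w character class in Unicode mode:

\p{P} matches any kind of punctuation character
  \s matches any kind of whitespace (invisible) character (equal to [\p{Z}\h\v])
  \w matches any word character in any script (equal to [\p{L}\p{N}_])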
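
To make the property letters above concrete, here is a small sketch that prints the Unicode general category of a few sample characters using only the standard library's unicodedata module. The helper name general_categories and the sample string are illustrative assumptions, not part of the original answer.

```python
import unicodedata

def general_categories(text):
    """Return the two-letter Unicode general category of each character."""
    return [unicodedata.category(ch) for ch in text]

# Escapes are used so the sample does not depend on the file's encoding:
# Latin, accented Latin, Chinese, Arabic, digit, underscore, punctuation,
# space, currency sign, emoji.
samples = "A\u00e9\u4e2d\u0639" "7_! " "\u20ac\U0001F600"

for ch, cat in zip(samples, general_categories(samples)):
    print(repr(ch), cat, unicodedata.name(ch, "<unnamed>"))
```

Categories starting with L are letters, N are numbers, P punctuation, Z separators, and S symbols. The emoji falls under So (Symbol, other), which is why a letter-based class leaves it out.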


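You might want to add more, such as \p{Sc} to match currency symbols. Note that \w also matches digits (\p{N}) and that {40,} matches runs of 40 or more characters, so for the conditions in the question you would likely swap \w for \p{L} and use an upper bound such as {1,39}.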

To take advantage of this, however, you need the regex module (an alternative to the standard re module), which supports Unicode code point properties with the \p{} syntax.

# coding=utf8
# The line above declares the source encoding; it is only needed for Python 2.x.

# Third-party module (pip install regex); unlike the standard re module,
# it supports the \p{...} property syntax.
import regex as re

# Matches runs of 40 or more punctuation, whitespace, or word characters.
pattern = r"[\p{P}\s\w]{40,}"

test_str = ("Wow cool song!Wow cool song!Wow cool song!Wow cool song!  \n"
            "Wow cool song! Wow cool song! Wow cool song! \n")

matches = re.finditer(pattern, test_str, re.UNICODE | re.MULTILINE)

for match_num, match in enumerate(matches, start=1):
    print("Match {num} was found at {start}-{end}: {match}".format(
        num=match_num, start=match.start(), end=match.end(), match=match.group()))

    # The pattern has no capture groups, so this loop does nothing here.
    for group_num in range(1, len(match.groups()) + 1):
        print("Group {num} found at {start}-{end}: {group}".format(
            num=group_num, start=match.start(group_num),
            end=match.end(group_num), group=match.group(group_num)))
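
If installing the regex module is not an option, the same idea can be approximated with the standard library. The sketch below checks the two conditions from the question with unicodedata. The function name is_valid_name, the MAX_LEN constant, and the small set of allowed punctuation are assumptions made for illustration, so adjust them to your own rules.

```python
import unicodedata

MAX_LEN = 40  # the question asks for length < 40
# Assumption: punctuation commonly seen in names (apostrophes, hyphen, period).
ALLOWED_PUNCT = {"'", "-", ".", "\u2019"}

def is_valid_name(name):
    """Return True if name has only letters, marks, spaces and allowed punctuation."""
    if not name.strip() or len(name) >= MAX_LEN:
        return False
    for ch in name:
        cat = unicodedata.category(ch)
        if cat[0] in ("L", "M"):  # letters and combining marks of any script
            continue
        if cat == "Zs" or ch in ALLOWED_PUNCT:
            continue
        return False  # digits (N*), emoji (So), controls, etc.
    return True

print(is_valid_name("Mar\u00eda Jos\u00e9"))    # Latin with accents
print(is_valid_name("\u738b\u5c0f\u660e"))      # Chinese
print(is_valid_name("John3"))                   # contains a digit
print(is_valid_name("Ann \U0001F600"))          # contains an emoji
```

Note that len() counts code points, not user-perceived characters, so a name written with many combining marks reaches the limit sooner than it appears to.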

PS: .NET Regex gives you a few more options, such as \p{IsGreek}.
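
Python's standard library has no direct equivalent of a per-script class. As I understand its documentation, the regex module accepts script properties such as \p{Script=Greek}. Without that module, a rough heuristic is to look at the first word of each letter's Unicode name. The sketch below is only that heuristic, not the real Unicode Script property, and the helper names script_hint and looks_greek are made up for this example.

```python
import unicodedata

def script_hint(ch):
    """Return the first word of the character's Unicode name, e.g. 'GREEK'."""
    name = unicodedata.name(ch, "")
    return name.split(" ", 1)[0] if name else ""

def looks_greek(text):
    """Heuristic: True if text has letters and every letter's name starts with GREEK."""
    letters = [ch for ch in text if unicodedata.category(ch).startswith("L")]
    return bool(letters) and all(script_hint(ch) == "GREEK" for ch in letters)

print(looks_greek("\u0391\u03b8\u03b7\u03bd\u03ac"))  # Greek letters only
print(looks_greek("Athena"))                          # Latin letters
```

The same prefix trick gives hints such as LATIN, ARABIC, or CJK, which can serve as a coarse allow-list when a full script lookup is unavailable.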

Python - Basic validation of international names?
