Question

I have an input string (it includes Unicode characters):

s = "Question1: a12 is the number of a, b1 is the number of cầu thủ"

I want to get all words that contain no digits and are at least 2 characters long. Expected output:

['is', 'the', 'number', 'of', 'is', 'the', 'number', 'of', 'cầu', 'thủ']

I tried

re.compile(r'[\w]{2,}').findall(s)

which gives

['Question1', 'a12', 'is', 'the', 'number', 'of', 'b1', 'is', 'the', 'number', 'of', 'cầu', 'thủ']

Is there a way to get only the words that contain no digits?

Answer 1

You can use

import re
s = "Question1: a12 is the number of a, b1 is the number of cầu thủ"
print(re.compile(r'\b[^\W\d_]{2,}\b').findall(s))
# => ['is', 'the', 'number', 'of', 'is', 'the', 'number', 'of', 'cầu', 'thủ']

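Or, if you only want words made of ASCII letters that are at least 2 characters long:

print(re.compile(r'\b[a-zA-Z]{2,}\b').findall(s))
# => ['is', 'the', 'number', 'of', 'is', 'the', 'number', 'of']

See the Python demo.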

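Details

To match only letters, use [^\W\d_] (or r'[a-zA-Z]' for the ASCII-only form).
To match whole words, wrap the pattern in word boundaries, \b.
To make sure \b in the pattern is a word boundary rather than a backspace character, use a raw string literal, r'...'.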
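
The raw-string point is easy to check directly. This is a minimal sketch; the names `non_raw`, `raw`, and `text` are only illustrative, and `text` is a shortened form of the input above.

```python
import re

text = "a12 is the number of a"

# Without the r prefix, Python turns \b into a backspace character (U+0008)
# before the regex engine ever sees the pattern.
non_raw = '\b[a-zA-Z]{2,}\b'
# With the r prefix, the backslash survives and re reads \b as a word boundary.
raw = r'\b[a-zA-Z]{2,}\b'

print(non_raw[0] == '\x08')       # True
print(re.findall(non_raw, text))  # []
print(re.findall(raw, text))      # ['is', 'the', 'number', 'of']
```

The non-raw pattern finds nothing because it looks for literal backspace characters around the letters, and the text contains none.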

So r'\b[^\W\d_]{2,}\b' defines a regex that matches a word boundary, then two or more letters, and then asserts that no word character follows those letters.
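
If the required length varies, the same pattern can be built from parameters. The helper below is a hypothetical sketch (the name `words_without_digits` and its `min_len`/`max_len` arguments are not from the answer), and it assumes the text uses precomposed characters, as the sample input does.

```python
import re

def words_without_digits(text, min_len=2, max_len=None):
    """Return letter-only words whose length lies between min_len and max_len."""
    # An empty upper bound gives an open-ended quantifier such as {2,}.
    upper = "" if max_len is None else str(max_len)
    # Doubled braces are literal braces inside an f-string.
    pattern = rf"\b[^\W\d_]{{{min_len},{upper}}}\b"
    return re.findall(pattern, text)

s = "Question1: a12 is the number of a, b1 is the number of cầu thủ"
print(words_without_digits(s))                        # at least 2 letters
print(words_without_digits(s, min_len=3))             # at least 3 letters
print(words_without_digits(s, min_len=2, max_len=2))  # exactly 2 letters
```

Because both ends of the pattern are anchored with \b, an upper bound does not cut long words into pieces: with max_len=2, a word such as "number" is skipped entirely rather than matched as "nu".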

Answer 2

Use str.isalpha:

import re
s = "Question1: a12 is the number of a, b1 is the number of cầu thủ"
# Collect every run of 2+ word characters, then keep only the all-letter ones.
[c for c in re.findall(r'\w{2,}', s) if c.isalpha()]

Output:

['is', 'the', 'number', 'of', 'is', 'the', 'number', 'of', 'cầu', 'thủ']
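
As a quick spot check, the two approaches can be run side by side. The function names and sample strings below are made up for illustration, and agreement on these samples is not a proof that the two always return the same list.

```python
import re

def by_boundary_regex(text):
    # Answer 1: letters only, delimited by word boundaries.
    return re.findall(r'\b[^\W\d_]{2,}\b', text)

def by_isalpha_filter(text):
    # Answer 2: take every run of word characters, then filter with str.isalpha.
    return [w for w in re.findall(r'\w{2,}', text) if w.isalpha()]

samples = [
    "a12 is the number of a",
    "snake_case names and x1 y22 zz",
]
for text in samples:
    print(by_boundary_regex(text) == by_isalpha_filter(text), by_isalpha_filter(text))
```

Both versions drop tokens that contain an underscore, since `_` is a word character for \b and \w but is not a letter for str.isalpha.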
