Question

I am using re.findall like this:

x=re.findall('\w+', text)

This gives me a list of words made up of the characters [a-zA-Z0-9]. The problem appears when I use this input:

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~:

I want to get an empty list, but I get ['_', '_']. How can I exclude those underscores?

Answer 1

Use the [a-zA-Z0-9] pattern; \w includes the underscore:

x = re.findall('[a-zA-Z0-9]+', text)

Or use \W, the inverse of \w, in a negated character set with _ added:

x = re.findall(r'[^\W_]+', text)

The latter has the advantage that it keeps working even with re.UNICODE or re.LOCALE, where \w matches a wider range of characters.
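
To illustrate the difference between the two patterns on non-ASCII input, here is a minimal sketch. It assumes Python 3, where str patterns are Unicode-aware by default, and the sample string is made up for illustration rather than taken from the question.

```python
import re

# Hypothetical sample text (not from the question):
# accented letters, an underscore, and some digits.
text = 'caf\u00e9_na\u00efve 123'

# ASCII-only class: words are cut at every accented letter.
ascii_words = re.findall(r'[a-zA-Z0-9]+', text)

# Negated class: match anything that is neither a non-word
# character (\W) nor an underscore.
unicode_words = re.findall(r'[^\W_]+', text)

print(ascii_words)
print(unicode_words)
```

If only ASCII letters and digits should ever count as word characters, the explicit [a-zA-Z0-9] class is the simpler choice.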

Demo:

>>> import re
>>> text = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~:'
>>> re.findall(r'[^\W_]+', text)
[]
>>> re.findall(r'[^\W_]+', 'The foo bar baz! And the eggs, ham and spam?')
['The', 'foo', 'bar', 'baz', 'And', 'the', 'eggs', 'ham', 'and', 'spam']

Answer 2

You can also use groupby:

from itertools import groupby
x = [''.join(g) for k, g in groupby(text, str.isalnum) if k]

For example:

>>> text = 'The foo bar baz! And the eggs, ham and spam?'
>>> x = [''.join(g) for k, g in groupby(text, str.isalnum) if k]
>>> x
['The', 'foo', 'bar', 'baz', 'And', 'the', 'eggs', 'ham', 'and', 'spam']
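
As a rough check that the groupby approach also handles the punctuation-only input from the question, the sketch below wraps it in a helper. The name alnum_words is hypothetical, chosen here for illustration; it relies on str.isalnum returning False for the underscore.

```python
import re
from itertools import groupby

def alnum_words(text):
    """Return maximal runs of alphanumeric characters (hypothetical helper)."""
    # groupby splits the string into consecutive runs that share the same
    # str.isalnum result; keep only the runs where that result is True.
    return [''.join(g) for k, g in groupby(text, str.isalnum) if k]

punctuation = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
sentence = 'The foo bar baz! And the eggs, ham and spam?'

print(alnum_words(punctuation))
# Compare against the regex from the first answer on the same sentence.
print(alnum_words(sentence) == re.findall(r'[^\W_]+', sentence))
```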

Using re.findall to split a string into words in Python
