Question

I'm trying to clean a string so that it contains no punctuation or digits; it must contain only a-z and A-Z. For example, given the string:

"coMPuter scien_tist-s are,,,  the  rock__stars of tomorrow_ <cool>  ????"

The required output is:

['computer', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow']

My solution is:

re.findall(r"([A-Za-z]+)" ,string)

My output is:

['coMPuter', 'scien', 'tist', 's', 'are', 'the', 'rock', 'stars', 'of', 'tomorrow', 'cool']

Answer 1

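You don't need a regular expression for this:

Lowercase the string (if you want all-lowercase words), split it into words, and keep only the words that start with a letter, dropping the non-letter characters inside each one: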

>>> s = "coMPuter scien_tist-s are,,,  the  rock__stars of tomorrow_ <cool>  ????"
>>> [filter(str.isalpha, word) for word in s.lower().split() if word[0].isalpha()]
['computer', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow']

In Python 3.x, filter(str.isalpha, word) should be replaced with ''.join(filter(str.isalpha, word)), because in Python 3.x filter returns a filter object (an iterator) rather than a string.
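For reference, here is a minimal Python 3 sketch of the same approach wrapped in a function. The name clean_words is only illustrative and does not come from the answer. Note that str.isalpha also accepts non-ASCII letters, so this is only equivalent to "a-z and A-Z" when the input is ASCII.

```python
def clean_words(text):
    """Lowercase, split on whitespace, keep tokens that start with a letter,
    then drop every non-letter character inside each kept token."""
    return [
        "".join(filter(str.isalpha, word))  # join is needed in Python 3
        for word in text.lower().split()    # split() never yields empty tokens
        if word[0].isalpha()                # skips "<cool>" and "????"
    ]


s = "coMPuter scien_tist-s are,,,  the  rock__stars of tomorrow_ <cool>  ????"
print(clean_words(s))
# ['computer', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow']
```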

Answer 2

Thanks to the suggestions from everyone who answered, I arrived at the solution I actually wanted. Thanks to each of you.

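import re
s = "coMPuter scien_tist-s are,,,  the  rock__stars of tomorrow_ <cool>  ????"
cleaned = re.sub(r'(<.*>|[^a-zA-Z\s]+)', '', s).split()
print(cleaned)
# ['coMPuter', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow']  (original case is kept)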
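One caveat worth checking before reusing this pattern: <.*> is greedy, so if a string contains more than one <...> span, everything between the first < and the last > is removed. The sketch below compares it with a bounded variant, <[^>]*>, which is my own substitution and not part of the answer; the sample string is made up for illustration.

```python
import re

# Pattern from the answer: greedy tag match.
GREEDY = re.compile(r'(<.*>|[^a-zA-Z\s]+)')
# Hypothetical variant: the tag match stops at the first '>'.
BOUNDED = re.compile(r'(<[^>]*>|[^a-zA-Z\s]+)')

sample = "alpha <b> beta <c> gamma"  # made-up input with two tags

greedy_words = GREEDY.sub('', sample).split()
bounded_words = BOUNDED.sub('', sample).split()

print(greedy_words)   # ['alpha', 'gamma']  ("beta" is swallowed)
print(bounded_words)  # ['alpha', 'beta', 'gamma']
```

On the string from the question, which has a single <cool> span, both patterns give the same result.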

Answer 3

Using re, although I'm not sure this is what you want, since you said you don't want "cool" left over.

import re

s = "coMPuter scien_tist-s are,,,  the  rock__stars of tomorrow_ <cool>  ????"

REGEX = r'([^a-zA-Z\s]+)'

cleaned = re.sub(REGEX, '', s).split()
# ['coMPuter', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow', 'cool']

Edit

WORD_REGEX = re.compile(r'(?!<?\S+>)(?=\w)(\S+)')
CLEAN_REGEX = re.compile(r'([^a-zA-Z])')

def cleaned(match_obj):
    return re.sub(CLEAN_REGEX, '', match_obj.group(1)).lower()

[cleaned(x) for x in re.finditer(WORD_REGEX, s)]
# ['computer', 'scientists', 'are', 'the', 'rockstars', 'of', 'tomorrow']

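WORD_REGEX uses a positive lookahead for any word character and a negative lookahead for <...>. Whatever run of non-whitespace passes both lookaheads is captured in a group: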

(?!<?\S+>)  # negative lookahead
(?=\w)      # positive lookahead
(\S+)       # group non-whitespace

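cleaned takes the match group, removes every character matched by CLEAN_REGEX (anything that is not a letter), and lowercases the result.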
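To see what the two lookaheads accept and reject, the sketch below runs the same patterns on a few individual tokens. The token list is my own choice for illustration. It also shows an edge case: \w matches digits and underscores, so a token such as "42" passes WORD_REGEX and is then cleaned down to an empty string, which you may want to filter out.

```python
import re

WORD_REGEX = re.compile(r'(?!<?\S+>)(?=\w)(\S+)')
CLEAN_REGEX = re.compile(r'[^a-zA-Z]')

results = {}
for token in ["are,,,", "<cool>", "????", "42"]:
    kept = WORD_REGEX.findall(token)  # non-whitespace runs that pass both lookaheads
    results[token] = [CLEAN_REGEX.sub('', t).lower() for t in kept]
    print(repr(token), kept, results[token])

# 'are,,,' ['are,,,'] ['are']
# '<cool>' [] []
# '????' [] []
# '42' ['42'] ['']
```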
