Question

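I'm looking for an efficient way to solve this problem.

Suppose we want to find a list of words in a string, ignoring case, but instead of storing the matched text as it appears in the string, we want to store each word with the same casing it has in the original list.

For example: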

words_to_match = ['heLLo', 'jumP', 'TEST', 'RESEARCH stuff']
text = 'hello this is jUmp test jump and research stuff'
# Result should be {'TEST', 'heLLo', 'jumP', 'RESEARCH stuff'}

Here is my current approach:

words_to_match = ['heLLo', 'jumP', 'TEST', 'RESEARCH stuff']

I convert it into the following regular expression:

import re
regex = re.compile(r'\bheLLo\b|\bjumP\b|\bTEST\b|\bRESEARCH stuff\b', re.IGNORECASE)

Then:

word_founds = re.findall(regex,'hello this is jUmp test jump and research stuff')
normalization_dict = {w.lower():w for w in words_to_match}
# normalization dict : {'hello': 'heLLo', 'jump': 'jumP', 'test': 'TEST', 'research stuff': 'RESEARCH stuff'}
final_list = [normalization_dict[w.lower()] for w in word_founds]
# final_list : ['heLLo', 'jumP', 'TEST', 'jumP', 'RESEARCH stuff']
final_result = set(final_list)
# final_result : {'TEST', 'heLLo', 'jumP', 'RESEARCH stuff'}

This gives the result I expect. I'd just like to know whether there is a faster or more elegant way to solve this.
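The same approach can be written with the pattern generated from words_to_match instead of typed by hand. This is only a sketch of the steps above; it assumes the words should be matched literally, so each one is passed through re.escape before being joined.

```python
import re

words_to_match = ['heLLo', 'jumP', 'TEST', 'RESEARCH stuff']
text = 'hello this is jUmp test jump and research stuff'

# Escape each word so regex metacharacters are treated literally,
# wrap it in word boundaries, and join the alternatives with "|".
pattern = '|'.join(r'\b' + re.escape(w) + r'\b' for w in words_to_match)
regex = re.compile(pattern, re.IGNORECASE)

# Map the lowercased form of each word back to its original casing.
normalization_dict = {w.lower(): w for w in words_to_match}

# Look up every match by its lowercased form; the set removes duplicates.
final_result = {normalization_dict[m.lower()] for m in regex.findall(text)}
print(final_result)
```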

Answer 1

If you're still happy to use regular expressions, you can do it in one line.

results = set(word for word in re.findall(r"[\w']+", text) if word.lower() in [w.lower() for w in words_to_match])

The regex here simply splits the text variable into words at word boundaries.
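A quick check of what this one-liner returns on the question's data may be useful. In the sketch below the lowercased words are precomputed into a set (the name lowered is introduced here for illustration); otherwise the logic is the same. The tokens keep the casing found in the text, and the two-word entry is never matched because the text is split into single words.

```python
import re

words_to_match = ['heLLo', 'jumP', 'TEST', 'RESEARCH stuff']
text = 'hello this is jUmp test jump and research stuff'

# Precompute the lowercased words once instead of rebuilding the list per token.
lowered = {w.lower() for w in words_to_match}

# Each token comes from the text, so it carries the text's casing.
results = set(word for word in re.findall(r"[\w']+", text) if word.lower() in lowered)
print(results)
```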

Edit:

You can also use:

import string
results = set(word for word in "".join(c if c not in string.punctuation else " " for c in text).split() 
              if word.lower() in [w.lower() for w in words_to_match])

This avoids importing re, but you have to import string instead.

Edit 2: (hopefully after reading the question correctly this time)

results = set(word for word in words_to_match if word.lower() in text.lower())

This also works for multi-word searches.
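One caveat with the substring test: the in operator looks for substrings, not whole words, so a word can be reported even when it only occurs inside a longer word. The sample text below is made up to show this.

```python
words_to_match = ['TEST', 'jumP']
text = 'the contest was jumpy'  # hypothetical text: neither word appears on its own

# Same substring check as above: 'test' is found inside 'contest'
# and 'jump' inside 'jumpy'.
results = set(word for word in words_to_match if word.lower() in text.lower())
print(results)
```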

Edit 3:

results = set(word for word in words_to_match if re.search(r"\b" + re.escape(word.lower()) + r"\b", text.lower()))
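The word-boundary search can be wrapped in a small helper. This is a sketch, not a definitive implementation: the function name match_preserving_case is hypothetical, and it uses re.IGNORECASE plus re.escape in place of lowercasing both sides by hand.

```python
import re

def match_preserving_case(words, text):
    """Return the entries of `words` that occur in `text` as whole words, ignoring case."""
    return {
        w for w in words
        # re.escape keeps regex metacharacters in a word from being interpreted.
        if re.search(r"\b" + re.escape(w) + r"\b", text, re.IGNORECASE)
    }

words_to_match = ['heLLo', 'jumP', 'TEST', 'RESEARCH stuff']
text = 'hello this is jUmp test jump and research stuff'
print(match_preserving_case(words_to_match, text))
```

Because it searches once per word, the cost grows with the length of the list; for long lists a single combined pattern may be the better choice.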

Answer 2

Try this:

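words_to_match = ['heLLo', 'jumP', 'TEST']
text = 'hello this is jUmp test jump'
lowered_text = text.lower()  # lowercase the text once, outside the loop
result = set()
for word in words_to_match:  # `word` instead of `str`, which shadows the built-in
    if word.lower() in lowered_text:
        result.add(word)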

Regex: match ignoring case, but keep the result in a specific case
