Question

I am extracting codes from a list of strings taken from email bodies or subject lines. The list looks like this:

text_list = ['RV: Final model review and algorithm COde 053 and also with CODE52','CODE22/coDe129','CODE178/coDe029']

What I have tried so far:

def containsDigit(word):
    if re.search("\d", word):
        return word

regex = re.compile('[CcOoDdEe]{4,}')
codes = []
codes_found = []

for text in text_list:
    codes_found.append(regex.findall(text))
    for code in codes_found:
        codes.append(containsDigit(code))

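My problem is that I cannot extract the number that follows the code, whether it is directly attached or separated from it by a ' ' space.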

The output I want is:

codes = ['COde 053', 'CODE52','CODE22','coDe129','CODE178','coDe029']

Answer 1

You can use

import re
text_list = ['RV: Final model review and algorithm COde 053 and also with CODE52','CODE22/coDe129','CODE178/coDe029']
rx = re.compile(r'\bcode\s*\d+', re.I)
res = []
for text in text_list:
    m = rx.findall(text)
    if len(m) > 0:
        res.extend(m)

print(res)
# => ['COde 053', 'CODE52', 'CODE22', 'coDe129', 'CODE178', 'coDe029']

See the Python demo.

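The re.compile(r'\bcode\s*\d+', re.I) regex matches code case-insensitively (due to re.I), and only where it starts at a word boundary (due to \b), then matches zero or more whitespace characters (\s*), and then one or more digits (\d+).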
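To see what each part of the pattern contributes, here is a minimal sketch built on the same regex. The capture group around the digits and the helper names `extract_codes` and `extract_numbers` are additions made for illustration only, and the 'barcode 7' and 'code_9' inputs are made-up test strings, not part of the original data.

```python
import re

# Same pattern as in the answer, plus a capture group around the digits
# (the group is an illustrative addition).
CODE_RX = re.compile(r'\bcode\s*(\d+)', re.I)

def extract_codes(texts):
    """Return every full match (e.g. 'COde 053') in order of appearance."""
    return [m.group(0) for text in texts for m in CODE_RX.finditer(text)]

def extract_numbers(texts):
    """Return only the digit part of each match, as strings (leading zeros kept)."""
    return [m.group(1) for text in texts for m in CODE_RX.finditer(text)]

text_list = ['RV: Final model review and algorithm COde 053 and also with CODE52',
             'CODE22/coDe129', 'CODE178/coDe029']

print(extract_codes(text_list))
# => ['COde 053', 'CODE52', 'CODE22', 'coDe129', 'CODE178', 'coDe029']
print(extract_numbers(text_list))
# => ['053', '52', '22', '129', '178', '029']

# \b rejects 'code' inside a longer word, and \s* only allows whitespace
# between the word and the digits.
print(extract_codes(['barcode 7', 'code_9']))
# => []
```

Note that \b is only placed before code, so the pattern checks where the word starts, not where the digits end.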

Extracting codes using regex (irregular codes)
