How to find the next 9 characters after a string, ignoring special characters?

Time: 2019-04-26 11:20:05

Tags: python regex string

Consider the following string:

str_test = 'This is a sample text NRC234456789 and this is another case AZN.1.Z.3.4.S.6.7.8.9 and this another case BSA 123 456 789 and final case SSR/789456123'

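Basically, I need to find the positions of the markers "NRC", "AZN", "BSA" and "SSR" in the string, and then extract the next 9 digits after each one, ignoring any non-digit characters.

In some cases the digit 5 is mistakenly written as S and the digit 2 is mistakenly written as Z. I still need to recognize these cases and change the wrong S and Z to 5 and 2 respectively.

So it should return:

result = ['NRC234456789', 'AZN123456789', 'BSA123456789', 'SSR789456123']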

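I have this piece of code that I am using:

import re

list_comb = ['NRC', 'AZN', 'BSA', 'SSR']
def findWholeWord(w):
    return re.compile(r'\b({0})\b'.format(w), flags=re.IGNORECASE).search

It returns the position where the string is found, but I am not sure how to proceed from there. Thanks.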

3 Answers:

Answer 0: (score: 0)

Use this regex to identify the pattern. Maybe it helps:

import re

str_test = 'This is a sample text NRC234456789 and this is another case AZN.1.2.3.4.5.6.7.8.9 and this another case BSA 123 456 789 and final case SSR/789456123'
regex = re.findall(r"([A-Z0-9.\s/]{2,})", str_test)
result = []

One solution, if the only non-digit characters are dots, spaces and slashes:

for r in regex:
    result.append(r.replace(".","").replace(" ","").replace("/",""))
print (result)

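Or, if the non-digit characters can be anything, use this loop instead:

for r in regex:
    # \W removes everything that is not a letter, digit or underscore
    result.append(re.sub(r"\W", "", r))
print(result)

Output: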

['NRC234456789', 'AZN123456789', 'BSA123456789', 'SSR789456123']

Updated:

import re

str_test = 'This is a sample text NRC234456789 and this is another case AZN.1.Z.3.4.S.6.7.8.9 and this another case BSA 123 456 789 and final case SSR/789456123'
# Group 1: the three-letter prefix; group 2: the digits and separators that follow.
regex = re.findall(r"([A-Z]{3})([A-Z0-9.\s/]{2,})", str_test)
result = []
for prefix, tail in regex:
    # Drop the separators, then undo the typos (Z -> 2, S -> 5) in the digit part only.
    digits = re.sub(r"\W", "", tail).replace("Z", "2").replace("S", "5")
    result.append(prefix + digits)

print(result)

Output:

['NRC234456789', 'AZN123456789', 'BSA123456789', 'SSR789456123']
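Note that `[A-Z]{3}` accepts any three capital letters, so unrelated upper-case tokens followed by digits would also end up in the result. Below is a minimal sketch of one way to guard against that, assuming the four prefixes from the question are the only ones of interest. The names `wanted`, `fix_digits` and `extract_codes`, and the sample sentence with "USA 2019", are made up for illustration.

```python
import re

wanted = {'NRC', 'AZN', 'BSA', 'SSR'}      # assumption: prefixes taken from the question
fix_digits = str.maketrans('SZ', '52')     # S -> 5, Z -> 2

def extract_codes(text):
    """Return prefix + cleaned digits for every wanted prefix found in text."""
    found = []
    for prefix, tail in re.findall(r"([A-Z]{3})([A-Z0-9.\s/]{2,})", text):
        if prefix not in wanted:
            continue  # skip unrelated three-letter tokens such as "USA"
        # Remove the separators, then repair the mistyped digits.
        digits = re.sub(r"\W", "", tail).translate(fix_digits)
        found.append(prefix + digits)
    return found

# Hypothetical input containing an unrelated upper-case token.
sample = 'USA 2019 report: AZN.1.Z.3.4.S.6.7.8.9 and SSR/789456123'
print(extract_codes(sample))
```

The translation is applied only to the part after the prefix, so the S characters in "SSR" are left alone.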

Answer 1: (score: 0)

Here is one approach.

Example:

import re

str_test = 'This is a sample text NRC234456789 and this is another case AZN.1.2.3.4.5.6.7.8.9 and this another case BSA 123 456 789 and final case SSR/789456123'
to_check = ['NRC', 'AZN', 'BSA', 'SSR']
pattern = re.compile(r"(" + "|".join(to_check) + r")([\d+.\s/]+)")

for k, v in pattern.findall(str_test):
    print(k + re.sub(r"[^\d]", "", v))

Output:

NRC234456789
AZN123456789
BSA123456789
SSR789456123

Edit as per the comment:

import re

str_test = 'This is a sample text NRC234456789 and this is another case AZN.1.Z.3.4.S.6.7.8.9 and this another case BSA 123 456 789 and final case SSR/789456123'
to_check = ['NRC', 'AZN', 'BSA', 'SSR']
pattern = re.compile(r"(" + "|".join(to_check) + r")([\d+.\s/ZS]+)")

for k, v in pattern.findall(str_test):
    new_val = k + re.sub(r"[^\d]", "", v.replace("Z", "2").replace("S", "5"))
    print(new_val)
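The edited loop above collects every digit that follows a prefix, so a longer run such as "BSA 123 456 789 123 456" would give more than nine digits. The following is a rough sketch of the same idea wrapped in a function that keeps only the first nine. The function name `extract_ids`, the `n_digits` parameter and the choice to skip matches with fewer than nine digits are assumptions added here, not part of the answer.

```python
import re

def extract_ids(text, prefixes, n_digits=9):
    """Collect prefix + the first n_digits digits that follow it."""
    alternation = "|".join(re.escape(p) for p in prefixes)
    pattern = re.compile(r"(" + alternation + r")([\d.\s/ZS]+)")
    results = []
    for prefix, tail in pattern.findall(text):
        # Undo the known typos first, then drop every non-digit character.
        digits = re.sub(r"\D", "", tail.replace("Z", "2").replace("S", "5"))
        # Assumption: a match with fewer than n_digits digits is not a valid ID.
        if len(digits) >= n_digits:
            results.append(prefix + digits[:n_digits])
    return results

str_test = 'This is a sample text NRC234456789 and this is another case AZN.1.Z.3.4.S.6.7.8.9 and this another case BSA 123 456 789 and final case SSR/789456123'
print(extract_ids(str_test, ['NRC', 'AZN', 'BSA', 'SSR']))
# ['NRC234456789', 'AZN123456789', 'BSA123456789', 'SSR789456123']
```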

Answer 2: (score: 0)

Here is a simple approach that first finds the required text using this regex,

\b(?:NRC|AZN|BSA|SSR)(?:.?\d)+

which is generated dynamically from the supplied list, and then removes all non-alphanumeric characters from each match.
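The two steps just described could be sketched roughly as follows. This is an illustration rather than the answerer's original code: it assumes the question's string without the S/Z typos, and `cleaned` is a name introduced here.

```python
import re

s = 'This is a sample text NRC234456789 and this is another case AZN.1.2.3.4.5.6.7.8.9 and this another case BSA 123 456 789 and final case SSR/789456123'

list_comb = ['NRC', 'AZN', 'BSA', 'SSR']
# Step 1: build \b(?:NRC|AZN|BSA|SSR)(?:.?\d)+ from the list.
regex = r'\b(?:{})(?:.?\d)+'.format('|'.join(map(re.escape, list_comb)))

# Step 2: strip every non-alphanumeric character from each match.
cleaned = [re.sub(r'[^a-zA-Z0-9]+', '', m) for m in re.findall(regex, s)]
print(cleaned)
# ['NRC234456789', 'AZN123456789', 'BSA123456789', 'SSR789456123']
```

With the mistyped input (AZN.1.Z.3.4.S...), `(?:.?\d)+` stops at the first Z, so this version only picks up "AZN.1" for that entry.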

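Edit: To handle corrupted strings where 2 is mistakenly written as Z and 5 is mistakenly written as S, you can replace them in the second part of the string while leaving the first three characters alone. The code has also been updated so that it picks only the next nine digits and no more. Here is my updated Python code,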

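import re

s = 'This is a sample text NRC234456789 and this is another case AZN.1.Z.3.4.S.6.7.8.9 and this another case BSA 123 456 789 and BSA 123 456 789 123 456 final case SSR/789456123'

list_comb = ['NRC', 'AZN', 'BSA', 'SSR']
# Build the pattern from the list: a prefix, then digits or capitals,
# each with at most one separator character in front of it.
regex = r'\b(?:{})(?:.?[\dA-Z])+'.format('|'.join(list_comb))
print(regex)

for m in re.findall(regex, s):
    # Strip every non-alphanumeric separator.
    m = re.sub(r'[^a-zA-Z0-9]+', '', m)
    # Keep the three-letter prefix and only the next nine characters.
    mat = re.search(r'^(.{3})(.{9})', m)
    if mat:
        s1 = mat.group(1)
        # Fix the typos in the digit part only, so a prefix like "SSR" stays intact.
        s2 = mat.group(2).replace('S', '5').replace('Z', '2')
        print(s1 + s2)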

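After the generated pattern shown by print(regex), this prints the corrected values, with S replaced by 5 and Z replaced by 2: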

NRC234456789
AZN123456789
BSA123456789
BSA123456789
SSR789456123
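As a possible variation, the "next nine digits" rule can be put into the pattern itself with a `{9}` quantifier, so no second search is needed. The sketch below is an alternative, not the answerer's code. It assumes that only S and Z can stand in for digits and that separators are never letters. The names `pattern`, `fix` and `extract` are hypothetical.

```python
import re

list_comb = ['NRC', 'AZN', 'BSA', 'SSR']
prefix = '|'.join(map(re.escape, list_comb))
# A prefix, then exactly nine digit-like characters (0-9, S or Z),
# each optionally preceded by non-alphanumeric separators.
pattern = re.compile(r'\b(' + prefix + r')((?:[^0-9A-Za-z]*[0-9SZ]){9})')
fix = str.maketrans('SZ', '52')  # S -> 5, Z -> 2

def extract(text):
    out = []
    for m in pattern.finditer(text):
        # Keep only the digit-like characters, then repair the typos.
        digits = re.sub(r'[^0-9SZ]', '', m.group(2)).translate(fix)
        out.append(m.group(1) + digits)
    return out

s = 'This is a sample text NRC234456789 and this is another case AZN.1.Z.3.4.S.6.7.8.9 and this another case BSA 123 456 789 and BSA 123 456 789 123 456 final case SSR/789456123'
print(extract(s))
# ['NRC234456789', 'AZN123456789', 'BSA123456789', 'BSA123456789', 'SSR789456123']
```

A prefix followed by fewer than nine digit-like characters produces no match at all with this pattern, which may or may not be the behaviour you want.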