I have a file (only part of it is shown here) from which I want to remove a particular string.
OTU1359 UniRef90_A0A095VQ09 UniRef90_A0A0C1UI80 UniRef90_A0A1M4ZSK2 UniRef90_A0A1W1CJV7 UniRef90_A0A1Z9J2X0 UniRef90_A0A1Z9THL2 UniRef90_A0A2E3B6A5 UniRef90_A0A2E5MT47 UniRef90_A0A2E5VCW9 UniRef90_A0A2E6CDK4 UniRef90_A0A2E6KTE6 UniRef90_A0A2E8AIM6 UniRef90_A0A2E8RIG1 UniRef90_A0A2E8YNS3 UniRef90_A0A2E9VEK0 UniRef90_W6RCT6
OTU0980 UniRef90_A0A084TMQ7 UniRef90_A0A090PK65 UniRef90_A0A0P1G8P0 UniRef90_A0A0P1IHL1 UniRef90_A0A286ILS7 UniRef90_A0A2A5E7H9 UniRef90_A0A2D9J217 UniRef90_H3NS47 UniRef90_H3NSN9 UniRef90_H3NSP0 UniRef90_H3NSP7 UniRef90_H3NUB2 UniRef90_H3NY28 UniRef90_H3NY47 UniRef90_UPI000C2CBC51
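I want to remove the string "OTUXXXX" (it always starts with OTU, followed by exactly 4 digits). A line may contain more than one OTUXXXX.
I tried:
re.search("OTU[0-9]{4}", line)
It does not work. Can anyone help?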
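Answer 0: (score: 1)
You can use re.sub to actually perform the replacement: it replaces the matched text with the text you supply. The documentation is here: https://docs.python.org/3/library/re.html
Here is one possible implementation: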
from re import compile, sub, MULTILINE
text = '''
OTU1359 UniRef90_A0A095VQ09 UniRef90_A0A0C1UI80 UniRef90_A0A1M4ZSK2 UniRef90_A0A1W1CJV7 UniRef90_A0A1Z9J2X0 UniRef90_A0A1Z9THL2 UniRef90_A0A2E3B6A5 UniRef90_A0A2E5MT47 UniRef90_A0A2E5VCW9 UniRef90_A0A2E6CDK4 UniRef90_A0A2E6KTE6 UniRef90_A0A2E8AIM6 UniRef90_A0A2E8RIG1 UniRef90_A0A2E8YNS3 UniRef90_A0A2E9VEK0 UniRef90_W6RCT6
OTU0980 UniRef90_A0A084TMQ7 UniRef90_A0A090PK65 UniRef90_A0A0P1G8P0 UniRef90_A0A0P1IHL1 UniRef90_A0A286ILS7 UniRef90_A0A2A5E7H9 UniRef90_A0A2D9J217 UniRef90_H3NS47 UniRef90_H3NSN9 UniRef90_H3NSP0 UniRef90_H3NSP7 UniRef90_H3NUB2 UniRef90_H3NY28 UniRef90_H3NY47 UniRef90_UPI000C2CBC51
'''
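replacement = ''
# MULTILINE only changes how ^ and $ behave, so it has no effect on this pattern
regex = compile(r'OTU\d{4}', flags=MULTILINE)
# Replace every match with the empty string
cleaned = sub(regex, replacement, text)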
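The difference between the attempt in the question and the answer above can be seen in a minimal sketch: re.search only reports where a match is and leaves the string untouched, while re.sub returns a new string with the matches replaced. The helper name remove_otu and the shortened sample line are assumptions made for illustration; they are not part of the original post.

```python
import re

# Compile once so the pattern can be reused for every line of the file.
OTU_PATTERN = re.compile(r'OTU\d{4}')

def remove_otu(line):
    """Return the line with every OTUxxxx token removed and outer whitespace trimmed."""
    return OTU_PATTERN.sub('', line).strip()

# Shortened, hypothetical sample line in the same shape as the data in the question.
sample = "OTU1359 UniRef90_A0A095VQ09 UniRef90_W6RCT6"

# re.search finds the match but does not modify the string.
match = re.search(r'OTU[0-9]{4}', sample)
found = match.group()

# re.sub (here via the compiled pattern) builds the cleaned string.
cleaned_line = remove_otu(sample)
print(found)
print(cleaned_line)
```

To process a whole file, the same helper could be applied to each line while iterating over the open file object.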
Answer 1: (score: 0)
I suggest using re.sub and matching the pattern as a whole word, to avoid partial matches inside other words.
s = re.sub(r"\s*\bOTU[0-9]{4}\b", "", line).strip()
The .strip() at the end removes any leading or trailing whitespace left behind when a match at the start or end of the string is removed.
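As a rough illustration of why the word boundaries matter, the sketch below compares the whole-word pattern with the plain one. The function name strip_otu_tokens and the token XOTU12345 are made up for the example; the sample data in the question contains no such token.

```python
import re

def strip_otu_tokens(line):
    """Remove whole-word OTUxxxx tokens plus the whitespace before them."""
    return re.sub(r"\s*\bOTU[0-9]{4}\b", "", line).strip()

# Hypothetical line: two real OTU tokens and one longer word that merely contains "OTU1234".
line = "OTU1359 UniRef90_A0A095VQ09 OTU0980 UniRef90_W6RCT6 XOTU12345"

with_boundaries = strip_otu_tokens(line)

# Without \b the pattern also bites into the longer word.
without_boundaries = re.sub(r"OTU[0-9]{4}", "", "XOTU12345")

print(with_boundaries)
print(without_boundaries)
```

Because \s* consumes the whitespace in front of each match, removing a token from the middle of a line does not leave a double space behind.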