Python regular expression to find the content between two strings or phrases

Date: 2012-06-26 07:40:56

Tags: python regex

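How can I use a regular expression in Python to capture whatever lies between two strings or phrases, and delete everything else on that line?

For example, below is a protein sequence that begins with a single-line header. How do I extract "CG33289-PC" from the header below, defined as whatever comes after the phrase "FlyBase_Annotation_IDs:" and before the next comma ","?

I need to replace the header with this simplified result, "CG33289-PC", without destroying the protein sequence (the all-caps text found below the header line).

This is what each protein sequence entry looks like, a header followed by a sequence: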

  

>FBpp0293870 type=protein;loc=3L:join(21527760..21527913,21527977..21528076,21528130..21528390,21528443..21528653,21528712..21529192,21529254..21529264); ID=FBpp0293870; name=CG33289-PC; parent=FBgn0053289,FBtr0305327; dbxref=FlyBase:FBpp0293870,FlyBase_Annotation_IDs:CG33289-PC; MD5=478485a27487608aa2b6c35d39a3295c; length=405; release=r5.45; species=Dmel;   MEMLKYVISDNNYSWWIKLYFAIIFALVLFVAVNLAVGIYNKWDSTPVII   GISSKMTPIDQIPFPTITVCNMNQAKKSKVEHLMPGSIRYAMLQKTCYKE   SNFSQYMDTQHRNETFSNFILDVSEKCADLIVSCIFHQQRIPCTDIFRET   FVDEGLCCIFNVLHPYYLYKFKSPYIRDFTSSDRFADIAVDWDPISGYPQ   RLPSSYYPRPGVGVGTSMGLQIVLNGHVDDYFCSSTNGQGFKILLYNPID   QPRMKESGLPVMIGHQTSFRIIARNVEATPSIRNIHRTKRQCIFSDEQEL   LFYRYYTRRNCEAECDSMFFLRLCSCIPYYLPLIYPNASVCDVFHFECLN   RAESQIFDLQSSQCKEFCLTSCHDLIFFPDAFSTPFSQKDVKAQTNYLTN   FSRAV

This is the desired output:

  

CG33289-PC
  MEMLKYVISDNNYSWWIKLYFAIIFALVLFVAVNLAVGIYNKWDSTPVII   GISSKMTPIDQIPFPTITVCNMNQAKKSKVEHLMPGSIRYAMLQKTCYKE   SNFSQYMDTQHRNETFSNFILDVSEKCADLIVSCIFHQQRIPCTDIFRET   FVDEGLCCIFNVLHPYYLYKFKSPYIRDFTSSDRFADIAVDWDPISGYPQ   RLPSSYYPRPGVGVGTSMGLQIVLNGHVDDYFCSSTNGQGFKILLYNPID   QPRMKESGLPVMIGHQTSFRIIARNVEATPSIRNIHRTKRQCIFSDEQEL   LFYRYYTRRNCEAECDSMFFLRLCSCIPYYLPLIYPNASVCDVFHFECLN   RAESQIFDLQSSQCKEFCLTSCHDLIFFPDAFSTPFSQKDVKAQTNYLTN   FSRAV

4 Answers:

Answer 0 (score: 2)

Using a regular expression:

>>> s = """>FBpp0293870 type=protein;loc=3L:join(21527760..21527913,21527977..21528076,21528130..21528390,21528443..21528653,21528712..21529192,21529254..21529264); ID=FBpp0293870; name=CG33289-PC; parent=FBgn0053289,FBtr0305327; dbxref=FlyBase:FBpp0293870,FlyBase_Annotation_IDs:CG33289-PC; MD5=478485a27487608aa2b6c35d39a3295c; length=405; release=r5.45; species=Dmel; MEMLKYVISDNNYSWWIKLYFAIIFALVLFVAVNLAVGIYNKWDSTPVII
GISSKMTPIDQIPFPTITVCNMNQAKKSKVEHLMPGSIRYAMLQKTCYKE
SNFSQYMDTQHRNETFSNFILDVSEKCADLIVSCIFHQQRIPCTDIFRET
FVDEGLCCIFNVLHPYYLYKFKSPYIRDFTSSDRFADIAVDWDPISGYPQ
RLPSSYYPRPGVGVGTSMGLQIVLNGHVDDYFCSSTNGQGFKILLYNPID
QPRMKESGLPVMIGHQTSFRIIARNVEATPSIRNIHRTKRQCIFSDEQEL
LFYRYYTRRNCEAECDSMFFLRLCSCIPYYLPLIYPNASVCDVFHFECLN
RAESQIFDLQSSQCKEFCLTSCHDLIFFPDAFSTPFSQKDVKAQTNYLTN
FSRAV"""
>>> import re
>>> print(re.sub(r'.*FlyBase_Annotation_IDs:([\w-]+).*;', r'\1\n', s))
CG33289-PC
 MEMLKYVISDNNYSWWIKLYFAIIFALVLFVAVNLAVGIYNKWDSTPVII
GISSKMTPIDQIPFPTITVCNMNQAKKSKVEHLMPGSIRYAMLQKTCYKE
SNFSQYMDTQHRNETFSNFILDVSEKCADLIVSCIFHQQRIPCTDIFRET
FVDEGLCCIFNVLHPYYLYKFKSPYIRDFTSSDRFADIAVDWDPISGYPQ
RLPSSYYPRPGVGVGTSMGLQIVLNGHVDDYFCSSTNGQGFKILLYNPID
QPRMKESGLPVMIGHQTSFRIIARNVEATPSIRNIHRTKRQCIFSDEQEL
LFYRYYTRRNCEAECDSMFFLRLCSCIPYYLPLIYPNASVCDVFHFECLN
RAESQIFDLQSSQCKEFCLTSCHDLIFFPDAFSTPFSQKDVKAQTNYLTN
FSRAV
>>> 

Answer 1 (score: 1)

Not an elegant solution, but this should work for you:

>>> fly = 'FlyBase_Annotation_IDs'
>>> repl = 'CG33289-PC'
>>> part1, part2 = protein.split(fly)
>>> part2 = part2.replace(repl, "FooBar")
>>> protein = fly.join([part1, part2])

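This assumes FlyBase_Annotation_IDs appears exactly once in the data; otherwise unpacking the result of split() into two names raises a ValueError.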
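
The same split idea can also pull the ID out directly, without a regular expression. This is only a minimal sketch: it assumes the ID always ends at the next semicolon, and both the helper name `extract_id` and the shortened header are made up for illustration.

```python
MARKER = "FlyBase_Annotation_IDs:"

def extract_id(header):
    """Return the text between MARKER and the next ';', or None if MARKER is absent."""
    # partition() never raises: `found` is an empty string when MARKER is missing
    _, found, rest = header.partition(MARKER)
    if not found:
        return None
    # Keep everything up to the first semicolon after the marker
    return rest.split(";", 1)[0].strip()

# Shortened, hypothetical header based on the example in the question
header = (">FBpp0293870 type=protein; "
          "dbxref=FlyBase:FBpp0293870,FlyBase_Annotation_IDs:CG33289-PC; length=405;")
print(extract_id(header))
```

Unlike the two-way unpacking of `split()`, `partition()` also copes with headers where the marker is missing.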

Answer 2 (score: 1)

I'm not sure about the format of the file, but this regular expression captures the data in the example:

"FlyBase_Annotation_IDs:([A-Z0-9a-z-]*);"

Use the re.findall function to get the matches.
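
A rough sketch of how that might look with `re.findall`. The two shortened records are hypothetical (the second ID is invented for illustration), and the pattern assumes every ID is followed directly by a semicolon.

```python
import re

# Capture the run of letters, digits, and dashes between the marker and the ';'
pattern = re.compile(r"FlyBase_Annotation_IDs:([A-Z0-9a-z-]*);")

# Shortened, hypothetical records; only the first ID comes from the question
text = (
    ">FBpp0293870 dbxref=FlyBase:FBpp0293870,FlyBase_Annotation_IDs:CG33289-PC; length=405;\n"
    "MEMLKYVISDNNYSWWIKLYFAIIFALVLFVAVNLAVGIYNKWDSTPVII\n"
    ">FBpp0000001 dbxref=FlyBase:FBpp0000001,FlyBase_Annotation_IDs:CG00001-PA; length=10;\n"
    "MEMLKYVISD\n"
)

# With exactly one capture group, findall returns a list of the captured strings
ids = pattern.findall(text)
print(ids)
```

Note that `findall` only collects the IDs; it does not rewrite the headers, so a substitution is still needed for the output the question asks for.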

Answer 3 (score: 1)

Assuming there is a newline after the header:

>>> import re
>>> protein = "..."
>>> r = re.compile(r"^.*FlyBase_Annotation_IDs:([A-Z0-9a-z-]*);.*$", re.MULTILINE)
>>> r.sub(r"\1", protein)

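The group ([A-Z0-9a-z-]*) in the regular expression captures any run of alphanumeric characters and dashes. If the ID can contain other characters, just add them to the character class.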
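
To show the substitution on more than one entry, here is a small sketch under the same assumption (each header sits on its own line). The helper name `simplify_headers`, the shortened records, and the second ID are hypothetical; the leading `>` in the pattern is an addition so that only header lines can match.

```python
import re

# MULTILINE makes ^ and $ match at every line boundary, and '.' never crosses
# a newline, so each match is confined to a single header line.
HEADER_RE = re.compile(
    r"^>.*FlyBase_Annotation_IDs:([A-Z0-9a-z-]*);.*$", re.MULTILINE
)

def simplify_headers(fasta_text):
    """Replace each matching header line with just its annotation ID."""
    return HEADER_RE.sub(r"\1", fasta_text)

# Shortened, hypothetical records; only the first ID comes from the question
records = (
    ">FBpp0293870 dbxref=FlyBase:FBpp0293870,FlyBase_Annotation_IDs:CG33289-PC; length=405;\n"
    "MEMLKYVISDNNYSWWIKLYFAIIFALVLFVAVNLAVGIYNKWDSTPVII\n"
    ">FBpp0000001 dbxref=FlyBase:FBpp0000001,FlyBase_Annotation_IDs:CG00001-PA; length=10;\n"
    "MEMLKYVISD\n"
)

print(simplify_headers(records))
```

Sequence lines are left untouched because they neither start with `>` nor contain the marker.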