如何检查列表元素是否包含Python中的正则表达式?

时间:2016-04-16 11:53:30

标签: python regex list

我有一个包含序列信息的文件。每个序列都有一些行。序列由五条白线分开。我想将文件更改为列表,并将其拆分为5个换行符。所以我有一个列表,每个序列作为一个元素。然后我想删除不包含正则表达式的序列。最后,我想要一个列表,只包含包含正则表达式的序列。

现在我有了这个。任何人都可以帮助我吗?

import re
def main():
    ReadFile()
    file = open ("filename.txt", "r")
    CreateList(file, data)
    RegEx(file, data)

def ReadFile()
    try:
        file = open ("filename.txt", "r")
    except IOError:
        print ("Can't open the file")
    except:
        print ("Something went wrong.")

def CreateList(file, data)
    data = file.readlines()
    data = data.split('\n\n\n\n\n')

def RegEx(file, data)
    regex = ("[AG].{4}GK[ST]") 
    for element in data:
        if regex not in element: 
            data.remove(element) 
    print (data) 

main()

文件看起来像:

Hits for PS00017|ATP_GTP_A (pattern) ATP/GTP-binding site motif A (P-loop) :  [occurs frequently]
   Pattern: [AG]-x(4)-G-K-[ST]
   Approximate number of expected random matches in ~ 100'000 sequences (50'000'000 residues): 3371


>sp|Q6GZX2|003R_FRG3G  (438 aa)
Uncharacterized protein 3R.  [Frog virus 3 (isolate Goorha) (FV-3)]
MARPLLGKTSSVRRRLESLSACSIFFFLRKFCQKMASLVFLNSPVYQMSNILLTERRQVDRAMGGSDDDGVMVVALSPSD
FKTVLGSALLAVERDMVHVVPKYLQTPGILHDMLVLLTPIFGEALSVDMSGATDVMVQQIATAGFVDVDPLHSSVSWKDN
VSCPVALLAVSNAVRTMMGQPCQVTLIIDVGTQNILRDLVNLPVEMSGDLQVMAYTKDPLGKVPAVGVSVFDSGSVQKGD
AHSVGAPDGLVSFHTHPVSSAVELNYHAGWPSNVDMSSLLTMKNLMHVVVAEEGLWTMARTLSMQRLTKVLTDAEKDVMR
AAAFNLFLPLNELRVMGTKDSNNKSLKTYFEVFETFTIGALMKHSGVTPTAFVDRRWLDNTIYHMGFIPWGRDMRFVVEY
DLDGTNPFLNTVPTLMSVKRKAKIQEMFDNMVSRMVTS
      2 - 9:          ArpllGKT


>sp|Q6GZX1|004R_FRG3G  (60 aa)
Uncharacterized protein 004R.  [Frog virus 3 (isolate Goorha) (FV-3)]
MNAKYDTDQGVGRMLFLGTIGLAVVVGGLMAYGYYYDGKTPSSGTSFHTASPSFSSRYRY
      33 - 40:        GyyydGKT


>sp|Q6GZW0|015R_FRG3G  (322 aa)
Uncharacterized protein 015R.  [Frog virus 3 (isolate Goorha) (FV-3)]
MEQVPIKEMRLSDLRPNNKSIDTDLGGTKLVVIGKPGSGKSTLIKALLDSKRHIIPCAVVISGSEEANGFYKGVVPDLFI
YHQFSPSIIDRIHRRQVKAKAEMGSKKSWLLVVIDDCMDNAKMFNDKEVRALFKNGRHWNVLVVIANQYVMDLTPDLRSS
VDGVFLFRENNVTYRDKTYANFASVVPKKLYPTVMETVCQNYRCMFIDNTKATDNWHDSVFWYKAPYSKSAVAPFGARSY
WKYACSKTGEEMPAVFDNVKILGDLLLKELPEAGEALVTYGGKDGPSDNEDGPSDDEDGPSDDEEGLSKDGVSEYYQSDL
DD
      34 - 41:        GkpgsGKS',


>sp|P32234|128UP_DROME  (368 aa)
GTP-binding protein 128up.  [Drosophila melanogaster (Fruit fly)]
MSTILEKISAIESEMARTQKNKATSAHLGLLKAKLAKLRRELISPKGGGGGTGEAGFEVAKTGDARVGFVGFPSVGKSTL
LSNLAGVYSEVAAYEFTTLTTVPGCIKYKGAKIQLLDLPGIIEGAKDGKGRGRQVIAVARTCNLIFMVLDCLKPLGHKKL
LEHELEGFGIRLNKKPPNIYYKRKDKGGINLNSMVPQSELDTDLVKTILSEYKIHNADITLRYDATSDDLIDVIEGNRIY
IPCIYLLNKIDQISIEELDVIYKIPHCVPISAHHHWNFDDLLELMWEYLRLQRIYTKPKGQLPDYNSPVVLHNERTSIED
FCNKLHRSIAKEFKYALVWGSSVKHQPQKVGIEHVLNDEDVVQIVKKV
      71 - 78:        GfpsvGKS

应该是数据(但只包含含有RegEx的蛋白质):

['>sp|Q6GZX2|003R_FRG3G  (438 aa)
Uncharacterized protein 3R.  [Frog virus 3 (isolate Goorha) (FV-3)]
MARPLLGKTSSVRRRLESLSACSIFFFLRKFCQKMASLVFLNSPVYQMSNILLTERRQVDRAMGGSDDDGVMVVALSPSD
FKTVLGSALLAVERDMVHVVPKYLQTPGILHDMLVLLTPIFGEALSVDMSGATDVMVQQIATAGFVDVDPLHSSVSWKDN
VSCPVALLAVSNAVRTMMGQPCQVTLIIDVGTQNILRDLVNLPVEMSGDLQVMAYTKDPLGKVPAVGVSVFDSGSVQKGD
AHSVGAPDGLVSFHTHPVSSAVELNYHAGWPSNVDMSSLLTMKNLMHVVVAEEGLWTMARTLSMQRLTKVLTDAEKDVMR
AAAFNLFLPLNELRVMGTKDSNNKSLKTYFEVFETFTIGALMKHSGVTPTAFVDRRWLDNTIYHMGFIPWGRDMRFVVEY
DLDGTNPFLNTVPTLMSVKRKAKIQEMFDNMVSRMVTS
      2 - 9:          ArpllGKT',


'>sp|Q6GZX1|004R_FRG3G  (60 aa)
Uncharacterized protein 004R.  [Frog virus 3 (isolate Goorha) (FV-3)]
MNAKYDTDQGVGRMLFLGTIGLAVVVGGLMAYGYYYDGKTPSSGTSFHTASPSFSSRYRY
      33 - 40:        GyyydGKT',


'>sp|Q6GZW0|015R_FRG3G  (322 aa)
Uncharacterized protein 015R.  [Frog virus 3 (isolate Goorha) (FV-3)]
MEQVPIKEMRLSDLRPNNKSIDTDLGGTKLVVIGKPGSGKSTLIKALLDSKRHIIPCAVVISGSEEANGFYKGVVPDLFI
YHQFSPSIIDRIHRRQVKAKAEMGSKKSWLLVVIDDCMDNAKMFNDKEVRALFKNGRHWNVLVVIANQYVMDLTPDLRSS
VDGVFLFRENNVTYRDKTYANFASVVPKKLYPTVMETVCQNYRCMFIDNTKATDNWHDSVFWYKAPYSKSAVAPFGARSY
WKYACSKTGEEMPAVFDNVKILGDLLLKELPEAGEALVTYGGKDGPSDNEDGPSDDEDGPSDDEEGLSKDGVSEYYQSDL
DD
      34 - 41:        GkpgsGKS',


'>sp|P32234|128UP_DROME  (368 aa)
GTP-binding protein 128up.  [Drosophila melanogaster (Fruit fly)]
MSTILEKISAIESEMARTQKNKATSAHLGLLKAKLAKLRRELISPKGGGGGTGEAGFEVAKTGDARVGFVGFPSVGKSTL
LSNLAGVYSEVAAYEFTTLTTVPGCIKYKGAKIQLLDLPGIIEGAKDGKGRGRQVIAVARTCNLIFMVLDCLKPLGHKKL
LEHELEGFGIRLNKKPPNIYYKRKDKGGINLNSMVPQSELDTDLVKTILSEYKIHNADITLRYDATSDDLIDVIEGNRIY
IPCIYLLNKIDQISIEELDVIYKIPHCVPISAHHHWNFDDLLELMWEYLRLQRIYTKPKGQLPDYNSPVVLHNERTSIED
FCNKLHRSIAKEFKYALVWGSSVKHQPQKVGIEHVLNDEDVVQIVKKV
      71 - 78:        GfpsvGKS']

1 个答案:

答案 0 :(得分:0)

import re
file = open("ploop.txt")
text = file.read()
file.close()

proteins = text.split("\n\n")[1:]
proteinsMatching = []
toWrite = "" 

for protein in proteins:
    if re.search(r"[AG].{4}GK[ST]", protein):
        proteinsMatching.append(protein)        


for protein in proteinsMatching:
    accensionCode = re.findall(r">sp\|(.{6})", protein)[0]
    organism = re.findall(r"\n.+?\[(.+?)\]", protein)[0]
    print(accensionCode, organism)
    toWrite += accensionCode + " " + organism + "\n"

f = open("results.txt", "w+")
f.write(toWrite)
f.close()

# Q6GZX2 Frog virus 3 (isolate Goorha) (FV-3)
# Q6GZX1 Frog virus 3 (isolate Goorha) (FV-3)
# Q6GZW0 Frog virus 3 (isolate Goorha) (FV-3)
# P32234 Drosophila melanogaster (Fruit fly)

(再次)更新了新要求

Regex1(将文本文件拆分为蛋白质列表:) https://regex101.com/r/gU0gX5/1

Regex2(你的正则表达式显示它们都匹配)https://regex101.com/r/nZ0pD6/1