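I have a file containing sequence information. Each sequence spans several lines, and the sequences are separated by five blank lines. I want to read the file into a list by splitting it on five newline characters, so that each sequence is one element. Then I want to remove the sequences that do not match a regular expression, so that I end up with a list containing only the matching sequences.
This is what I have so far. Can anyone help me?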
import re

def main():
    ReadFile()
    file = open ("filename.txt", "r")
    CreateList(file, data)
    RegEx(file, data)

def ReadFile()
    try:
        file = open ("filename.txt", "r")
    except IOError:
        print ("Can't open the file")
    except:
        print ("Something went wrong.")

def CreateList(file, data)
    data = file.readlines()
    data = data.split('\n\n\n\n\n')

def RegEx(file, data)
    regex = ("[AG].{4}GK[ST]")
    for element in data:
        if regex not in element:
            data.remove(element)
    print (data)

main()
The file looks like this:
Hits for PS00017|ATP_GTP_A (pattern) ATP/GTP-binding site motif A (P-loop) : [occurs frequently]
Pattern: [AG]-x(4)-G-K-[ST]
Approximate number of expected random matches in ~ 100'000 sequences (50'000'000 residues): 3371
>sp|Q6GZX2|003R_FRG3G (438 aa)
Uncharacterized protein 3R. [Frog virus 3 (isolate Goorha) (FV-3)]
MARPLLGKTSSVRRRLESLSACSIFFFLRKFCQKMASLVFLNSPVYQMSNILLTERRQVDRAMGGSDDDGVMVVALSPSD
FKTVLGSALLAVERDMVHVVPKYLQTPGILHDMLVLLTPIFGEALSVDMSGATDVMVQQIATAGFVDVDPLHSSVSWKDN
VSCPVALLAVSNAVRTMMGQPCQVTLIIDVGTQNILRDLVNLPVEMSGDLQVMAYTKDPLGKVPAVGVSVFDSGSVQKGD
AHSVGAPDGLVSFHTHPVSSAVELNYHAGWPSNVDMSSLLTMKNLMHVVVAEEGLWTMARTLSMQRLTKVLTDAEKDVMR
AAAFNLFLPLNELRVMGTKDSNNKSLKTYFEVFETFTIGALMKHSGVTPTAFVDRRWLDNTIYHMGFIPWGRDMRFVVEY
DLDGTNPFLNTVPTLMSVKRKAKIQEMFDNMVSRMVTS
2 - 9: ArpllGKT
>sp|Q6GZX1|004R_FRG3G (60 aa)
Uncharacterized protein 004R. [Frog virus 3 (isolate Goorha) (FV-3)]
MNAKYDTDQGVGRMLFLGTIGLAVVVGGLMAYGYYYDGKTPSSGTSFHTASPSFSSRYRY
33 - 40: GyyydGKT
>sp|Q6GZW0|015R_FRG3G (322 aa)
Uncharacterized protein 015R. [Frog virus 3 (isolate Goorha) (FV-3)]
MEQVPIKEMRLSDLRPNNKSIDTDLGGTKLVVIGKPGSGKSTLIKALLDSKRHIIPCAVVISGSEEANGFYKGVVPDLFI
YHQFSPSIIDRIHRRQVKAKAEMGSKKSWLLVVIDDCMDNAKMFNDKEVRALFKNGRHWNVLVVIANQYVMDLTPDLRSS
VDGVFLFRENNVTYRDKTYANFASVVPKKLYPTVMETVCQNYRCMFIDNTKATDNWHDSVFWYKAPYSKSAVAPFGARSY
WKYACSKTGEEMPAVFDNVKILGDLLLKELPEAGEALVTYGGKDGPSDNEDGPSDDEDGPSDDEEGLSKDGVSEYYQSDL
DD
34 - 41: GkpgsGKS
>sp|P32234|128UP_DROME (368 aa)
GTP-binding protein 128up. [Drosophila melanogaster (Fruit fly)]
MSTILEKISAIESEMARTQKNKATSAHLGLLKAKLAKLRRELISPKGGGGGTGEAGFEVAKTGDARVGFVGFPSVGKSTL
LSNLAGVYSEVAAYEFTTLTTVPGCIKYKGAKIQLLDLPGIIEGAKDGKGRGRQVIAVARTCNLIFMVLDCLKPLGHKKL
LEHELEGFGIRLNKKPPNIYYKRKDKGGINLNSMVPQSELDTDLVKTILSEYKIHNADITLRYDATSDDLIDVIEGNRIY
IPCIYLLNKIDQISIEELDVIYKIPHCVPISAHHHWNFDDLLELMWEYLRLQRIYTKPKGQLPDYNSPVVLHNERTSIED
FCNKLHRSIAKEFKYALVWGSSVKHQPQKVGIEHVLNDEDVVQIVKKV
71 - 78: GfpsvGKS
The data should look like this (but containing only the proteins that match the regex):
['>sp|Q6GZX2|003R_FRG3G (438 aa)
Uncharacterized protein 3R. [Frog virus 3 (isolate Goorha) (FV-3)]
MARPLLGKTSSVRRRLESLSACSIFFFLRKFCQKMASLVFLNSPVYQMSNILLTERRQVDRAMGGSDDDGVMVVALSPSD
FKTVLGSALLAVERDMVHVVPKYLQTPGILHDMLVLLTPIFGEALSVDMSGATDVMVQQIATAGFVDVDPLHSSVSWKDN
VSCPVALLAVSNAVRTMMGQPCQVTLIIDVGTQNILRDLVNLPVEMSGDLQVMAYTKDPLGKVPAVGVSVFDSGSVQKGD
AHSVGAPDGLVSFHTHPVSSAVELNYHAGWPSNVDMSSLLTMKNLMHVVVAEEGLWTMARTLSMQRLTKVLTDAEKDVMR
AAAFNLFLPLNELRVMGTKDSNNKSLKTYFEVFETFTIGALMKHSGVTPTAFVDRRWLDNTIYHMGFIPWGRDMRFVVEY
DLDGTNPFLNTVPTLMSVKRKAKIQEMFDNMVSRMVTS
2 - 9: ArpllGKT',
'>sp|Q6GZX1|004R_FRG3G (60 aa)
Uncharacterized protein 004R. [Frog virus 3 (isolate Goorha) (FV-3)]
MNAKYDTDQGVGRMLFLGTIGLAVVVGGLMAYGYYYDGKTPSSGTSFHTASPSFSSRYRY
33 - 40: GyyydGKT',
'>sp|Q6GZW0|015R_FRG3G (322 aa)
Uncharacterized protein 015R. [Frog virus 3 (isolate Goorha) (FV-3)]
MEQVPIKEMRLSDLRPNNKSIDTDLGGTKLVVIGKPGSGKSTLIKALLDSKRHIIPCAVVISGSEEANGFYKGVVPDLFI
YHQFSPSIIDRIHRRQVKAKAEMGSKKSWLLVVIDDCMDNAKMFNDKEVRALFKNGRHWNVLVVIANQYVMDLTPDLRSS
VDGVFLFRENNVTYRDKTYANFASVVPKKLYPTVMETVCQNYRCMFIDNTKATDNWHDSVFWYKAPYSKSAVAPFGARSY
WKYACSKTGEEMPAVFDNVKILGDLLLKELPEAGEALVTYGGKDGPSDNEDGPSDDEDGPSDDEEGLSKDGVSEYYQSDL
DD
34 - 41: GkpgsGKS',
'>sp|P32234|128UP_DROME (368 aa)
GTP-binding protein 128up. [Drosophila melanogaster (Fruit fly)]
MSTILEKISAIESEMARTQKNKATSAHLGLLKAKLAKLRRELISPKGGGGGTGEAGFEVAKTGDARVGFVGFPSVGKSTL
LSNLAGVYSEVAAYEFTTLTTVPGCIKYKGAKIQLLDLPGIIEGAKDGKGRGRQVIAVARTCNLIFMVLDCLKPLGHKKL
LEHELEGFGIRLNKKPPNIYYKRKDKGGINLNSMVPQSELDTDLVKTILSEYKIHNADITLRYDATSDDLIDVIEGNRIY
IPCIYLLNKIDQISIEELDVIYKIPHCVPISAHHHWNFDDLLELMWEYLRLQRIYTKPKGQLPDYNSPVVLHNERTSIED
FCNKLHRSIAKEFKYALVWGSSVKHQPQKVGIEHVLNDEDVVQIVKKV
71 - 78: GfpsvGKS']
Answer 0 (score: 0)
import re

# Read the whole file into one string.
with open("ploop.txt") as file:
    text = file.read()

# Split on blank lines; [1:] drops the first chunk (the header block).
proteins = text.split("\n\n")[1:]

# Keep only the proteins in which the regex finds a match.
proteinsMatching = []
for protein in proteins:
    if re.search(r"[AG].{4}GK[ST]", protein):
        proteinsMatching.append(protein)

# Extract the accession code and organism of each matching protein.
toWrite = ""
for protein in proteinsMatching:
    accessionCode = re.findall(r">sp\|(.{6})", protein)[0]
    organism = re.findall(r"\n.+?\[(.+?)\]", protein)[0]
    print(accessionCode, organism)
    toWrite += accessionCode + " " + organism + "\n"

with open("results.txt", "w") as f:
    f.write(toWrite)

# Output:
# Q6GZX2 Frog virus 3 (isolate Goorha) (FV-3)
# Q6GZX1 Frog virus 3 (isolate Goorha) (FV-3)
# Q6GZW0 Frog virus 3 (isolate Goorha) (FV-3)
# P32234 Drosophila melanogaster (Fruit fly)
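As for why the loop in the question does not filter anything correctly: `regex not in element` is a literal substring test, not a regex match, and calling `data.remove(element)` while iterating over `data` skips the element that follows each removal. The sketch below reproduces both effects on a small made-up list (the four strings are illustrative, not taken from your file) and shows the usual fix of building a new list.

```python
import re

pattern = re.compile(r"[AG].{4}GK[ST]")

# Made-up records: the first and last contain the motif, the middle two do not.
records = ["MARPLLGKTSS", "AAAA", "CCCC", "AYGYYYDGKTPSS"]

# `in` looks for the literal text "[AG].{4}GK[ST]", which never occurs in a sequence.
literal_hits = [r for r in records if "[AG].{4}GK[ST]" in r]

# Removing from a list while iterating over it skips the next element,
# so "CCCC" is never examined and stays in the list.
buggy = list(records)
for r in buggy:
    if not pattern.search(r):
        buggy.remove(r)

# Building a new list with re.search avoids both problems.
kept = [r for r in records if pattern.search(r)]

print(literal_hits)
print(buggy)
print(kept)
```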
Regex1 (splits the text file into a list of proteins): https://regex101.com/r/gU0gX5/1
Regex2 (your regex, showing that all of them match): https://regex101.com/r/nZ0pD6/1
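If you want the list of full records that the question asks for, rather than just the accession codes, something along these lines may work. It is only a sketch under a few assumptions: the function names `split_records` and `records_with_motif` are made up for illustration, the separator defaults to the five newline characters from the question (pass `"\n\n"` if your file uses single blank lines), and sequence lines are assumed to consist solely of upper-case letters. The sequence lines are joined before searching because `.` does not match a newline, so a motif that wraps across two lines would otherwise be missed.

```python
import re

MOTIF = re.compile(r"[AG].{4}GK[ST]")


def split_records(text, separator="\n\n\n\n\n"):
    """Split the text on the separator and drop empty chunks."""
    return [chunk.strip("\n") for chunk in text.split(separator) if chunk.strip()]


def records_with_motif(records, pattern=MOTIF):
    """Return the records whose joined sequence lines contain the pattern."""
    kept = []
    for record in records:
        # Assumption: a sequence line contains only upper-case letters,
        # which excludes the ">sp|..." header, the description and the hit line.
        sequence = "".join(
            line for line in record.splitlines()
            if line.isalpha() and line.isupper()
        )
        if pattern.search(sequence):
            kept.append(record)
    return kept
```

With the file contents read into a string `text`, `records_with_motif(split_records(text))` returns a list with one element per matching protein. The header block at the top of the file has no sequence lines, so it is filtered out as well.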