I want to parse a large text file whose records are separated by a line containing only '//'. My input file looks like this:
ID HRPA_ECOLI Reviewed; 130 AA.
AC P43329; P76861; P76863; P77479;
DE RecName: Full=ATP-dependent RNA helicase HrpA;
DE EC=3.6.4.13;
GN Name=hrpA; OrderedLocusNames=b1413, JW5905;
OS Escherichia coli (strain K12).
OC Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacterales;
OC Enterobacteriaceae; Escherichia.
OX NCBI_TaxID=83333;
DR RefSeq; NP_415931.4; NC_000913.3.
DR RefSeq; WP_000139543.1; NZ_LN832404.1.
DR ProteinModelPortal; P43329; -.
DR KEGG; ecj:JW5905; -.
DR KEGG; eco:b1413; -.
DR PATRIC; 32118112; VBIEscCol129921_1476.
DR KO; K03578; -.
DR GO; GO:0005737; C:cytoplasm; IBA:GO_Central.
DR GO; GO:0005524; F:ATP binding; IEA:UniProtKB-KW.
DR Gene3D; 3.40.50.300; -; 2.
DR InterPro; IPR003593; AAA+_ATPase.
DR InterPro; IPR011545; DEAD/DEAH_box_helicase_dom.
DR InterPro; IPR011709; DUF1605.
DR Pfam; PF00270; DEAD; 1.
DR Pfam; PF11898; DUF3418; 1.
DR SMART; SM00382; AAA; 1.
DR SMART; SM00487; DEXDc; 1.
DR SMART; SM00847; HA2; 1.
DR SMART; SM00490; HELICc; 1.
DR SUPFAM; SSF52540; SSF52540; 1.
DR TIGRFAMs; TIGR01967; DEAH_box_HrpA; 1.
DR PROSITE; PS51192; HELICASE_ATP_BIND_1; 1.
DR PROSITE; PS51194; HELICASE_CTER; 1.
PE 3: Inferred from homology;
KW ATP-binding; Complete proteome; Helicase; Hydrolase;
KW Nucleotide-binding; Reference proteome.
FT CHAIN 1 1300 ATP-dependent RNA helicase HrpA.
FT /FTId=PRO_0000055178.
FT DOMAIN 87 250 Helicase ATP-binding.
FT {ECO:0000255|PROSITE-ProRule:PRU00541}.
FT DOMAIN 274 444 Helicase C-terminal.
SQ SEQUENCE 1300 AA; 149028 MW; A26601266D771638 CRC64;
MTEQQKLTFT ALQQRLDSLM LRDRLRFSRR LHGVKKVKNP DAQQAIFQEM AKEIDQAAGK
VLLREAARPE ITYPDNLPVS QKKQDILEAI RDHQVVIVAG ETGSGKTTQL PKICMELGRG
IKGLIGHTQP
//
ID T1RK_ECOLI Reviewed; 1170 AA.
AC P08956; Q2M5W6;
DT 01-NOV-1988, integrated into UniProtKB/Swiss-Prot.
DT 24-NOV-2009, sequence version 3.
I also have an id.txt file in which each line holds one unique ID, for example:
NP_415931.4
...
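I want to match each ID against the input file and, where it matches, use regular expressions on that particular record of the input file to extract certain information and save it in an output csv file. For example, to match the pattern "GO:[0-9]" I came up with: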
#!/usr/bin/env python
import re
import pdb

def peon(DATA, LIST, OUTPUT, sentinel = '\n//', pattern = re.compile('GO:[0-9]+')):
    data = DATA.read()
    for item in LIST:
        find_me = item.strip()
        j = 0
        while True:
            i = data.find(find_me, j)
            if i < 0:
                break
            j = data.find(sentinel, i)
            if j < 0:
                j = len(data)
            result = pattern.findall(data[i:j])
            OUTPUT.write('{}\t{}\n'.format(find_me, ', '.join(result)))

def main(dataname, listname, outputname):
    with open(dataname, 'rt') as DATA:
        with open(listname, 'rt') as LIST:
            with open(outputname, 'wt') as OUTPUT:
                peon(DATA, LIST, OUTPUT)

if __name__ == '__main__':
    main('./input_file.txt', './id.txt', './output.csv')
It gives me output like this:
NP_415931.4 GO:0005737, GO:0005524
Now, the fields I want to extract are (Number, Header, Description):
1 RefSeq_ID As given in id.txt file
2 AA_Length In the line that starts with "ID" & ends with "AA."
3 Protein_Name After "RecName: Full="
4 EC_Number After "EC="
5 Organism In the line that starts with "OS"
6 NCBI_Taxid_ID After "NCBI_TaxID="
7 KEGG_ID After "KEGG;"
8 KO_ID After "KO;"
9 GO_ID As "GO:[NUMBER]"
10 InterPro_ID After "InterPro;"
11 InterPro_Description After InterPro_ID , i.e, after 10
12 Pfam_ID After "Pfam;"
13 Pfam_Description After Pfam_ID, i.e, after 12
14 PROSITE_ID After "PROSITE;"
15 PROSITE_Description After PROSITE_ID, i.e, after 14
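I have also attached a picture to make this clearer:
I want to extract all of these fields in one pass and save them in an output csv file with specific headers. After changing the regular expression, I extracted "AA_Length" like this: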
pattern = re.compile('[0-9]+ AA.')
It gives:
NP_415931.4 130 AA;
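But that is not exactly the pattern I need. I am also unsure how to write regular expressions that match the text before or after a marker, and how to combine them all in a single script.
How can I search for all of these patterns in a single script and save the output (with headers) in a csv file?
Thanks
PS: I would like the final output csv to look like:
My Excel sheet is here: https://sites.google.com/site/iicbbioinformatics/share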
Answer 0 (score: 1)
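Your data does not seem to have a fixed structure.
How about reading the whole file into DATA, splitting it at "\n//", and then running the regular expressions on each record that contains one of your IDs?
Basically: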
import re, csv

with open('./input_file.txt') as dfile:
    DATA = dfile.read()
with open('./id.txt') as lfile:
    # Skip blank lines: an empty string is "in" every record.
    IDS = [line.strip() for line in lfile if line.strip()]

headers = ['RefSeq_ID',
           'AA_Length',
           'Protein_Name',
           'EC_Number',
           'Organism',
           'NCBI_Taxid_ID',
           'KEGG_ID',
           'KO_ID',
           'GO_ID',
           'InterPro_ID',
           'InterPro_Description',
           'Pfam_ID',
           'Pfam_Description',
           'PROSITE_ID',
           'PROSITE_Description'
           ]

# newline='' (Python 3) lets the csv module control line endings itself.
with open('./output.csv', 'w', newline='') as ofile:
    csvfile = csv.DictWriter(ofile, headers)
    csvfile.writeheader()
    for DATASET in DATA.split('\n//'):
        # Find the first listed ID that occurs in this record.
        found_ids = {'RefSeq_ID': ""}
        for RefSeq_ID in IDS:
            if RefSeq_ID in DATASET:
                found_ids['RefSeq_ID'] = RefSeq_ID
                break
        if not found_ids['RefSeq_ID']:
            continue
        # Raw strings keep backslashes intact; each capture group is the text to keep.
        found_ids['AA_Length'] = ", ".join(re.findall(r'^ID.+\s+(\d+) AA\.$', DATASET, re.MULTILINE))
        found_ids['Protein_Name'] = ", ".join(re.findall(r'RecName: Full=(.+);', DATASET))
        found_ids['EC_Number'] = ", ".join(re.findall(r'EC=([\d\.]+);', DATASET))
        found_ids['Organism'] = ", ".join(re.findall(r'^OS\s+(.*)\.$', DATASET, re.MULTILINE))
        found_ids['NCBI_Taxid_ID'] = ", ".join(re.findall(r'NCBI_TaxID=(\d+);', DATASET))
        found_ids['KEGG_ID'] = ", ".join(re.findall(r'KEGG; (\w+:\w+\d+);', DATASET))
        found_ids['KO_ID'] = ", ".join(re.findall(r'KO; (K\d+);', DATASET))
        found_ids['GO_ID'] = ", ".join(re.findall(r'GO; (GO:\d+);', DATASET))
        found_ids['InterPro_ID'] = ", ".join(re.findall(r'InterPro; (IPR\d+);', DATASET))
        # Anchor on the ID so the description has no leading space.
        found_ids['InterPro_Description'] = ", ".join(re.findall(r'InterPro; IPR\d+; (.*)\.', DATASET))
        found_ids['Pfam_ID'] = ", ".join(re.findall(r'Pfam; (PF\d+);', DATASET))
        found_ids['Pfam_Description'] = ", ".join(re.findall(r'Pfam; PF\d+; (.*?);', DATASET))
        found_ids['PROSITE_ID'] = ", ".join(re.findall(r'PROSITE; (PS\d+);', DATASET))
        found_ids['PROSITE_Description'] = ", ".join(re.findall(r'PROSITE; PS\d+; (.*?);', DATASET))
        csvfile.writerow(found_ids)
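
Since the input is described as a large file, it may be worth reading it record by record instead of loading it whole. The following is only a minimal sketch of that idea, not a drop-in replacement: the names iter_records, extract, PATTERNS and REFSEQ are made up for illustration, only a few of the fifteen fields are shown, and it assumes that each wanted ID appears as the first field after "RefSeq;" on a DR line (as NP_415931.4 does in the sample). Comparing whole IDs against a set also avoids matching one ID as a substring of a longer one.

```python
import re

# Hypothetical subset of the patterns. Each has one capture group, so
# findall() returns only the text after the marker, not the marker itself.
PATTERNS = {
    'AA_Length': re.compile(r'^ID\s.*?(\d+) AA\.$', re.MULTILINE),
    'Protein_Name': re.compile(r'RecName: Full=(.+?);'),
    'GO_ID': re.compile(r'GO; (GO:\d+);'),
    'Pfam_ID': re.compile(r'Pfam; (PF\d+);'),
}
# Assumption: the ID of interest is the first field after "RefSeq;".
REFSEQ = re.compile(r'RefSeq; ([^;\s]+);')


def iter_records(lines):
    """Yield one record (a single string) per block terminated by a '//' line."""
    block = []
    for line in lines:
        if line.strip() == '//':
            if block:
                yield ''.join(block)
            block = []
        else:
            block.append(line)
    if block:  # the last record may lack a closing '//'
        yield ''.join(block)


def extract(record, wanted_ids):
    """Return a dict of fields if the record cites a wanted RefSeq ID, else None."""
    hits = [rid for rid in REFSEQ.findall(record) if rid in wanted_ids]
    if not hits:
        return None
    row = {'RefSeq_ID': hits[0]}
    for name, pattern in PATTERNS.items():
        row[name] = ', '.join(pattern.findall(record))
    return row
```

To use it, one would pass the open input file object to iter_records (a file iterates line by line, so only one record is held in memory at a time), build wanted_ids as a set from the non-blank lines of id.txt, and hand every non-None result of extract to csv.DictWriter.writerow.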