Retrieve specific lines from a text file and use them to create a dictionary

Time: 2017-11-16 10:28:29

Tags: python database dictionary text bioinformatics

I want to create a dictionary from certain lines of a text file. Here is a sample from my text file (each 'page' is delimited by '//'):

//
UNIQUE-ID - INDOLE-3-ACETYL-BETA-4-D-GLUCOSE
TYPES - Compounds
COMMON-NAME - 4-<i>O</i>-(indol-3-ylacetyl)-&beta;-D-glucose
DBLINKS - (HMDB "HMDB12213" NIL |kothari| 3594494404 NIL NIL)
INCHI - InChI=1S/C16H19NO7/c18-7-11-15(13(20)14(21)16(22)23-11)24-12(19)5-8-6-17-10-4-2-1-3-9(8)10/h1-4,6,11,13-18,20-22H,5,7H2/t11-,13-,14-,15-,16-/m1/s1
SMILES - C(C3(C(OC(=O)CC1(=CNC2(C=CC=CC1=2)))C(O)C(O)C(O)O3))O
SYNONYMS - indole-3-acetyl-&beta;-4-D-glucose
//
UNIQUE-ID - CPD-6783
TYPES - Myo-inositol-bisphosphates
COMMON-NAME - D-<i>myo</i>-inositol (2,4) bisphosphate
DBLINKS - (HMDB "HMDB03905" NIL |kothari| 3608597114 NIL NIL)
DBLINKS - (PUBCHEM "25245743" NIL |taltman| 3466375284 NIL NIL)
INCHI - InChI=1S/C6H14O12P2/c7-1-2(8)5(17-19(11,12)13)4(10)6(3(1)9)18-20(14,15)16/h1-10H,(H2,11,12,13)(H2,14,15,16)/p-4/t1-,2-,3+,4-,5-,6-/m0/s1
//

The dictionary should use the lines starting with 'INCHI - InChI=' as keys and the lines containing 'DBLINKS - (HMDB' as values. Here is my code:

data = open('compounds.dat', 'r', errors = 'ignore')
compounds = data.readlines()

INCHI_list = []
HMDB_list = []
dict_Inchi_HMDB = dict()

for i,line in enumerate(compounds):
    if 'INCHI - InChI=' in line:
        INCHI_list.append(line)

    if 'DBLINKS - (HMDB' in line:
        HMDB_list.append(line)
dict_Inchi_HMDB[INCHI_list] = HMDB_list

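This gives me the error TypeError: unhashable type: 'list'. I understand why I get the error, but I can't think of a better way to do this... Can anyone help?

Please note: 'DBLINKS - (HMDB' always comes before 'INCHI - InChI=', but the number of lines between the two can vary. Also, the data file contains hundreds of these examples, and the line 'DBLINKS - (HMDB' does not appear in every 'page'.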

1 answer:

Answer 0 (score: 1)

result = {}
with open('compounds.dat', errors='ignore') as my_file:  # With `with open`, the file is closed for you; no my_file.close() is needed afterwards.
    for page in my_file.read().split('//'):  # Each page is delimited by //, so split on that to get a list of pages.
        curr_k = None  # Reset per page so a key from another page is never reused by mistake.
        for line in reversed(page.split('\n')):  # Split on every newline and reverse the list so the INCHI line comes up first; we use it as the key.
            k, sep, v = line.partition(' - ')  # If line is e.g. 'UNIQUE-ID - CPD-6783', this gives k = 'UNIQUE-ID', v = 'CPD-6783'.
            if not sep:  # Blank lines and lines without ' - ' have nothing to unpack, so skip them.
                continue
            if k == 'INCHI':  # We found our key line.
                curr_k = v  # Remember the key so that when we hit the DBLINKS - (HMDB line, we can assign the value to the correct key.
            elif k == 'DBLINKS' and v.startswith('(HMDB') and curr_k is not None:  # We found our value line, and this page has a key.
                result[curr_k] = v  # Assign the value to the key we remembered (curr_k).
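
The original attempt fails because a list is mutable and therefore cannot be used as a dictionary key; each key has to be a single hashable value such as the InChI string itself. As an alternative to reading the whole file and reversing each page, here is a minimal sketch that walks the lines once in their original order. It assumes that a page ends at a line consisting only of '//', and that a page may carry several HMDB links, so the values are lists. The function name build_inchi_to_hmdb and the shortened sample records (with placeholder InChI strings) are made up for illustration and are not part of the question's data.

```python
import io


def build_inchi_to_hmdb(lines):
    """Map each page's InChI string to a list of its HMDB DBLINKS values."""
    result = {}
    inchi = None      # InChI value seen in the current page, if any
    hmdb_links = []   # HMDB DBLINKS values seen in the current page

    for raw in lines:
        line = raw.rstrip('\n')
        if line.strip() == '//':
            # End of a page: store it only if it has both a key and a value.
            if inchi is not None and hmdb_links:
                result[inchi] = hmdb_links
            inchi, hmdb_links = None, []
            continue
        key, sep, value = line.partition(' - ')
        if not sep:
            continue  # blank line or a line without ' - '
        if key == 'INCHI':
            inchi = value
        elif key == 'DBLINKS' and value.startswith('(HMDB'):
            hmdb_links.append(value)

    # Handle a last page that is not followed by a closing '//'.
    if inchi is not None and hmdb_links:
        result[inchi] = hmdb_links
    return result


# Shortened, made-up sample in the same layout as the question's file.
sample = io.StringIO(
    '//\n'
    'UNIQUE-ID - CPD-A\n'
    'DBLINKS - (HMDB "HMDB12213" NIL |kothari| 3594494404 NIL NIL)\n'
    'INCHI - InChI=1S/AAA\n'
    '//\n'
    'UNIQUE-ID - CPD-B\n'
    'INCHI - InChI=1S/BBB\n'
    '//\n'
)
mapping = build_inchi_to_hmdb(sample)
print(mapping)
```

With the real file, the same function could be called as build_inchi_to_hmdb(my_file) inside the with open(...) block, since a file object yields its lines one at a time. Pages without an HMDB link, like the second sample record, are simply left out of the result.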