Efficiently finding the intersection of two huge dictionaries

Time: 2014-05-05 09:31:11

Tags: python dictionary bioinformatics

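I wrote some code to find the common IDs in line[1] (the second column) of two different files. My input files are large (20,000 lines). If I split the input into many small files, I get more intersecting IDs, whereas if I feed in the whole file at once, I get fewer. I cannot figure out why. Can you tell me what is wrong, and how to improve this code to avoid the problem?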

fileA = open("file1.txt",'r')
fileB = open("file2.txt",'r')
output = open("result.txt",'w')

dictA = dict()
for line1 in fileA:
    listA = line1.split('\t')
    dictA[listA[1]] = listA

dictB = dict()
for line1 in fileB:
    listB = line1.split('\t')
    dictB[listB[1]] = listB

for key in dictB:
    if key in dictA:
        output.write(dictA[key][0]+'\t'+dictA[key][1]+'\t'+dictB[key][4]+'\t'+dictB[key][5]+'\t'+dictB[key][9]+'\t'+dictB[key][10])

My file1 is sorted by line[0] and has columns 0-15:

contig17    GRMZM2G052619_P03  98 109 2 0 15 67 78.8 0 127 5 420 0 304 45
contig33    AT2G41790.1        98 420 2 0 21 23 78.8 1 127 5 420 2 607 67
contig98    GRMZM5G888620_P01  87 470 1 0 17 28 78.8 1 127 7 420 2 522 18  
contig102   GRMZM5G886789_P02  73 115 1 0 34 45 78.8 0 134 5 421 0 456 50  
contig123   AT3G57470.1        83 201 2 1 12 43 78.8 0 134 9 420 0 305 50

My file2 is not sorted and has columns 0-10:

GRMZM2G052619 GRMZM2G052619_P03 4 2345 GO:0043531 ADP binding "Interacting selectively and non-covalently with ADP" [GOC:jl] molecular_function PF07525  1        
GRMZM5G888620 GRMZM5G888620_P01 1 2367 GO:0011551 DNA binding "Any molecular function by which a gene product interacts selectively and non-covalently with DNA" [GOC:jl] molecular_function PF07589  4    
GRMZM5G886789 GRMZM5G886789_P02 1 4567 GO:0055516 ADP binding "Interacting selectively and non-covalently with ADP" [GOC:jl] molecular_function PF07526 0    

My desired output:

contig17    GRMZM2G052619_P03  GO:0043531 ADP binding molecular_function PF07525
contig98    GRMZM5G888620_P01  GO:0011551 DNA binding molecular_function PF07589 
contig102   GRMZM5G886789_P02  GO:0055516 ADP binding molecular_function PF07526  

2 Answers:

Answer 0: (score: 2)

I really recommend using pandas for this kind of problem.

Proof that it can be done with pandas:

import pandas as pd  # install this, and read the docs
from io import StringIO  # only for this demo (Python 2: from StringIO import StringIO)

# simulating reading the first file
first_file = """contig17 GRMZM2G052619_P03 x
contig33 AT2G41790.1 x
contig98 GRMZM5G888620_P01 x
contig102 GRMZM5G886789_P02 x
contig123 AT3G57470.1 x"""

#simulating reading the second file
second_file = """y GRMZM2G052619_P03 y
y GRMZM5G888620_P01 y
y GRMZM5G886789_P02 y"""

# Here is how you open the files. Instead of using StringIO,
# you simply pass the file path. Give the correct separator,
# sep="\t" (for tab-separated data). Here I'm using a space.
# In names, put some relevant names for your columns.
f_df = pd.read_table(StringIO(first_file), 
                     header=None, 
                     sep=" ", 
                     names=['a', 'b', 'c'])
s_df = pd.read_table(StringIO(second_file), 
                     header=None, 
                     sep=" ", 
                     names=['d', 'e', 'f'])
# This is the hard bit. Here I am using a bit of my experience with pandas.
# Basically it selects the rows of the second data frame whose second
# column value "isin" the second column of the first data frame.
my_df = s_df[s_df.e.isin(f_df.b)]

Output: Out[180]:

    d   e                   f
0   y   GRMZM2G052619_P03   y
1   y   GRMZM5G888620_P01   y
2   y   GRMZM5G886789_P02   y
#you can save this with:
my_df.to_csv("result.txt", sep="\t")
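
The isin filter above keeps only rows of the second frame. To get columns from both files on one line, as in the desired output, an inner join is one option. The following is a minimal sketch, not the answerer's method: the column names (contig, protein_id, gene_id, go_id) and the shortened toy rows are assumptions made for illustration.

```python
import pandas as pd
from io import StringIO

# Toy stand-ins for the two files; real files would be passed as paths.
# Column names below are hypothetical, chosen only for this sketch.
first_file = (
    "contig17\tGRMZM2G052619_P03\n"
    "contig33\tAT2G41790.1\n"
    "contig98\tGRMZM5G888620_P01\n"
)
second_file = (
    "GRMZM2G052619\tGRMZM2G052619_P03\tGO:0043531\n"
    "GRMZM5G888620\tGRMZM5G888620_P01\tGO:0011551\n"
)

f_df = pd.read_csv(StringIO(first_file), sep="\t", header=None,
                   names=["contig", "protein_id"])
s_df = pd.read_csv(StringIO(second_file), sep="\t", header=None,
                   names=["gene_id", "protein_id", "go_id"])

# Inner join: keep only IDs present in both frames.
merged = pd.merge(f_df, s_df, on="protein_id", how="inner")

# Pick the columns wanted in the output file.
result = merged[["contig", "protein_id", "go_id"]]
```

Unlike a dictionary keyed by ID, a merge keeps every matching row when an ID occurs more than once, so duplicated IDs are not silently dropped.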

Cheers!

Answer 1: (score: 1)

This is almost the same, but wrapped in a function.

# Creates a function to do the reading for each file
def read_store(file_, dictio_):
    """Given a file name and a dictionary, store each line of the file
    in the dictionary, keyed by the ID in its second column."""
    import re
    with open(file_, 'r') as file_0:
        lines_file_0 = file_0.readlines()
    for line in lines_file_0:
        # I couldn't check it against the real files, but this should
        # capture the second whitespace-separated field (the ID)
        match = re.match(r"\S+\s+(\S+)", line)
        if match:
            dictio_[match.group(1)] = line

To use it, do:

file1 = {}
read_store("file1.txt", file1)

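Then compare them as usual, but I would split on \s instead of \t. That also splits between the words of a description, but those are easy to rejoin with " ".join(DictA[1:5]).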
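
The comparison step can be sketched as follows. One possible cause of the differing counts (an assumption; the question does not confirm it) is that an ID appearing on several lines overwrites the earlier entry in a plain dict, so a whole-file run reports each ID once while split files can report it once per chunk. This sketch keeps every row per ID and splits on any whitespace. The helper name read_rows_by_id, the shortened sample rows, and the duplicate row contig99 are hypothetical.

```python
from collections import defaultdict

def read_rows_by_id(lines, id_column=1):
    """Group whitespace-split rows by the value in id_column."""
    rows_by_id = defaultdict(list)
    for line in lines:
        fields = line.split()  # any run of whitespace; also drops the newline
        if len(fields) > id_column:  # skip blank or too-short lines
            rows_by_id[fields[id_column]].append(fields)
    return rows_by_id

# Shortened sample rows; contig99 is an invented duplicate ID for illustration.
lines_a = [
    "contig17\tGRMZM2G052619_P03\t98\n",
    "contig33\tAT2G41790.1\t98\n",
    "contig98\tGRMZM5G888620_P01\t87\n",
    "contig99\tGRMZM5G888620_P01\t80\n",
]
lines_b = [
    "GRMZM2G052619 GRMZM2G052619_P03 4 2345 GO:0043531 ADP binding\n",
    "GRMZM5G888620 GRMZM5G888620_P01 1 2367 GO:0011551 DNA binding\n",
]

rows_a = read_rows_by_id(lines_a)
rows_b = read_rows_by_id(lines_b)

# Dictionary key views support set operations, so this is the intersection.
common_ids = sorted(rows_a.keys() & rows_b.keys())

# One output tuple per matching pair of rows, so duplicates are not lost.
pairs = [(a[0], key, b[4])
         for key in common_ids
         for a in rows_a[key]
         for b in rows_b[key]]
```

With real files, an open file object can be passed as lines, since iterating over it yields one line at a time.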