Question

如何读取两个制表符分隔文件.txt并将它们一起映射到一个公共列。

例如，从这两个文件中创建基因到途径的映射：

第一个文件，pathway.txt

 Pathway                                                Protein
 Binding and Uptake of Ligands by Scavenger Receptors   P69905                                          
 Erythrocytes take up carbon dioxide and release oxygen P69905                                         
 Metabolism                                             P69905
 Amyloids                                               P02647
 Metabolism                                             P02647
 Hemostasis                                             P68871

第二个文件，gene.txt

 Gene   Protein
 Fabp3  P11404
 HBA1   P69905
 APOA1  P02647
 Hbb-b1 P02088
 HBB    P68871
 Hba    P01942

输出就像，

 Gene  Protein     Pathway
 Fabp3  P11404     
 HBA1   P69905     Binding and Uptake of Ligands by Scavenger Receptors, Erythrocytes take up carbon dioxide and release oxygen, Metabolism
 APOA1  P02647     Amyloids, Metabolism
 Hbb-b1 P02088
 HBB    P68871     Hemostasis
 Hba    P01942

如果没有途径对应于蛋白质id信息的基因，则留空。

更新：

 import pandas as pd

 file1= pd.read_csv("gene.csv")
 file2= pd.read_csv("pathway.csv")

 output = pd.concat([file1,file2]).fillna(" ")
 output= output[["Gene","Protein"]+list(output.columns[1:-1])]
 output.to_csv("mapping of gene to pathway.csv", index=False)

所以这只给了我合并的文件，这不是我预期的。

Answer 1

>>> from collections import defaultdict
>>> my_dict = defaultdict()
>>> f = open('pathway.txt')
>>> for x in f:
...     x = x.strip().split()
...     value,key = " ".join(x[:-1]),x[-1]
...     if my_dict.get(key,0)==0:
...         my_dict[key] = [value]
...     else:my_dict[key].append(value)
... 
>>> my_dict
defaultdict(None, {'P68871': ['Hemostasis'], 'Protein': ['Pathway'], 'P69905': ['Binding', 'Erythrocytes', 'Metabolism'], 'P02647': ['Amyloids', 'Metabolism']})
>>> f1 = open('gene.txt')
>>> for x in f1:
...     value,key = x.strip().split()
...     if my_dict.get(key,0)==0:
...         print("{:<15}{:<15}".format(value,key))
...     else: print("{:<15}{:<15}{}".format(value,key,", ".join(my_dict[key])))
... 
Gene           Protein        Pathway
Fabp3          P11404         
HBA1           P69905         Binding and Uptake of Ligands by Scavenger Receptors,  Erythrocytes take up carbon dioxide and release oxygen Metabolism
APOA1          P02647         Amyloids, Metabolism
Hbb-b1         P02088         
HBB            P68871         Hemostasis
Hba            P01942

Answer 2

class Protein:

    def __init__(self, protein, pathway = None, gene = ""):
        self.protein = protein
        self.pathways = []
        self.gene = gene

        if pathway is not None:
            self.pathways.append(pathway)
        return

    def __str__(self):
        return "%s\t%s\t%s" % (
            self.gene, 
            self.protein, 
            ", ".join([p for p in self.pathways]))      

# protein -> pathway map
proteins = {}

# get the pathways
f1 = file("pathways.txt")
for line in f1.readlines()[1:]: 
    tokens = line.split()
    pathway = " ".join(tokens[:-1])
    protein = tokens[-1]

    if protein in proteins:
        p = proteins[protein]
        p.pathways.append(pathway)
    else:
        p = Protein(protein = protein, pathway = pathway)
        proteins[protein] = p

# get the genes
f2 = file("genes.txt")
for line in f2.readlines()[1:]:
    gene, protein = line.split()

    if protein in proteins:
        p = proteins[protein]    
        p.gene = gene
    else:
        p = Protein(protein = protein, gene = gene)
        proteins[protein] = p

# print the results
print "Gene\tProtein\tPathway"
for protein in proteins.values():
    print protein

通过一个公共列python将两个文本文件合并在一起

2 个答案: