如何读取两个制表符分隔文件.txt并将它们一起映射到一个公共列。
例如,从这两个文件中创建基因到途径的映射:
第一个文件,pathway.txt
Pathway Protein
Binding and Uptake of Ligands by Scavenger Receptors P69905
Erythrocytes take up carbon dioxide and release oxygen P69905
Metabolism P69905
Amyloids P02647
Metabolism P02647
Hemostasis P68871
第二个文件,gene.txt
Gene Protein
Fabp3 P11404
HBA1 P69905
APOA1 P02647
Hbb-b1 P02088
HBB P68871
Hba P01942
输出就像,
Gene Protein Pathway
Fabp3 P11404
HBA1 P69905 Binding and Uptake of Ligands by Scavenger Receptors, Erythrocytes take up carbon dioxide and release oxygen, Metabolism
APOA1 P02647 Amyloids, Metabolism
Hbb-b1 P02088
HBB P68871 Hemostasis
Hba P01942
如果没有途径对应于蛋白质id信息的基因,则留空。
更新:
import pandas as pd
file1= pd.read_csv("gene.csv")
file2= pd.read_csv("pathway.csv")
output = pd.concat([file1,file2]).fillna(" ")
output= output[["Gene","Protein"]+list(output.columns[1:-1])]
output.to_csv("mapping of gene to pathway.csv", index=False)
所以这只给了我合并的文件,这不是我预期的。
答案 0 :(得分:0)
>>> from collections import defaultdict
>>> my_dict = defaultdict()
>>> f = open('pathway.txt')
>>> for x in f:
... x = x.strip().split()
... value,key = " ".join(x[:-1]),x[-1]
... if my_dict.get(key,0)==0:
... my_dict[key] = [value]
... else:my_dict[key].append(value)
...
>>> my_dict
defaultdict(None, {'P68871': ['Hemostasis'], 'Protein': ['Pathway'], 'P69905': ['Binding', 'Erythrocytes', 'Metabolism'], 'P02647': ['Amyloids', 'Metabolism']})
>>> f1 = open('gene.txt')
>>> for x in f1:
... value,key = x.strip().split()
... if my_dict.get(key,0)==0:
... print("{:<15}{:<15}".format(value,key))
... else: print("{:<15}{:<15}{}".format(value,key,", ".join(my_dict[key])))
...
Gene Protein Pathway
Fabp3 P11404
HBA1 P69905 Binding and Uptake of Ligands by Scavenger Receptors, Erythrocytes take up carbon dioxide and release oxygen Metabolism
APOA1 P02647 Amyloids, Metabolism
Hbb-b1 P02088
HBB P68871 Hemostasis
Hba P01942
答案 1 :(得分:0)
class Protein:
def __init__(self, protein, pathway = None, gene = ""):
self.protein = protein
self.pathways = []
self.gene = gene
if pathway is not None:
self.pathways.append(pathway)
return
def __str__(self):
return "%s\t%s\t%s" % (
self.gene,
self.protein,
", ".join([p for p in self.pathways]))
# protein -> pathway map
proteins = {}
# get the pathways
f1 = file("pathways.txt")
for line in f1.readlines()[1:]:
tokens = line.split()
pathway = " ".join(tokens[:-1])
protein = tokens[-1]
if protein in proteins:
p = proteins[protein]
p.pathways.append(pathway)
else:
p = Protein(protein = protein, pathway = pathway)
proteins[protein] = p
# get the genes
f2 = file("genes.txt")
for line in f2.readlines()[1:]:
gene, protein = line.split()
if protein in proteins:
p = proteins[protein]
p.gene = gene
else:
p = Protein(protein = protein, gene = gene)
proteins[protein] = p
# print the results
print "Gene\tProtein\tPathway"
for protein in proteins.values():
print protein