I have this dataset:
Epitope,ID,Frequency,Assay
AVNIVGYSNAQGVDY,123431,27.0,Tetramer
DIKYTWNVPKI,887473,50.0,3H
LRQMRTVTPIRMQGG,34234,11.9,Elispot
AVNIVGYSNAQGVDY,3456,67.0,Tetramer
I would like to know how to read it and produce output like this:
d = {'AVNIVGYSNAQGVDY': [ID[123431,3456],Frequency[27.0,67.0],Assay['Tetramer']], 'DIKYTWNVPKI': [ID[887473],Frequency[50.0],Assay['3H']], 'LRQMRTVTPIRMQGG': [ID[34234],Frequency[11.9],Assay['Elispot']]}
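That is, a dictionary with each unique epitope as a key. Each value holds one list per category (ID, Frequency, and Assay), and those lists collect the values from the duplicate rows, as you can see.
I can read the file with the following code: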
result = {}
for row in reader:
    dictlist = []
    key = row.pop('Epitope')
    if key in result:
        pass
    result[key] = row
print result
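But I don't know how to handle duplicates. That is, when an epitope appears more than once, how do I append its ID, Frequency, and Assay values?
Answer 0 (score: 1)
You need to use lists as the values and, for every row, append each column's value to the list stored under that column's key: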
from collections import defaultdict
result = defaultdict(lambda: defaultdict(list))
for row in reader:
    epitope = row.pop('Epitope')
    entry = result[epitope]
    for key, value in row.items():
        entry[key].append(value)
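The snippet above assumes `reader` is already a `csv.DictReader`. As a rough sketch of how the pieces might fit together, the same loop can be wrapped in a function that takes any iterable of text lines, such as an open file. The function name `group_by_epitope` and the file name `epitopes.csv` are made up for illustration, and the file-opening comment assumes Python 3.

```python
import csv
from collections import defaultdict


def group_by_epitope(lines):
    """Group CSV rows by their 'Epitope' column.

    `lines` is any iterable of text lines, e.g. an open file object.
    (Hypothetical helper, not part of the original answer.)
    """
    result = defaultdict(lambda: defaultdict(list))
    for row in csv.DictReader(lines):
        epitope = row.pop('Epitope')
        entry = result[epitope]
        for key, value in row.items():
            entry[key].append(value)
    # Convert the nested defaultdicts to plain dicts so that looking up
    # a missing key later raises KeyError instead of creating an entry.
    return {epitope: dict(columns) for epitope, columns in result.items()}


# Assumed usage; 'epitopes.csv' is a placeholder file name:
# with open('epitopes.csv', newline='') as f:
#     grouped = group_by_epitope(f)
```

Returning plain dicts is a design choice: a `defaultdict` silently creates an empty entry whenever a missing key is read, which can hide typos in epitope names later on.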
Demo:
>>> import csv
>>> from collections import defaultdict
>>> sample = '''\
... Epitope,ID,Frequency,Assay
... AVNIVGYSNAQGVDY,123431,27.0,Tetramer
... DIKYTWNVPKI,887473,50.0,3H
... LRQMRTVTPIRMQGG,34234,11.9,Elispot
... AVNIVGYSNAQGVDY,3456,67.0,Tetramer
... '''
>>> reader = csv.DictReader(sample.splitlines())
>>> result = defaultdict(lambda: defaultdict(list))
>>> for row in reader:
...     epitope = row.pop('Epitope')
...     entry = result[epitope]
...     for key, value in row.items():
...         entry[key].append(value)
...
>>> for key, value in result.items():
...     print key, dict(value)
...
AVNIVGYSNAQGVDY {'Frequency': ['27.0', '67.0'], 'Assay': ['Tetramer', 'Tetramer'], 'ID': ['123431', '3456']}
DIKYTWNVPKI {'Frequency': ['50.0'], 'Assay': ['3H'], 'ID': ['887473']}
LRQMRTVTPIRMQGG {'Frequency': ['11.9'], 'Assay': ['Elispot'], 'ID': ['34234']}
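Note that every value in the output above is a string and that `Assay` repeats `'Tetramer'`, whereas the structure in the question shows numeric IDs and frequencies and a single `'Tetramer'`. If that is what is wanted, one possible variation is sketched below. It assumes `ID` always parses as an integer and `Frequency` as a float, and the function name `group_typed` is made up for illustration.

```python
import csv


def group_typed(lines):
    """Group rows by epitope, converting ID/Frequency to numbers and
    keeping each Assay name only once per epitope.

    Hypothetical helper; assumes the columns are named exactly
    'Epitope', 'ID', 'Frequency' and 'Assay'.
    """
    result = {}
    for row in csv.DictReader(lines):
        # setdefault returns the existing entry for this epitope,
        # or stores and returns a fresh one on first sight.
        entry = result.setdefault(
            row['Epitope'], {'ID': [], 'Frequency': [], 'Assay': []})
        entry['ID'].append(int(row['ID']))
        entry['Frequency'].append(float(row['Frequency']))
        # Only record an assay name the first time it appears.
        if row['Assay'] not in entry['Assay']:
            entry['Assay'].append(row['Assay'])
    return result
```

One trade-off: once `Assay` is deduplicated, its list no longer lines up position by position with `ID` and `Frequency`, so this only makes sense if you do not need to know which assay produced which measurement.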