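I'm new to Python and can't figure out the right approach to this problem. Any help or links to useful tutorials would be much appreciated, since I need to do this kind of thing from time to time.
I have a CSV file that I need to reformat/modify a bit.
I need to store the number of samples in which each gene occurs.
Input file: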
AHCTF1: Sample1, Sample2, Sample4
AHCTF1: Sample2, Sample7, Sample12
AHCTF1: Sample5, Sample6, Sample7
Desired result:
AHCTF1 in 7 samples (Sample1, Sample2, Sample4, Sample5, Sample6, Sample7, Sample12)
Code:
import csv

f = open("/CSV-sorted.csv")
gene_prev = ""
hit_list = []
csv_f = csv.reader(f)
for lines in csv_f:
    #time.sleep(0.1)
    gene = lines[0]
    sample = lines[11].split(",")
    repeat = lines[8]
    for samples in sample:
        hit_list.append(samples)
    if gene == gene_prev:
        for samples in sample:
            hit_list.append(samples)
        print gene
        print hit_list
        print set(hit_list)
        print "samples:", len(set(hit_list))
    hit_list = []
    gene_prev = gene
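In short, I want to combine the hits for each gene and remove the duplicates.
Maybe a dictionary could do this: use the gene as the key and add the samples as values?
I found this, which looks similar/useful: How can I combine dictionaries with the same keys in python?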
Answer 0 (score: 1)
The standard way to remove duplicates is to convert to a set.
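As a quick illustration of that point, here is a minimal sketch. The list `hits` is made up for the example, using sample names from the question's first two input lines; it is not read from the actual file.

```python
# Hypothetical hit list built from the first two input lines of the question.
hits = ["Sample1", "Sample2", "Sample4", "Sample2", "Sample7", "Sample12"]

# Converting to a set collapses duplicates; note that a set has no defined order.
unique_hits = set(hits)

print(len(unique_hits))     # number of distinct samples
print(sorted(unique_hits))  # sorted (alphabetically) for a stable display
```

Because sets are unordered, sort the result whenever the display order matters.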
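But I think something is off in the way you read the file. First problem: it is not a CSV file (there is a colon between the first two fields). Second, what are these lines supposed to do?
gene = lines[0]
sample = lines[11].split(",")
repeat = lines[8]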
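If I were writing this, I would replace the ":" in the file with another ",". With that change, and using a dictionary of sets, your code would look something like: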
# Read in csv file and convert to list of list of entries. Use with so that
# the file is automatically closed when we are done with it
csvlines = []
with open("CSV-sorted.csv") as f:
    for line in f:
        # Use strip() to clean up trailing whitespace, use split() to split
        # on commas.
        a = [entry.strip() for entry in line.split(',')]
        csvlines.append(a)
# I'll print it here so you can see what it looks like:
print(csvlines)
# Next up: converting our list of lists to a dict of sets.
# Create empty dict
sample_dict = {}
# Fill in the dict
for line in csvlines:
    gene = line[0]  # gene is first entry
    samples = set(line[1:])  # rest of the entries are samples
    # If this gene is in the dict already then join the two sets of samples
    if gene in sample_dict:
        sample_dict[gene] = sample_dict[gene].union(samples)
    # otherwise just put it in
    else:
        sample_dict[gene] = samples
# Now you can print the dictionary:
print(sample_dict)
The output is:
[['AHCTF1', 'Sample1', 'Sample2', 'Sample4'], ['AHCTF1', 'Sample2', 'Sample7', 'Sample12'], ['AHCTF1', 'Sample5', 'Sample6', 'Sample7']]
{'AHCTF1': {'Sample12', 'Sample1', 'Sample2', 'Sample5', 'Sample4', 'Sample7', 'Sample6'}}
The second line is your dictionary.
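If you would rather not edit the file, a variant along the following lines might work. This is only a sketch under some assumptions: each line looks like `GENE: SampleA, SampleB, ...` as in the question's input, and sample names end in a number. The helper names `collect_samples`, `sample_number` and `summarize` are made up for this example. It uses `collections.defaultdict(set)` so that the "is the gene already in the dict?" check disappears, and it prints one summary line per gene in the format the question asks for.

```python
import re
from collections import defaultdict


def collect_samples(lines):
    """Map each gene to the set of samples seen across all of its lines."""
    samples_by_gene = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first colon: gene on the left, samples on the right.
        gene, _, rest = line.partition(":")
        samples_by_gene[gene.strip()].update(
            s.strip() for s in rest.split(",") if s.strip()
        )
    return samples_by_gene


def sample_number(name):
    """Sort key: the trailing number of a name like 'Sample12' (assumed format)."""
    match = re.search(r"(\d+)$", name)
    return int(match.group(1)) if match else 0


def summarize(samples_by_gene):
    """Build one 'GENE in N samples (...)' line per gene."""
    rows = []
    for gene, samples in samples_by_gene.items():
        ordered = sorted(samples, key=sample_number)
        rows.append("{} in {} samples ({})".format(gene, len(ordered), ", ".join(ordered)))
    return rows


# Input lines copied from the question; with a real file you could pass
# the open file object to collect_samples() instead of this list.
lines = [
    "AHCTF1: Sample1, Sample2, Sample4",
    "AHCTF1: Sample2, Sample7, Sample12",
    "AHCTF1: Sample5, Sample6, Sample7",
]
for row in summarize(collect_samples(lines)):
    print(row)
```

The numeric sort key is there because a plain `sorted()` would place Sample12 before Sample2.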