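I'm trying to build a dictionary without duplicates from a CSV file. Each row of the CSV contains two sample names (s1, s2, etc.), a gene name, the impact of the mutation in sample 1, and the impact of the mutation in sample 2. Here are two example rows from the CSV file: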
s1, s2, gene1, MODERATE, HIGH
s3, s4, gene2, HIGH, MODERATE
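My goal is a summary of how many samples carry a mutation in each gene, with a tag showing which of those samples have a HIGH-impact mutation.
For example: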
gene12 7 ['s1', 's3', 's4', 's10 [HIGH]', 's17', 's19', 's24 [HIGH]']
gene20 2 ['s10 [HIGH]', 's21']
My current code is as follows:
    import os
    import sys
    path = ("path/to/csv")
    open_csv = open(path+"csvfile", "r")
    read_csv = open_csv.read().splitlines()
    gene_dict = {}
    for line in read_csv:
        split_lines = line.split(", ")
        gene = split_lines[2]
        sample1 = split_lines[0]
        sample2 = split_lines[1]
        impact1 = split_lines[3]
        impact2 = split_lines[4]
        for i in range(0, len(read_csv)):
            if gene in gene_dict:
                if impact1 == "HIGH":
                    gene_dict[gene].append(sample1+" [HIGH]")
                if impact2 == "HIGH":
                    gene_dict[gene].append(sample2+" [HIGH]")
                else:
                    gene_dict[gene].append(sample1)
                    gene_dict[gene].append(sample2)
            else:
                gene_dict[gene] = [sample1]
    final_dict = {a:list(set(b)) for a, b in gene_dict.items()}
    for key, value in final_dict.items():
        genename = key
        num_samples = len([item for item in value if item])
        samples = value
        print(genename,num_samples,samples)
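My script works, except that I get duplicate samples. That is, if a sample has a HIGH-impact mutation in a gene, the final summary lists that sample twice. Here is an example of what I mean: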
gene12 8 ['s1', 's3', 's4', 's10 [HIGH]', 's17', 's19', 's24', 's24 [HIGH]']
gene20 3 ['s10', 's10 [HIGH]', 's21']
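It is probably the way I build the dictionary that causes the duplicates, but I can't figure it out. As you can see, for gene12, s24 is listed twice, which throws off the count. The same goes for s10 in gene20. Each of these samples is listed twice: once, correctly, with the HIGH tag, and once without it. However, s24 only has a HIGH-impact mutation in gene12, and s10 only has a HIGH-impact mutation in gene20. I hope this makes sense; I can clarify if needed. Thanks in advance for any help!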
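Answer 0 (score: 2)
It looks like your inner loop, for i in range(0, len(read_csv)):, processes the same row over and over and adds redundant entries. The if / if / else structure that adds the [HIGH] tag also looks broken: the else pairs only with the second if, so when impact1 is HIGH and impact2 is not, sample1 is appended once with the tag and once without it.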
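To make the if / if / else problem concrete, here is a minimal sketch that isolates just that branch logic for a single row. The helper name tag_samples is hypothetical and only exists for this illustration; the sample values are made up.

```python
def tag_samples(sample1, impact1, sample2, impact2):
    """Replicate the question's branch logic for one row (hypothetical helper)."""
    out = []
    if impact1 == "HIGH":
        out.append(sample1 + " [HIGH]")
    if impact2 == "HIGH":
        out.append(sample2 + " [HIGH]")
    else:
        # This branch runs whenever impact2 is not HIGH, regardless of impact1,
        # so sample1 is appended a second time without its tag.
        out.append(sample1)
        out.append(sample2)
    return out


print(tag_samples("s24", "HIGH", "s3", "MODERATE"))
# ['s24 [HIGH]', 's24', 's3']
```

Since 's24' and 's24 [HIGH]' are different strings, converting the list to a set afterwards cannot merge them.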
Corrected version:
    import os
    import sys
    path = ("path/to/csv")
    open_csv = open(path+"csvfile", "r")
    read_csv = open_csv.read().splitlines()
    gene_dict = {}
    for line in read_csv:
        split_lines = line.split(", ")
        gene = split_lines[2]
        sample1 = split_lines[0]
        sample2 = split_lines[1]
        impact1 = split_lines[3]
        impact2 = split_lines[4]
        # Tag each sample first, so every sample is added exactly once below
        if impact1 == "HIGH":
            sample1 = sample1 + " [HIGH]"
        if impact2 == "HIGH":
            sample2 = sample2 + " [HIGH]"
        if gene in gene_dict:
            gene_dict[gene].append(sample1)
            gene_dict[gene].append(sample2)
        else:
            gene_dict[gene] = [sample1, sample2]
    # set() drops samples that were added more than once with the same tag
    final_dict = {a:list(set(b)) for a, b in gene_dict.items()}
    for key, value in final_dict.items():
        genename = key
        num_samples = len([item for item in value if item])
        samples = value
        print(genename,num_samples,samples)
This gave consistent results on the few examples I tried.
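One case that tagging the name string does not cover: if the same sample showed up for the same gene in two rows, once HIGH and once not, 's2' and 's2 [HIGH]' would still both be kept. The question says this does not happen in the asker's data, so what follows is only a sketch under that extra assumption. It stores a HIGH flag per sample instead of tagging the name; the function names summarize and format_samples and the example rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def summarize(rows):
    """Map gene -> {sample: is_high}; a sample stays HIGH if any row reports HIGH."""
    genes = defaultdict(dict)
    for sample1, sample2, gene, impact1, impact2 in rows:
        for sample, impact in ((sample1, impact1), (sample2, impact2)):
            genes[gene][sample] = genes[gene].get(sample, False) or impact == "HIGH"
    return genes


def format_samples(flags):
    """Turn {sample: is_high} into the tagged names used in the question's output."""
    return [name + " [HIGH]" if high else name for name, high in flags.items()]


# Made-up rows in the question's layout; skipinitialspace handles the ", " separator
data = "s1, s2, gene1, MODERATE, HIGH\ns2, s3, gene1, MODERATE, MODERATE\n"
rows = csv.reader(io.StringIO(data), skipinitialspace=True)
for gene, flags in summarize(rows).items():
    samples = format_samples(flags)
    print(gene, len(samples), samples)
# gene1 3 ['s1', 's2 [HIGH]', 's3']
```

Because each sample name is a dictionary key, it can only appear once per gene, and the count is simply the number of keys.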
Answer 1 (score: 0)
I would create a class like this:
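    class Sample:
        def __init__(self, name, level="low"):
            self.level = level
            self.name = name
        # Two samples count as equal when their names match, whatever their level.
        # Defining __eq__ without __hash__ makes instances unhashable,
        # so keep them in lists rather than sets.
        def __eq__(self, equal):
            return equal.name == self.name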
And then something like this (untested):
    import os
    import sys
    path = ("path/to/csv")
    open_csv = open(path+"csvfile", "r")
    read_csv = open_csv.read().splitlines()
    gene_dict = {}
    for line in read_csv:
        split_lines = line.split(", ")
        gene = split_lines[2]
        sample1 = Sample(split_lines[0])
        sample2 = Sample(split_lines[1])
        impact1 = split_lines[3]
        impact2 = split_lines[4]
        # Store the impact on the sample object instead of in its name
        if impact1 == "HIGH":
            sample1.level = impact1
        if impact2 == "HIGH":
            sample2.level = impact2
        if gene not in gene_dict:
            gene_dict[gene] = []
        # "not in" uses Sample.__eq__, so each name is stored once per gene
        if sample1 not in gene_dict[gene]:
            gene_dict[gene].append(sample1)
        if sample2 not in gene_dict[gene]:
            gene_dict[gene].append(sample2)
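The snippet above stops before printing anything, so here is a self-contained sketch of how the summary might be produced from Sample objects. It runs on made-up rows instead of a file. The helper add_sample, the upgrade-to-HIGH behaviour, and the " [HIGH]" label formatting are my assumptions, not something the answer specifies.

```python
class Sample:
    def __init__(self, name, level="low"):
        self.name = name
        self.level = level

    def __eq__(self, other):
        # Equality by name only, so membership tests ignore the level
        return isinstance(other, Sample) and other.name == self.name


def add_sample(samples, name, impact):
    """Store each name once; upgrade to HIGH if any row says so (hypothetical helper)."""
    candidate = Sample(name, "HIGH" if impact == "HIGH" else "low")
    if candidate in samples:  # uses Sample.__eq__
        existing = samples[samples.index(candidate)]
        if candidate.level == "HIGH":
            existing.level = "HIGH"
    else:
        samples.append(candidate)


# Made-up rows in the question's column order
rows = [
    ["s1", "s2", "gene1", "MODERATE", "HIGH"],
    ["s2", "s3", "gene1", "MODERATE", "MODERATE"],
]
gene_dict = {}
for s1, s2, gene, impact1, impact2 in rows:
    samples = gene_dict.setdefault(gene, [])
    add_sample(samples, s1, impact1)
    add_sample(samples, s2, impact2)

for gene, samples in gene_dict.items():
    labels = [s.name + " [HIGH]" if s.level == "HIGH" else s.name for s in samples]
    print(gene, len(labels), labels)
# gene1 3 ['s1', 's2 [HIGH]', 's3']
```

The list membership test is linear in the number of samples per gene, which should be fine for small inputs; for large files a dictionary keyed by sample name would likely scale better.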