Creating a dictionary without duplicates

Date: 2019-01-22 18:47:00

Tags: python dictionary

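I am trying to create a dictionary from a CSV file without duplicates. The CSV file contains: two sample names (s1, s2, etc.), a gene name, the impact of the mutation in sample 1, and the impact of the mutation in sample 2. Here are two example rows from the CSV file: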

s1, s2, gene1, MODERATE, HIGH
s3, s4, gene2, HIGH, MODERATE

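My goal is to get a summary of how many samples carry a mutation in a given gene, and to mark the samples in which that mutation has a HIGH impact.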

For example:

gene12  7   ['s1', 's3', 's4', 's10 [HIGH]', 's17', 's19', 's24 [HIGH]']
gene20  2   ['s10 [HIGH]', 's21']

My code currently looks like this:

import os
import sys

path = ("path/to/csv")
open_csv = open(path+"csvfile", "r")
read_csv = open_csv.read().splitlines()
gene_dict = {}
for line in read_csv:
    split_lines = line.split(", ")
    gene = split_lines[2]
    sample1 = split_lines[0]
    sample2 = split_lines[1]
    impact1 = split_lines[3]
    impact2 = split_lines[4]
    for i in range(0, len(read_csv)):
        if gene in gene_dict:
            if impact1 == "HIGH":
                gene_dict[gene].append(sample1+" [HIGH]")
            if impact2 == "HIGH":
                gene_dict[gene].append(sample2+" [HIGH]")
            else:
                gene_dict[gene].append(sample1)
                gene_dict[gene].append(sample2)
        else:
            gene_dict[gene] = [sample1]

final_dict = {a:list(set(b)) for a, b in gene_dict.items()}

for key, value in final_dict.items():
    genename = key
    num_samples = len([item for item in value if item])
    samples = value     
    print(genename,num_samples,samples)

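My script works, except that I get duplicate samples. That is, if a gene has a high-impact mutation in a sample, the final summary lists that sample twice. Here is an example of what I mean: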

gene12  8   ['s1', 's3', 's4', 's10 [HIGH]', 's17', 's19', 's24', 's24 [HIGH]']
gene20  3   ['s10', 's10 [HIGH]', 's21']

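The duplicates are probably caused by the way I build the dictionary, but I can't figure it out. You can see that for gene12, s24 is listed twice, which throws off the count. The same goes for gene20 with s10. The sample is listed twice: once correctly, with the high-impact mutation, and once without it. However, s24 only has a HIGH impact mutation in gene12, and s10 only has a HIGH impact mutation in gene20. I hope this makes sense; I can clarify if needed. Thanks in advance for any help!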

2 answers:

Answer 0 (score: 2)

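It looks like your inner loop, for i in range(0, len(read_csv)):, repeats the same work once per row of the file and adds useless entries. The if / if / else structure that adds the [HIGH] tag also looks broken: the else belongs only to the second if, so when impact1 is HIGH and impact2 is not, sample1 is added both with and without the tag.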
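To make the duplication concrete, here is a minimal sketch that isolates the question's if / if / else branch for a single row. The helper name buggy_entries is hypothetical and does not appear in the question; it only mimics the branch logic.

```python
def buggy_entries(sample1, impact1, sample2, impact2):
    """Mimic the question's if / if / else branch for a single row."""
    entries = []
    if impact1 == "HIGH":
        entries.append(sample1 + " [HIGH]")
    if impact2 == "HIGH":
        entries.append(sample2 + " [HIGH]")
    else:
        # This else pairs only with the second if, so it also runs
        # when sample1 was already added with its [HIGH] tag.
        entries.append(sample1)
        entries.append(sample2)
    return entries


# sample1 is HIGH, sample2 is not: s24 ends up listed twice.
print(buggy_entries("s24", "HIGH", "s19", "MODERATE"))
# prints ['s24 [HIGH]', 's24', 's19']
```

Because 's24' and 's24 [HIGH]' are different strings, a later set() cannot merge them.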

Corrected version:

path = ("path/to/csv")
# Read all lines up front; the context manager closes the file afterwards.
with open(path+"csvfile", "r") as open_csv:
    read_csv = open_csv.read().splitlines()
gene_dict = {}
for line in read_csv:
    split_lines = line.split(", ")
    gene = split_lines[2]
    sample1 = split_lines[0]
    sample2 = split_lines[1]
    impact1 = split_lines[3]
    impact2 = split_lines[4]
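    # Tag each sample on its own, so a HIGH impact on one sample
    # never affects how the other sample is stored.
    if impact1 == "HIGH":
        sample1 = sample1 + " [HIGH]"
    if impact2 == "HIGH":
        sample2 = sample2 + " [HIGH]"

    # Each row is handled exactly once: no inner loop over read_csv.
    if gene in gene_dict:
        gene_dict[gene].append(sample1)
        gene_dict[gene].append(sample2)
    else:
        gene_dict[gene] = [sample1, sample2]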

final_dict = {a:list(set(b)) for a, b in gene_dict.items()}

for key, value in final_dict.items():
    genename = key
    num_samples = len([item for item in value if item])
    samples = value     
    print(genename,num_samples,samples)

This looks consistent for the few examples I tried.
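One case that set-based deduplication would still not merge is a sample that appears for the same gene in several rows, once with HIGH and once without. The question says this does not happen in the data, so the following is only a sketch for the case where that assumption is relaxed. It keeps one flag per sample in a nested dictionary; the function name summarize and the inline rows are hypothetical.

```python
from collections import defaultdict


def summarize(lines):
    """Map each gene to its samples, tagging a HIGH sample exactly once."""
    # gene -> {sample name -> True if any row reported HIGH for it}
    high_by_gene = defaultdict(dict)
    for line in lines:
        # Assumes exactly five ", "-separated fields per row.
        sample1, sample2, gene, impact1, impact2 = line.split(", ")
        for sample, impact in ((sample1, impact1), (sample2, impact2)):
            seen = high_by_gene[gene]
            seen[sample] = seen.get(sample, False) or impact == "HIGH"
    # Samples are sorted by name as plain strings.
    return {
        gene: [name + " [HIGH]" if is_high else name
               for name, is_high in sorted(samples.items())]
        for gene, samples in high_by_gene.items()
    }


rows = [
    "s1, s2, gene1, MODERATE, HIGH",
    "s2, s3, gene1, MODERATE, MODERATE",
]
for gene, samples in summarize(rows).items():
    print(gene, len(samples), samples)
# prints: gene1 3 ['s1', 's2 [HIGH]', 's3']
```

Keying on the bare sample name means the count is the number of distinct samples, regardless of how many rows mention them.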

Answer 1 (score: 0)

I would create a class like this:

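class Sample:
    def __init__(self, name, level="low"):
        self.level = level
        self.name = name

    def __eq__(self, other):
        # Two samples are equal when their names match, whatever their level.
        if not isinstance(other, Sample):
            return NotImplemented
        return other.name == self.name

    def __hash__(self):
        # Keep hashing consistent with __eq__ so samples work in sets and dicts.
        return hash(self.name)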
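
The same idea, equality by name only, can also be sketched with a dataclass. This is an alternative formulation rather than the answer's code, and the class name SampleRecord is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SampleRecord:
    name: str
    # compare=False leaves level out of the generated __eq__.
    level: str = field(default="low", compare=False)


seen = [SampleRecord("s24", level="HIGH")]
# Same name, different level: still counts as already present.
print(SampleRecord("s24") in seen)   # prints True
print(SampleRecord("s10") in seen)   # prints False
```

Membership tests on a list use ==, so a sample is found by name even when its level differs.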

Something like this (untested):

path = ("path/to/csv")
with open(path+"csvfile", "r") as open_csv:
    read_csv = open_csv.read().splitlines()
gene_dict = {}
for line in read_csv:
    split_lines = line.split(", ")
    gene = split_lines[2]
    sample1 = Sample(split_lines[0])
    sample2 = Sample(split_lines[1])
    impact1 = split_lines[3]
    impact2 = split_lines[4]
    # Record the impact on the sample itself instead of in its name.
    if impact1 == "HIGH":
        sample1.level = impact1
    if impact2 == "HIGH":
        sample2.level = impact2
    # Create the gene's list on first sight, then add each sample at most
    # once; `in` relies on Sample.__eq__, which compares names only,
    # so the first occurrence of a sample wins.
    samples = gene_dict.setdefault(gene, [])
    for sample in (sample1, sample2):
        if sample not in samples:
            samples.append(sample)
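
The answer stops before the output step. As a minimal sketch of printing the summary in the question's format, assume gene_dict maps gene names to lists of objects with name and level attributes. SimpleNamespace stands in for the Sample class here, and the helper format_samples is hypothetical.

```python
from types import SimpleNamespace


def format_samples(samples):
    """Render each sample as its name, with a [HIGH] tag where it applies."""
    return [s.name + " [HIGH]" if s.level == "HIGH" else s.name
            for s in samples]


# Stand-in data shaped like the gene_dict built above.
gene_dict = {
    "gene20": [SimpleNamespace(name="s10", level="HIGH"),
               SimpleNamespace(name="s21", level="low")],
}
for gene, samples in gene_dict.items():
    print(gene, len(samples), format_samples(samples))
# prints: gene20 2 ['s10 [HIGH]', 's21']
```

Keeping the level as an attribute and adding the tag only at print time means the count never depends on how a sample is labeled.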