Matching values in nested dictionaries

Asked: 2014-06-25 11:23:09

Tags: python python-3.x dictionary list-comprehension

I have two dictionaries that contain nested sub-dictionaries. They are structured as follows:

search_regions = {
    'chr11:56694718-71838208': {'Chr': 'chr11', 'End': 71838208, 'Start': 56694718},
    'chr13:27185654-39682032': {'Chr': 'chr13', 'End': 39682032, 'Start': 27185654}
}

database_variants = {
    'chr11:56694718-56694718': {'Chr': 'chr11', 'End': 56694718, 'Start': 56694718},
    'chr13:27185659-27185659': {'Chr': 'chr13', 'End': 27185659, 'Start': 27185659}
}

I need to compare them and pull out the dictionaries in database_variants that fall within the range of a dictionary in search_regions.

I am building a function to do this (linked to a previous question). This is what I have so far:

def region_to_variant_location_match(search_Variants, database_Variants):
    '''Take dictionaries for search_Variants and database_Variants as input.
    Match variants in database_Variants to regions within search_Variants.
    Return matches as a nested dictionary.'''
    # Match on the Chr value,
    # where the Start value from database_Variants lies between the Start
    # and End values of a region in search_Variants.
    # Return the matches as a nested dictionary.

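The problem I am having is how to get at the values in the nested dictionaries (Chr, Start, End, etc.) in order to compare them. I would like to do this with a list comprehension, since I have quite a lot of data and a plain for loop may be slower.

Any help is much appreciated!

Update

I have tried to implement the solution suggested by bioinfoboy below. My first step was to convert the search_regions and database_variants dictionaries into defaultdict(list) objects using the following functions: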

def search_region_converter(searchDict):
    '''This function takes the dictionary of dictionaries and converts it to a
    defaultdict(list) to allow matching with the database in a corresponding
    format'''
    search_regions = defaultdict(list)
    for i in search_regions.keys():
        chromosome = i.split(":")[0]
        start = int(i.split(":")[1].split("-")[0])
        end = int(i.split(":")[1].split("-")[1])
        search_regions[chromosome].append((start, end))
    return search_regions #a list with chromosomes as keys 

def database_snps_converter(databaseDict):
    '''This function takes the dictionary of dictionaries and converts it to a
    defaultdict(list) to allow matching with the search regions in a
    corresponding format'''
    database_variants = defaultdict(list)
    for i in database_variants.keys():
        chromosome = i.split(":")[0]
        start = int(i.split(":")[1].split("-")[0])
        database_variants[chromosome].append(start)
    return database_variants #list of database variants 

I then created a matching function (again using bioinfoboy's code) as follows:

def region_to_variant_location_match(search_Regions, database_Variants):
    '''Take dictionaries for search_Regions and database_Variants as input.
    Match variants in database_Variants to regions within search_Regions.'''
    for key, values in database_Variants.items():
        for value in values:
            for search_area in search_Regions[key]:
                print(search_area)
                if (value >= search_area[0]) and (value <= search_area[1]):
                    yield(key, search_area)

However, the defaultdict converter functions return empty dictionaries, and I cannot work out what I need to change.

Any ideas?

2 Answers:

Answer 0: (score: 1)

You should do something like this:

def region_to_variant_location_match(search_Variants, database_Variants):
    '''Take dictionaries for search_Variants and database_Variants as input.
    Match variants in database_Variants to regions within search_Variants.
    Return matches as a nested dictionary.'''
    return {
        record[0]: record[1]
        for record, lookup in zip(
            database_Variants.items(),
            search_Variants.items()
        )
        if (
            record[1]['Chr'] == lookup[1]['Chr'] and 
            lookup[1]['Start'] <= record[1]['Start'] <= lookup[1]['End']
        )
    }

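Note that if you are using Python 2.7 or lower (rather than Python 3), you would use iteritems() instead of items() and itertools.izip() instead of zip; and if you are on 2.6 or earlier, you would need to pass a generator expression to dict() instead of using a dict comprehension.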
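
One caveat with the comprehension above: zip pairs the n-th variant only with the n-th region, so a variant is never compared against the other regions, and the result depends on both dictionaries having matching order and length. A minimal sketch that checks each variant against every region instead is shown below. The function name match_variants_to_regions is hypothetical, and the sample data is copied from the question.

```python
def match_variants_to_regions(search_regions, database_variants):
    """Return the entries of database_variants whose Start lies inside
    any region of search_regions on the same chromosome."""
    return {
        name: variant
        for name, variant in database_variants.items()
        if any(
            variant['Chr'] == region['Chr']
            and region['Start'] <= variant['Start'] <= region['End']
            for region in search_regions.values()
        )
    }


# Sample data from the question.
search_regions = {
    'chr11:56694718-71838208': {'Chr': 'chr11', 'End': 71838208, 'Start': 56694718},
    'chr13:27185654-39682032': {'Chr': 'chr13', 'End': 39682032, 'Start': 27185654},
}
database_variants = {
    'chr11:56694718-56694718': {'Chr': 'chr11', 'End': 56694718, 'Start': 56694718},
    'chr13:27185659-27185659': {'Chr': 'chr13', 'End': 27185659, 'Start': 27185659},
}

matches = match_variants_to_regions(search_regions, database_variants)
print(matches)
```

This compares every variant with every region, so with a lot of data it may be worth grouping the regions by chromosome first, as the other answer does.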

Answer 1: (score: 1)

I think this might help.

I converted your search_regions and database_variants according to what I mentioned in the comments:

from collections import defaultdict

# Map each chromosome to the list of variant start positions on it.
_database_variants = defaultdict(list)
for i in database_variants.keys():
    _chromosome = i.split(":")[0]
    _start = int(i.split(":")[1].split("-")[0])
    _database_variants[_chromosome].append(_start)

# Map each chromosome to the list of (start, end) search regions on it.
_search_regions = defaultdict(list)
for i in search_regions.keys():
    _chromosome = i.split(":")[0]
    _start = int(i.split(":")[1].split("-")[0])
    _end = int(i.split(":")[1].split("-")[1])
    _search_regions[_chromosome].append((_start, _end))

def _search(_database_variants, _search_regions):
    for key, values in _database_variants.items():
        for value in values:
            for search_area in _search_regions[key]:
                if (value >= search_area[0]) and (value <= search_area[1]):
                    yield(key, search_area)

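I have used yield, so the function returns a generator object that you can iterate over. Given the data you originally provided in the question, I get the following output.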

for i in _search(_database_variants, _search_regions):
    print(i)

The output is:

('chr11', (56694718, 71838208))
('chr13', (27185654, 39682032))

Isn't this what you are trying to achieve?
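
On the update in the question: the converter functions there most likely return empty results because each one creates a fresh defaultdict and then loops over that new, empty object rather than over its argument (searchDict or databaseDict). A sketch of a fix for the first converter follows, assuming the keys keep the 'chr:start-end' form shown in the question; the local name regions is a hypothetical choice made to avoid shadowing.

```python
from collections import defaultdict


def search_region_converter(searchDict):
    """Group (start, end) tuples by chromosome, parsed from the keys."""
    regions = defaultdict(list)
    for key in searchDict:  # iterate over the argument, not the new dict
        chromosome, span = key.split(":")
        start, end = (int(x) for x in span.split("-"))
        regions[chromosome].append((start, end))
    return regions


converted = search_region_converter({
    'chr11:56694718-71838208': {'Chr': 'chr11', 'End': 71838208, 'Start': 56694718},
    'chr13:27185654-39682032': {'Chr': 'chr13', 'End': 39682032, 'Start': 27185654},
})
print(dict(converted))
```

The same change (looping over databaseDict instead of the newly created defaultdict) should apply to database_snps_converter.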