I have two dictionaries that contain nested sub-dictionaries. They are structured as follows:
    search_regions = {
        'chr11:56694718-71838208': {'Chr': 'chr11', 'End': 71838208, 'Start': 56694718},
        'chr13:27185654-39682032': {'Chr': 'chr13', 'End': 39682032, 'Start': 27185654}
    }
    database_variants = {
        'chr11:56694718-56694718': {'Chr': 'chr11', 'End': 56694718, 'Start': 56694718},
        'chr13:27185659-27185659': {'Chr': 'chr13', 'End': 27185659, 'Start': 27185659}
    }
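I need to compare them and pull out the dictionaries in database_variants that fall within the range of a dictionary in search_regions.
I am building a function to do this (linked to a previous question). This is what I have so far: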
    def region_to_variant_location_match(search_Variants, database_Variants):
        '''Take dictionaries for search_Variants and database_Variants as input.
        Match variants in database_Variants to regions within search_Variants.
        Return matches as a nested dictionary.'''
        # Match on Chr value
        # Where Start value from database_variant is between St and End values in search_variants.
        # Return as nested dictionary
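The problem I am having is how to get at the values in the nested dictionaries (Chr, St, End, etc.) so that I can compare them. I was hoping to do this with a list comprehension, because I have a fair amount of data and a plain for loop might take longer.
Any help is much appreciated!
UPDATE
I have tried to implement the solution suggested by bioinfoboy below. My first step was to convert the search_regions and database_variants dictionaries to defaultdict(list) using the following functions: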
    def search_region_converter(searchDict):
        '''This function takes the dictionary of dictionaries and converts it to a
        defaultdict(list) to allow matching
        with the database in a corresponding format'''
        search_regions = defaultdict(list)
        for i in search_regions.keys():
            chromosome = i.split(":")[0]
            start = int(i.split(":")[1].split("-")[0])
            end = int(i.split(":")[1].split("-")[1])
            search_regions[chromosome].append((start, end))
        return search_regions  # a list with chromosomes as keys

    def database_snps_converter(databaseDict):
        '''This function takes the dictionary of dictionaries and converts it to a
        defaultdict(list) to allow matching
        with the search_snps in a corresponding format'''
        database_variants = defaultdict(list)
        for i in database_variants.keys():
            chromosome = i.split(":")[0]
            start = int(i.split(":")[1].split("-")[0])
            database_variants[chromosome].append(start)
        return database_variants  # list of database variants
I then created a matching function (again using bioinfoboy's code), as follows:
    def region_to_variant_location_match(search_Regions, database_Variants):
        '''Take dictionaries for search_Variants and database_Variants as input.
        Match variants in database_Variants to regions within search_Variants.'''
        for key, values in database_Variants.items():
            for value in values:
                for search_area in search_Regions[key]:
                    print(search_area)
                    if (value >= search_area[0]) and (value <= search_area[1]):
                        yield (key, search_area)
However, the defaultdict functions return empty dictionaries, and I cannot work out what I need to change.
Any ideas?
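
One likely cause, judging only from the code shown above: each converter creates a new, empty defaultdict and then loops over that same empty object instead of over its argument (searchDict or databaseDict), so the loop body never runs. A minimal sketch of the converters that iterate over the argument instead follows; the parameter and local variable names are illustrative, not taken from the original post.

```python
from collections import defaultdict

def search_region_converter(search_dict):
    """Group (start, end) tuples by chromosome, parsed from 'chr:start-end' keys."""
    regions = defaultdict(list)
    for key in search_dict:  # iterate over the argument, not the new defaultdict
        chromosome, span = key.split(":")
        start, end = (int(x) for x in span.split("-"))
        regions[chromosome].append((start, end))
    return regions

def database_snps_converter(database_dict):
    """Group variant start positions by chromosome."""
    variants = defaultdict(list)
    for key in database_dict:  # again, loop over the input dictionary
        chromosome, span = key.split(":")
        variants[chromosome].append(int(span.split("-")[0]))
    return variants

# Sample data copied from the question
search_regions = {
    'chr11:56694718-71838208': {'Chr': 'chr11', 'End': 71838208, 'Start': 56694718},
    'chr13:27185654-39682032': {'Chr': 'chr13', 'End': 39682032, 'Start': 27185654},
}
database_variants = {
    'chr11:56694718-56694718': {'Chr': 'chr11', 'End': 56694718, 'Start': 56694718},
    'chr13:27185659-27185659': {'Chr': 'chr13', 'End': 27185659, 'Start': 27185659},
}

converted_regions = search_region_converter(search_regions)
converted_variants = database_snps_converter(database_variants)
```

With the sample data from the question, both results should now contain one entry per chromosome rather than being empty.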
Answer 0: (score: 1)
You should do something like this:
    def region_to_variant_location_match(search_Variants, database_Variants):
        '''Take dictionaries for search_Variants and database_Variants as input.
        Match variants in database_Variants to regions within search_Variants.
        Return matches as a nested dictionary.'''
        return {
            record[0]: record[1]
            for record, lookup in zip(
                database_Variants.items(),
                search_Variants.items()
            )
            if (
                record[1]['Chr'] == lookup[1]['Chr'] and
                lookup[1]['Start'] <= record[1]['Start'] <= lookup[1]['End']
            )
        }
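Note that if you are using Python 2.7 or lower (rather than Python 3), you would use iteritems() instead of items() and itertools.izip() instead of zip, and if you are using a version older than 2.7 you will need to switch to a generator expression passed to dict() instead of the dict comprehension.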
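
One caveat worth spelling out: zip pairs the first variant with the first region, the second with the second, and so on, so a variant is only ever compared against the region at the same position in the other dictionary. If every variant should be checked against every region, a variation along the following lines may be closer to what is needed. This is a sketch rather than the answerer's code: the function name variants_in_regions and the third, non-matching variant are made up for illustration.

```python
def variants_in_regions(search_regions, database_variants):
    """Return the variants whose Start lies inside any region on the same chromosome."""
    return {
        key: variant
        for key, variant in database_variants.items()
        if any(
            variant['Chr'] == region['Chr']
            and region['Start'] <= variant['Start'] <= region['End']
            for region in search_regions.values()
        )
    }

search_regions = {
    'chr11:56694718-71838208': {'Chr': 'chr11', 'End': 71838208, 'Start': 56694718},
    'chr13:27185654-39682032': {'Chr': 'chr13', 'End': 39682032, 'Start': 27185654},
}
database_variants = {
    'chr11:56694718-56694718': {'Chr': 'chr11', 'End': 56694718, 'Start': 56694718},
    'chr13:27185659-27185659': {'Chr': 'chr13', 'End': 27185659, 'Start': 27185659},
    # Hypothetical variant outside every region, added only to show filtering
    'chr13:50000000-50000000': {'Chr': 'chr13', 'End': 50000000, 'Start': 50000000},
}

matches = variants_in_regions(search_regions, database_variants)
```

This compares each variant with every region, so the cost grows with the product of the two dictionary sizes; grouping regions by chromosome first, as the other answer does, reduces the number of comparisons.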
Answer 1: (score: 1)
I think this might help.
I have converted your search_regions and database_variants along the lines of what I mentioned in the comments.
    from collections import defaultdict

    _database_variants = defaultdict(list)
    _search_regions = defaultdict(list)

    # Group variant start positions by chromosome
    for i in database_variants.keys():
        _chromosome = i.split(":")[0]
        _start = int(i.split(":")[1].split("-")[0])
        _database_variants[_chromosome].append(_start)

    # Group (start, end) region tuples by chromosome
    for i in search_regions.keys():
        _chromosome = i.split(":")[0]
        _start = int(i.split(":")[1].split("-")[0])
        _end = int(i.split(":")[1].split("-")[1])
        _search_regions[_chromosome].append((_start, _end))

    def _search(_database_variants, _search_regions):
        for key, values in _database_variants.items():
            for value in values:
                for search_area in _search_regions[key]:
                    if (value >= search_area[0]) and (value <= search_area[1]):
                        yield (key, search_area)
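I have used yield, so this returns a generator object that you can iterate over. With the data you originally provided in the question, I get the following output.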
    for i in _search(_database_variants, _search_regions):
        print(i)
The output is as follows:
    ('chr11', (56694718, 71838208))
    ('chr13', (27185654, 39682032))
Is this not what you are trying to achieve?
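
The generator above yields the chromosome and the matching region but not the variant position itself. If the position is also needed, one possible variation is to yield it too and then collect the hits per region. The sketch below assumes the inputs are already grouped by chromosome as in this answer; find_hits and hits_by_region are illustrative names, not part of the original code.

```python
from collections import defaultdict

def find_hits(variants_by_chr, regions_by_chr):
    """Yield (chromosome, position, (start, end)) for each variant inside a region."""
    for chrom, positions in variants_by_chr.items():
        for pos in positions:
            # .get() avoids creating empty entries when a chromosome has no regions
            for start, end in regions_by_chr.get(chrom, []):
                if start <= pos <= end:
                    yield chrom, pos, (start, end)

# Data from the question, already grouped by chromosome
variants_by_chr = {'chr11': [56694718], 'chr13': [27185659]}
regions_by_chr = {
    'chr11': [(56694718, 71838208)],
    'chr13': [(27185654, 39682032)],
}

# Collect variant positions under a (chromosome, start, end) key
hits_by_region = defaultdict(list)
for chrom, pos, (start, end) in find_hits(variants_by_chr, regions_by_chr):
    hits_by_region[(chrom, start, end)].append(pos)
```

Because find_hits is a generator, nothing is computed until the for loop consumes it, which keeps memory use low when the variant list is large.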