I have a dictionary of dictionaries with entries like this:
all={
1:{ ('a',123,145):20, ('a',155,170):12, ('b',234,345): 34},
2:{ ('a',121,135):10, ('a',155,175):28, ('b',230,345): 16},
3:{ ('a',130,140):20, ('a',150,170):10, ('b',234,345): 30},
...
n: {...}
}
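Edit: the dictionary names (the outer keys) are arbitrary. I assign them based on the name of the file the initial data was read from, so I can use whatever values I like.
I want to get the sum of these values for each overlapping region. The output, showing how the overlaps should look, is this: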
{ ('a',121,122):10, ('a',123,130):30, ('a',131,135):50,
('a',136,140):40,('a',141,145):20, ...}
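Edit: within each dictionary the intervals never overlap, so a single dictionary will never contain both ('a',2,10) and ('a',3,12). Intervals from different dictionaries do overlap, though, because their start and end positions differ (that is, the keys are not the same from one dictionary to the next).
I don't have to use a dictionary. I built this structure myself, so if this is easier with lists, sets, or something else, I can load the data into one of those instead and use a solution based on a different data structure.
Thanks for your help.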
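Answer 0 (score: 1)
OK, now I think I understand: you basically have a bunch of overlapping intervals, which you can picture as bars of a certain thickness at certain positions. Draw those bars one beneath the other and read off the total thickness at any given point.
I think the easiest and fastest way is to exploit the fact that your positions are integers: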
all={
1:{ ('a',123,145):20, ('a',155,170):12, ('b',234,345): 34},
2:{ ('a',121,135):10, ('a',155,175):28, ('b',230,345): 16},
3:{ ('a',130,140):20, ('a',150,170):10, ('b',234,345): 30}
}
from collections import defaultdict

summer = defaultdict(int)
mini, maxi = None, None
for d in all.values():
    for (name, start, stop), value in d.items():
        # I'm completely ignoring the `name` here, not sure if that's what you want;
        # otherwise just separate the data by name before doing this ...
        mini = start if mini is None else min(mini, start)
        maxi = stop if maxi is None else max(maxi, stop)
        for i in range(start, stop + 1):
            summer[i] += value

# now we have the value at each point: very redundant, but very fast so far
print(summer)

# now we can find the intervals:
def get_intervals(points, start, stop):
    cstart = start
    for i in range(start, stop + 1):
        if points[cstart] != points[i]:  # did the value change?
            yield cstart, i - 1, points[cstart]
            cstart = i
    # emit the final run, even if it is a single point
    yield cstart, stop, points[cstart]

print(list(get_intervals(summer, mini, maxi)))
Using only the 'a' items, this gives:
[(121, 122, 10), (123, 129, 30), (130, 135, 50), (136, 140, 40), (141, 145, 20), (146, 149, 0), (150, 154, 10), (155, 170, 50), (171, 175, 28)]
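The per-position idea can also be applied to each name separately instead of ignoring it. The following is only a minimal sketch under a few assumptions: `data` and `summed_intervals` are hypothetical names introduced here (with `data` used to avoid shadowing the built-in `all`), endpoints are inclusive integers, and the ranges are small enough that visiting every position is acceptable.

```python
from collections import defaultdict

# Sample data copied from the question; `data` is a hypothetical name
# chosen here to avoid shadowing the built-in `all`.
data = {
    1: {('a', 123, 145): 20, ('a', 155, 170): 12, ('b', 234, 345): 34},
    2: {('a', 121, 135): 10, ('a', 155, 175): 28, ('b', 230, 345): 16},
    3: {('a', 130, 140): 20, ('a', 150, 170): 10, ('b', 234, 345): 30},
}

def summed_intervals(dicts):
    """Return {name: [(start, stop, total), ...]} with inclusive endpoints."""
    # totals[name][position] -> summed value at that integer position
    totals = defaultdict(lambda: defaultdict(int))
    for d in dicts:
        for (name, start, stop), value in d.items():
            for pos in range(start, stop + 1):
                totals[name][pos] += value

    result = {}
    for name, points in totals.items():
        lo, hi = min(points), max(points)
        runs = []
        run_start = lo
        for pos in range(lo + 1, hi + 1):
            # .get() reads gaps as 0 without inserting them into the defaultdict
            if points.get(pos, 0) != points.get(run_start, 0):
                runs.append((run_start, pos - 1, points.get(run_start, 0)))
                run_start = pos
        # close the final run, which always ends at the highest position
        runs.append((run_start, hi, points.get(run_start, 0)))
        result[name] = runs
    return result

result = summed_intervals(data.values())
```

For the sample data, `result['a']` is the same list as the output shown above, and `result['b']` holds the runs for the 'b' items on their own.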
from collections import defaultdict
from heapq import heappush, heappop

class Summer(object):
    def __init__(self):
        # it's a priority queue, kind of like a sorted list
        self.hq = []

    def additem(self, start, stop, value):
        # at `start` add it as a positive value
        heappush(self.hq, (start, value))
        # at `stop` subtract that value again
        heappush(self.hq, (stop, -value))

    def intervals(self):
        # note: consecutive intervals share their boundary point, and two
        # events at the same position produce a zero-length interval
        hq = self.hq
        start, val = heappop(hq)
        while hq:
            point, value = heappop(hq)
            yield start, point, val
            # just maintain the current value and where the interval started
            val += value
            start = point
        assert val == 0

summers = defaultdict(Summer)
for d in all.values():
    for (name, start, stop), value in d.items():
        summers[name].additem(start, stop, value)

for name, s in summers.items():
    print(name, list(s.intervals()))
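The heap version subtracts each value at `stop` itself, so neighbouring intervals in its output share a boundary point. If the endpoints are meant to be inclusive integers, as the expected output in the question suggests, one possible adaptation is to switch each value off at `stop + 1`. The sketch below is not the code above: it sorts a dictionary of change points instead of using a heap, and `sweep` is a hypothetical helper name.

```python
from collections import defaultdict

def sweep(intervals):
    """intervals: iterable of (start, stop, value) with inclusive integer
    endpoints. Returns non-overlapping (start, stop, total) runs."""
    # delta[p] is the net change of the running total at position p
    delta = defaultdict(int)
    for start, stop, value in intervals:
        delta[start] += value      # value switches on at `start`
        delta[stop + 1] -= value   # and off just past the inclusive `stop`

    points = sorted(delta)
    out = []
    total = 0
    for here, nxt in zip(points, points[1:]):
        total += delta[here]
        if out and out[-1][2] == total:
            # the total did not actually change, so extend the previous run
            out[-1] = (out[-1][0], nxt - 1, total)
        else:
            out.append((here, nxt - 1, total))
    return out

# the 'a' items from the question
a_items = [(123, 145, 20), (155, 170, 12), (121, 135, 10),
           (155, 175, 28), (130, 140, 20), (150, 170, 10)]
runs = sweep(a_items)
```

The work here grows with the number of intervals rather than with the size of the position range, which is the main reason to prefer a sweep over per-position totals when positions are large.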
Answer 1 (score: 0)
OK, if these are chromosomes, let's start by separating them out:
{"Chr1": {(121,122):10, (123,130):30, ...},
"Chr2": {(230,233):16, ...},
...
}
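The numbers you are adding up are, I assume, some kind of score: an expression score or something similar.
If the range of positions (the 121, 130 numbers that define the intervals) is small enough, say anything up to a few thousand, you can probably save yourself a headache by storing a running total for every position and simply adding each interval's score to every position inside that interval.
If they are more like individual base positions, with millions of possible positions, then you need to stick with intervals. For each interval you add, you then have to find the intervals on the relevant chromosome that overlap it, remove them, and break them up into smaller intervals that store each distinct summed score.
Here is a rough skeleton, but it is not complete: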
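for (start, end), score in intervals_to_add.items():
    # collect the stored intervals that overlap the new one
    overlapping = {}
    for (start1, end1), score1 in current_chromosome.items():
        # two inclusive intervals overlap when each starts before the other ends;
        # this also catches a new interval that fully contains a stored one
        if start1 <= end and start <= end1:
            overlapping[(start1, end1)] = score1
    for interval in overlapping:
        current_chromosome.pop(interval)
    # Process overlapping into smaller intervals, adding in the current interval
    # (left unimplemented here: this step must build `new_intervals`)
    current_chromosome.update(new_intervals)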
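One way the missing "process overlapping into smaller intervals" step could be filled in is sketched below. It is an assumption about what the skeleton intends, not the answerer's code: `add_interval` and `chromosome` are hypothetical names, endpoints are taken to be inclusive integers, and the stored intervals are assumed never to overlap each other.

```python
def add_interval(current, start, end, score):
    """Add the inclusive interval [start, end] with `score` to `current`,
    a dict {(start, end): score} of non-overlapping inclusive intervals."""
    # pull out every stored interval that overlaps the new one
    overlapping = {
        (s1, e1): sc1
        for (s1, e1), sc1 in current.items()
        if s1 <= end and start <= e1
    }
    for key in overlapping:
        current.pop(key)

    # every position where the total can change: each interval start and
    # the position just past each inclusive end
    cuts = {start, end + 1}
    for s1, e1 in overlapping:
        cuts.update((s1, e1 + 1))
    cuts = sorted(cuts)

    new_intervals = {}
    for a, b in zip(cuts, cuts[1:]):
        lo, hi = a, b - 1  # the piece between two neighbouring cuts
        total = 0
        covered = False
        if start <= lo and hi <= end:
            total += score
            covered = True
        for (s1, e1), sc1 in overlapping.items():
            if s1 <= lo and hi <= e1:
                total += sc1
                covered = True
        if covered:
            new_intervals[(lo, hi)] = total
    current.update(new_intervals)

# three of the 'a' intervals from the question, added one at a time
chromosome = {}
for (s, e), sc in [((123, 145), 20), ((121, 135), 10), ((130, 140), 20)]:
    add_interval(chromosome, s, e, sc)
```

Each call scans every stored interval, so this is linear per insertion. With millions of intervals a sorted structure would likely be needed to find the overlaps faster.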