Python - Comparing sublists in two different lists

Time: 2013-12-07 22:04:55

Tags: python list comparison sublist

I have two lists, each containing sublists of the form [chromosome, start_position, end_position]:

expos_list = [['1', '10', '30'], ['1', '50', '80'], ['1', '100', '200']]  
pos_list = [['1', '12', '25'], ['1', '90', '98'], ['1', '130', '180'], ['2', '25', '50']]  

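I want to compare the sublists in 'pos_list' against those in 'expos_list', and add a 'pos_list' element to 'expos_list' if it is unique and/or not contained within another expos_list element. So I'd like my final output to be: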

expos_list = [['1', '10', '30'], ['1', '50', '80'], ['1', '90', '98'], ['1', '100', '200'], ['2', '25', '50']]  

...because there should only be unique position ranges for each particular chromosome (chromosome = sublist[0]).

I tried:

for expos_element in expos_list:
    for pos_element in pos_list:
        if pos_element[0] == expos_element[0]:
            if pos_element[1] < expos_element[1]:
                if pos_element[2] < expos_element[1]:
                    print("New")
                elif pos_element[2] < expos_element[2]:
                    print("Overlapping at 3'")
                else:
                    print("Discard")
            elif expos_element[1] <= pos_element[1] < expos_element[2]:
                if pos_element[2] <= expos_element[2]:
                    print("Discard")
                else:
                    print("Overlapping at 5'")
            else:
                print("Hit is 3' of current existing element. Move on")
        else:
            print("Different chromosome")

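This obviously doesn't do the appending to the list yet; it only reports whether elements overlap. It does that, but it compares every element against every other element, giving this output: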

Discard
Hit is 3' of current existing element. Move on
Discard
Different chromosome
New
Hit is 3' of current existing element. Move on
New
Different chromosome
Overlapping at 5'
Hit is 3' of current existing element. Move on
Discard
Different chromosome  

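That is 12 lines of output (instead of the one line I want for each sublist in pos_list). I'm really struggling to get this working. I guess the ideal output for the run above would be: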

Discard
New
Discard
Different chromosome  

Any help is much appreciated. Thanks!
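A side note on the attempt above: the positions are stored as strings, so `<` and `<=` compare them character by character rather than numerically. That appears to be why ['1', '130', '180'] is reported as "Discard" against ['1', '10', '30'] in the first block of output. A minimal sketch of the difference follows; `to_ints` is a hypothetical helper name, not something from the question.

```python
# Positions as they appear in the question: strings, not numbers
print('130' < '30')            # True: compared character by character
print(int('130') < int('30'))  # False: compared as numbers

def to_ints(rows):
    """Hypothetical helper: [chrom, start, end] strings -> (chrom, int, int)."""
    return [(chrom, int(start), int(end)) for chrom, start, end in rows]

pos_list = [['1', '12', '25'], ['1', '90', '98'],
            ['1', '130', '180'], ['2', '25', '50']]
print(to_ints(pos_list))
```

Converting once up front, as sketched here, keeps every later comparison numeric.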

3 answers:

Answer 0: (score: 1)

If you're not interested in how each item overlaps, you can simplify the code down to three cases (discard, new, different):

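new_items = []
for item in pos_list:
    # Existing ranges on the same chromosome as this item
    same_chrom = [x for x in expos_list if x[0] == item[0]]
    # Compare positions as numbers; as strings, '130' < '30' is True
    start, end = int(item[1]), int(item[2])
    if not same_chrom:
        print("Different chromosome")
        new_items.append(item)
    # Two closed ranges overlap when each starts no later than the other ends
    elif any(int(x[1]) <= end and start <= int(x[2]) for x in same_chrom):
        print("Discard")
    else:
        print("New")
        new_items.append(item)
expos_list.extend(new_items)
print(expos_list)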

When I run it, I see:

Discard
New
Discard
Different chromosome
[['1', '10', '30'], ['1', '50', '80'], ['1', '100', '200'], ['1', '90', '98'], ['2', '25', '50']]
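The printed list has the new items appended at the end, whereas the question's desired output keeps each chromosome's ranges in position order. A minimal sketch of one way to restore that order, assuming every chromosome label and position is a numeric string (a label such as 'X' would need a different key); `sort_key` is a name made up for this example:

```python
def sort_key(row):
    # Assumes numeric chromosome labels and numeric positions
    chrom, start, end = row
    return (int(chrom), int(start), int(end))

# The list as printed above, new items at the end
merged = [['1', '10', '30'], ['1', '50', '80'], ['1', '100', '200'],
          ['1', '90', '98'], ['2', '25', '50']]
merged.sort(key=sort_key)
print(merged)
# [['1', '10', '30'], ['1', '50', '80'], ['1', '90', '98'], ['1', '100', '200'], ['2', '25', '50']]
```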

Answer 1: (score: 1)

I split it up a bit; try

from collections import defaultdict
from bisect import bisect_left

class ChromoSegments:
    def __init__(self, cs=None):
        # A list of [(start, end), (start, end), ...] per chromosome;
        #   each list is kept sorted in ascending order
        self.segments = defaultdict(list)

        # Add segments from parameter list
        if cs is not None:
            for chromo,start,end in cs:
                try:
                    self.add_seg(chromo, start, end)
                except ValueError:
                    pass

    def add_seg(self, chromo, start, end):
        seg = self.segments[chromo]
        val = (start, end)
        ndx = bisect_left(seg, val)
        if (ndx == 0 or seg[ndx - 1][1] < start):
            if (ndx == len(seg) or end < seg[ndx][0]):
                seg.insert(ndx, val)
            else:
                # collision with following element
                nstart, nend = seg[ndx]
                raise ValueError('Discard ({}, {}, {}): collision with ({}, {}, {})'.format(chromo, start, end, chromo, nstart, nend))
        else:
            # collision with preceding element
            nstart, nend = seg[ndx - 1]
            raise ValueError('Discard ({}, {}, {}): collision with ({}, {}, {})'.format(chromo, start, end, chromo, nstart, nend))

    def to_list(self):
        keys = sorted(self.segments.keys())
        return [(k, s, e) for k in keys for s,e in self.segments[k]]

def main():
    expos = ChromoSegments([[1, 10, 30], [1, 50, 80], [1, 100, 200]])
    pos = [[1, 12, 25], [1, 90, 98], [1, 130, 180], [2, 25, 50]]

    target_chromo = 1
    for seg in pos:
        if seg[0] != target_chromo:
            print('Different chromosome')
        else:
            try:
                expos.add_seg(*seg)
                print('New')
            except ValueError as e:
                print(e)

    print('\nResult: {}'.format(expos.to_list()))

if __name__ == "__main__":
    main()

which produces

Discard (1, 12, 25): collision with (1, 10, 30)
New
Discard (1, 130, 180): collision with (1, 100, 200)
Different chromosome

Result: [(1, 10, 30), (1, 50, 80), (1, 90, 98), (1, 100, 200)]

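Note that I wrote the class to handle multiple chromosomes properly; the "Different chromosome" warning has to be handled separately in main().

Edit

If you want to process multiple chromosomes, main() can be simplified like this: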

def main():
    expos = ChromoSegments([[1, 10, 30], [1, 50, 80], [1, 100, 200]])
    pos = [[1, 12, 25], [1, 90, 98], [1, 130, 180], [2, 25, 50]]

    for seg in pos:
        try:
            expos.add_seg(*seg)
            print('New')
        except ValueError as e:
            print(e)

    print('\nResult: {}'.format(expos.to_list()))

and the output becomes

Discard (1, 12, 25): collision with (1, 10, 30)
New
Discard (1, 130, 180): collision with (1, 100, 200)
New

Result: [(1, 10, 30), (1, 50, 80), (1, 90, 98), (1, 100, 200), (2, 25, 50)]
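For readers unfamiliar with bisect: add_seg only has to look at two neighbours because each per-chromosome list is kept sorted and free of overlaps. The sketch below shows which neighbours bisect_left picks out for the three chromosome-1 segments from the example; `neighbours` is a hypothetical helper written for illustration, not part of the class above.

```python
from bisect import bisect_left

seg = [(10, 30), (50, 80), (100, 200)]  # sorted, non-overlapping

def neighbours(seg, start, end):
    """Hypothetical helper: insertion index plus the ranges on either side of it."""
    ndx = bisect_left(seg, (start, end))
    before = seg[ndx - 1] if ndx > 0 else None
    after = seg[ndx] if ndx < len(seg) else None
    return ndx, before, after

print(neighbours(seg, 90, 98))    # (2, (50, 80), (100, 200))
print(neighbours(seg, 12, 25))    # (1, (10, 30), (50, 80))
print(neighbours(seg, 130, 180))  # (3, (100, 200), None)
```

A new segment fits only if the range before it ends below its start and the range after it begins above its end, which is the pair of checks add_seg performs.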

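Answer 2: (score: 0)

This one builds a range from the start and end of each DNA segment and simply checks whether any position in the new segment already lies within an existing range. As others suggested, a dict handles the multiple chromosomes.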

class DNASegments():
    def __init__(self, segments):
        segments = self._convert(segments)
        self.segments = {}
        for s in segments:
            self.add_segment(s)

    def add_segment(self, s):
        chromosome = s[0]
        positions = range(s[1], s[2]+1)

        if chromosome not in self.segments:
            self.segments[chromosome] = []

        # Keep the segment only if none of its positions lies in an existing range
        if not any(p in range(ps[0], ps[1] + 1)
                   for ps in self.segments[chromosome] for p in positions):
            self.segments[chromosome].append([s[1], s[2]])

        self._sort()

    def add_list(self, segments):
        segments = self._convert(segments)
        for s in segments:
            self.add_segment(s)
        self._sort()

    def _convert(self,segments):
        return [[int(s[0]), int(s[1]), int(s[2])] for s in segments]
    def _sort(self):
        for key in self.segments.keys():
            self.segments[key] = sorted(self.segments[key])

    def __repr__(self):
        return str(self.segments)
    def __str__(self):
        return str(self.segments)
    def __getitem__(self, i):
        return self.segments[i]



expos_list = [['1', '10', '30'], ['1', '50', '80'], ['1', '100', '200']]  
pos_list = [['1', '12', '25'], ['1', '90', '98'], ['1', '130', '180'], 
            ['2', '25', '50']]

dnaseg = DNASegments(expos_list)

dnaseg
>>> {1: [[10, 30], [50, 80], [100, 200]]}

dnaseg.add_list(pos_list)

dnaseg
>>> {1: [[10, 30], [50, 80], [90, 98], [100, 200]], 2: [[25, 50]]}
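The result above is a dict of integer ranges, while the question asked for a flat list of string sublists. A minimal sketch of converting back, assuming the dict has the shape shown in the last output; `segments_to_list` is a hypothetical helper, not a method of DNASegments:

```python
def segments_to_list(segments):
    """Hypothetical helper: {chrom: [[start, end], ...]} -> [[chrom, start, end], ...] as strings."""
    return [[str(chrom), str(start), str(end)]
            for chrom in sorted(segments)
            for start, end in segments[chrom]]

# Same shape as the final dnaseg output above
segments = {1: [[10, 30], [50, 80], [90, 98], [100, 200]], 2: [[25, 50]]}
print(segments_to_list(segments))
# [['1', '10', '30'], ['1', '50', '80'], ['1', '90', '98'], ['1', '100', '200'], ['2', '25', '50']]
```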