Python list manipulation: group by count

Time: 2015-12-07 11:22:22

Tags: python list group-by

I need to implement an algorithm that takes a list as its argument, e.g.

[0, 0, 0, 0, 6, 6, 6, 6, 6, 6, 5, 5]

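and performs the following:

  1. Check whether the length of each small sequence (i.e. the runs of 0s or 5s) is less than 0.5 of the length of the large sequence. For example, the sequence [5,5] has length 2, which is 0.33 of the length (6) of the large sequence of 6s.

  2. Then merge the small sequences into the large sequence. The output for the example above should be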

    [0,0,0,0,6,6,6,6,6,6,6,6]

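  3. I have the following code, which seems rather complicated; I'm sure it could be redesigned more efficiently. It also has a bug when the input contains a repeated sequence, e.g.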

    [0, 0, 0, 0, 6, 6, 6, 6, 6, 6, 0, 0]
    

    Code:

    def __radiate(self,threegroupslist,gr1,gr2,gr3):
            ### group by and count
            ### if chronologically the 2nd centroid seq. is below a threshold (function of the largest sequence)
            ### merge this sequence into the largest sequence
            inputlist = threegroupslist
            df = pd.DataFrame(threegroupslist)
            df.columns = ['cluster']
            groups = df.groupby('cluster').groups
            len1=len(groups.get(gr1))
            len2=len(groups.get(gr2))
            len3=len(groups.get(gr3))
            min(len1,len2,len3)/max(len1,len2,len3)
            minseq = min(len1,len2,len3)
            maxseq = max(len1,len2,len3)
            #map(lambda x: min(x), [i for i, x in enumerate(groups)])
            #if (minseq/maxseq < 0.5):
             #   indices = [i for i, x in enumerate(threegroupslist) if x == gr2]
              #  threegroupslist[indices] = gr1
            if (len1 > len3 and len2/len1 < 0.5):
                # merge sublist2 into sublist 1
                indices = [i for i, x in enumerate(threegroupslist) if x == gr2]
                threegroupslist[indices] = gr1
            elif (len1 <= len3 and len2/len3 < 0.5):
                # merge sublist2 into sublist 3
                indices = [i for i, x in enumerate(threegroupslist) if x == gr2]
                threegroupslist[indices] = gr3
            elif (len3/len2 < 0.5):# len2 is the biggest and len3 is the smallest
                # merge sublist3 into sublist 2
                indices = [i for i, x in enumerate(threegroupslist) if x == gr3]
                threegroupslist[indices] = gr2
            elif (len1/len2 < 0.5):# len2 is the biggest and len1 is the smallest:
                # merge sublist2 into sublist 1
                indices = [i for i, x in enumerate(threegroupslist) if x == gr1]
                threegroupslist[indices] = gr2
            else:
                pass
            return threegroupslist
    

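    I'm new to Python, so apologies if I've made any silly coding mistakes.

2 Answers:

Answer 0: (score: 1)

Does it have to be pandas? The first thing that comes to mind is collections.Counter, which is part of the standard library.

We also need the math module, and then we can define the sequences.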

import math

import collections as cl

sequences = [0,0,0,0,0,6,6,6,6,6,6,5,5]

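Use Counter to get the number of occurrences of each value (the length of each subsequence, as long as no value appears in more than one run). The values are then ordered by that count.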

collected_sequences = cl.Counter(sequences)
collected_sequences_ordered = collected_sequences.most_common()
most_common_value, most_common_count = collected_sequences_ordered[0]
threshold = math.floor(most_common_count * 0.5)

A list comprehension comes in handy, once again :)

# [fixed comparison]    
final_result = [value if times > threshold else most_common_value
                for value, times in collected_sequences.items()
                for i in range(times)]
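To make the behaviour of the snippets above easier to check, here is a minimal sketch that wraps them in one function. The name `merge_by_count` and the `ratio` parameter are my own, not part of the answer, and the output order assumes Python 3.7+, where `Counter` iterates in first-seen order.

```python
import collections
import math


def merge_by_count(sequences, ratio=0.5):
    """Rebuild the list value by value, replacing rare values.

    Hypothetical wrapper around the Counter-based snippet; a value is
    replaced when its total count is at or below the threshold.
    """
    counts = collections.Counter(sequences)
    most_common_value, most_common_count = counts.most_common(1)[0]
    threshold = math.floor(most_common_count * ratio)
    return [value if times > threshold else most_common_value
            for value, times in counts.items()
            for _ in range(times)]


# The two 5s (count 2 <= threshold 3) are turned into 6s.
print(merge_by_count([0, 0, 0, 0, 0, 6, 6, 6, 6, 6, 6, 5, 5]))
# -> [0, 0, 0, 0, 0, 6, 6, 6, 6, 6, 6, 6, 6]

# Input with a repeated run of 0s: the result is regrouped by value,
# so the original positions of the trailing 0s are lost.
print(merge_by_count([0, 0, 0, 0, 6, 6, 6, 6, 6, 6, 0, 0]))
# -> [0, 0, 0, 0, 0, 0, 6, 6, 6, 6, 6, 6]
```

The second call shows that this version rebuilds the list grouped by value, so it only keeps the original layout when each value forms a single run.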

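I haven't tested this on other sequences. I also didn't bother with the case where two (longest) sequences have the same length. If you want to handle that case in a particular way, this routine has to be adapted, but that shouldn't be too hard.

Update

Suppose the order of the sequences matters. Then we can do the following: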

sequences_to_be_merged = [number for number, count in collected_sequences.items()
                          if count <= threshold]

final_result = [most_common_value if number in sequences_to_be_merged else number
                for number in sequences]

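However, the sequences are not necessarily merged this way. Only one sequence counts as the longest, while all the smaller sequences are converted but stay in the same position. The OP's definition of what should happen here is not very precise.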
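As a rough illustration of that imprecision, the sketch below applies the order-preserving variant to both inputs from the question. The function name `merge_preserving_order` is hypothetical; the logic is the same count-based threshold as above.

```python
import collections
import math


def merge_preserving_order(sequences, ratio=0.5):
    """Replace rare values in place, keeping every element's position.

    Hypothetical helper: counts are per value, not per contiguous run.
    """
    counts = collections.Counter(sequences)
    most_common_value, most_common_count = counts.most_common(1)[0]
    threshold = math.floor(most_common_count * ratio)
    to_merge = {value for value, count in counts.items()
                if count <= threshold}
    return [most_common_value if value in to_merge else value
            for value in sequences]


# Example from the question: the trailing 5s become 6s.
print(merge_preserving_order([0, 0, 0, 0, 6, 6, 6, 6, 6, 6, 5, 5]))
# -> [0, 0, 0, 0, 6, 6, 6, 6, 6, 6, 6, 6]

# Repeated value: 0 occurs 6 times in total, so the short trailing
# run of 0s is NOT merged and the list comes back unchanged.
print(merge_preserving_order([0, 0, 0, 0, 6, 6, 6, 6, 6, 6, 0, 0]))
# -> [0, 0, 0, 0, 6, 6, 6, 6, 6, 6, 0, 0]
```

Because the counts are per value rather than per run, the repeated-sequence input from the question is left untouched; handling it needs a run-based approach.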

Answer 1: (score: 0)

Here is what I would suggest so far:

import operator
from functools import reduce  # reduce is not a builtin in Python 3

def compress(sequence):
    """Packs source sequence into sub-sequences and finds the longest"""
    # Return the same 3-tuple shape as the normal path for empty input
    if not sequence: return [], 0, None
    max_length = 0
    max_item = None
    cur_length = 0
    cur_item = sequence[0]
    compressed = []
    for item in sequence:
        if item == cur_item:
            cur_length += 1
        else:
            # A run just ended: record it and start counting the next one
            compressed.append((cur_item, cur_length))
            if cur_length >= max_length:
                max_length, max_item = cur_length, cur_item
            cur_item, cur_length = item, 1
    # Forgot to handle the last sequence
    compressed.append((cur_item, cur_length))
    if cur_length >= max_length:
        max_length, max_item = cur_length, cur_item
    # End of fixes
    return compressed, max_length, max_item

if __name__ == '__main__':
    sequence = [1,1,1,1,2,2,3,3,6,6,6,6,6,6,6,6,4,4,4,4,4,3,3,7,7,7,1,1]
    compressed, max_length, max_item = compress(sequence)

    print('original:', sequence)
    print('compressed:', compressed)
    print('max_length:', max_length, 'max_item:', max_item)
    print('uncompressed:', [[item] * length for item, length in compressed]) # This is for example only

    # Runs at most half as long as the longest run take the longest run's value
    pre_result = [[max_item if length * 2 <= max_length else item] * length for item, length in compressed]
    print('uncompressed with replace:', pre_result)
    result = reduce(operator.add, pre_result) # Join all sequences into one list again
    print('result:', result)

I'm dumping a lot of intermediate information to demonstrate the process. The resulting output:

original: [1, 1, 1, 1, 2, 2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 4, 4, 4, 4, 4, 3, 3, 7, 7, 7, 1, 1]
compressed: [(1, 4), (2, 2), (3, 2), (6, 8), (4, 5), (3, 2), (7, 3), (1, 2)]
max_length: 8 max_item: 6
uncompressed: [[1, 1, 1, 1], [2, 2], [3, 3], [6, 6, 6, 6, 6, 6, 6, 6], [4, 4, 4, 4, 4], [3, 3], [7, 7, 7], [1, 1]]
uncompressed with replace: [[6, 6, 6, 6], [6, 6], [6, 6], [6, 6, 6, 6, 6, 6, 6, 6], [4, 4, 4, 4, 4], [6, 6], [6, 6, 6], [6, 6]]
result: [6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 4, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 6]
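The same run-length packing can also be sketched with `itertools.groupby` from the standard library. This is an alternative I'm adding for comparison, not the code above: the name `merge_short_runs` and the `ratio` parameter are hypothetical, and on ties it picks the first longest run, whereas `compress` picks the last.

```python
from itertools import groupby


def merge_short_runs(sequence, ratio=0.5):
    """Replace every run at most `ratio` times as long as the longest run.

    Hypothetical run-based variant: lengths are measured per contiguous
    run, so a value that appears in two separate runs is handled per run.
    """
    # groupby yields (value, iterator over one contiguous run)
    runs = [(item, len(list(group))) for item, group in groupby(sequence)]
    if not runs:
        return []
    # max() returns the first run of maximal length
    max_item, max_length = max(runs, key=lambda run: run[1])
    result = []
    for item, length in runs:
        replacement = max_item if length <= max_length * ratio else item
        result.extend([replacement] * length)
    return result


# The repeated-sequence input from the question: the short trailing
# run of 0s is merged, the longer leading run of 0s is kept.
print(merge_short_runs([0, 0, 0, 0, 6, 6, 6, 6, 6, 6, 0, 0]))
# -> [0, 0, 0, 0, 6, 6, 6, 6, 6, 6, 6, 6]
```

With the default `ratio` of 0.5 the comparison is equivalent to the `length * 2 <= max_length` test used above.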

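I'm not sure about your memory constraints, but for now I assume you can afford the extra lists. If you can't, the solution will involve more index crawling over the list and fewer list comprehensions.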
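For what it's worth, an index-crawling version might look roughly like the sketch below. It is only an illustration under my own assumptions: the names `iter_runs` and `merge_short_runs_in_place` are hypothetical, the input must be a mutable list, and ties go to the first longest run.

```python
def iter_runs(sequence):
    """Yield (start, end) index pairs for each contiguous run."""
    start, n = 0, len(sequence)
    while start < n:
        end = start
        while end < n and sequence[end] == sequence[start]:
            end += 1
        yield start, end
        start = end


def merge_short_runs_in_place(sequence, ratio=0.5):
    """Overwrite short runs in the list itself, without building new lists."""
    # Pass 1: find the longest run (first one wins on ties)
    max_length, max_item = 0, None
    for start, end in iter_runs(sequence):
        if end - start > max_length:
            max_length, max_item = end - start, sequence[start]
    # Pass 2: overwrite short runs; only elements behind the cursor are
    # modified, so the boundaries of later runs are not affected
    for start, end in iter_runs(sequence):
        if end - start <= max_length * ratio:
            for i in range(start, end):
                sequence[i] = max_item
    return sequence


data = [0, 0, 0, 0, 6, 6, 6, 6, 6, 6, 0, 0]
merge_short_runs_in_place(data)
print(data)
# -> [0, 0, 0, 0, 6, 6, 6, 6, 6, 6, 6, 6]
```

This trades the intermediate lists for two passes over the input, using only a few integers of extra state.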