Question

I need to implement an algorithm that takes a list as its argument, e.g.

[0, 0, 0, 0, 6, 6, 6, 6, 6, 6, 5, 5]

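and performs the following:

Check whether the length of a small sequence (i.e. a run of 0s or 5s) is less than 0.5 of the length of the large sequence; e.g. the run [5, 5] has length 2, which is 0.33 of the length of the large run of 6s (length 6).
Then merge the small sequence into the large one. The output for the example above should be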

[0,0,0,0,6,6,6,6,6,6,6,6]

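I have the following code, which seems rather complicated, and I am sure it can be redesigned more efficiently. It also has a bug when the input contains a repeated sequence, e.g.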

[0, 0, 0, 0, 6, 6, 6, 6, 6, 6, 0, 0]

Code:

def __radiate(self,threegroupslist,gr1,gr2,gr3):
        ### group by and count
        ### if chronologically the 2nd centroid seq. is below a threshold (function of the largest sequence)
        ### merge this sequence into the largest sequence
        inputlist = threegroupslist
        df = pd.DataFrame(threegroupslist)
        df.columns = ['cluster']
        groups = df.groupby('cluster').groups
        len1=len(groups.get(gr1))
        len2=len(groups.get(gr2))
        len3=len(groups.get(gr3))
        min(len1,len2,len3)/max(len1,len2,len3)
        minseq = min(len1,len2,len3)
        maxseq = max(len1,len2,len3)
        #map(lambda x: min(x), [i for i, x in enumerate(groups)])
        #if (minseq/maxseq < 0.5):
         #   indices = [i for i, x in enumerate(threegroupslist) if x == gr2]
          #  threegroupslist[indices] = gr1
        if (len1 > len3 and len2/len1 < 0.5):
            # merge sublist2 into sublist 1
            indices = [i for i, x in enumerate(threegroupslist) if x == gr2]
            threegroupslist[indices] = gr1
        elif (len1 <= len3 and len2/len3 < 0.5):
            # merge sublist2 into sublist 3
            indices = [i for i, x in enumerate(threegroupslist) if x == gr2]
            threegroupslist[indices] = gr3
        elif (len3/len2 < 0.5):# len2 is the biggest and len3 is the smallest
            # merge sublist3 into sublist 2
            indices = [i for i, x in enumerate(threegroupslist) if x == gr3]
            threegroupslist[indices] = gr2
        elif (len1/len2 < 0.5):# len2 is the biggest and len1 is the smallest:
            # merge sublist2 into sublist 1
            indices = [i for i, x in enumerate(threegroupslist) if x == gr1]
            threegroupslist[indices] = gr2
        else:
            pass
        return threegroupslist

I am new to Python, so apologies if I have made silly coding mistakes.

Answer 1

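Does it have to be pandas? The first thing that comes to mind is collections.Counter, which is part of the standard library.

We also need the math module, and then we can define the sequence.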

import math
import collections as cl

sequences = [0,0,0,0,0,6,6,6,6,6,6,5,5]

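Use Counter to get the length of each subsequence (strictly, the number of occurrences of each value, which is the same thing as long as no value appears in more than one run). The subsequences are then ordered by their length.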

collected_sequences = cl.Counter(sequences)
collected_sequences_ordered = collected_sequences.most_common()
most_common_value, most_common_count = collected_sequences_ordered[0]
threshold = math.floor(most_common_count * 0.5)
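To make the intermediate values concrete, here is a small check of what those calls produce for the example list. The variable names are shortened for illustration and are not part of the answer's code.

```python
import collections
import math

sequences = [0, 0, 0, 0, 0, 6, 6, 6, 6, 6, 6, 5, 5]

counts = collections.Counter(sequences)
ordered = counts.most_common()           # (value, count) pairs, largest count first
top_value, top_count = ordered[0]
threshold = math.floor(top_count * 0.5)  # half of the longest length, rounded down

print(ordered)    # [(6, 6), (0, 5), (5, 2)]
print(threshold)  # 3
```

So only the run of 5s (length 2) falls at or below the threshold and gets merged into the 6s.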

A list comprehension comes in handy here, once again :)

# [fixed comparison]    
final_result = [value if times > threshold else most_common_value
                for value, times in collected_sequences.items()
                for i in range(times)]

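I have not tested this on other sequences. I also did not bother with the case where the two longest sequences have the same length. If you want to handle that case in a particular way, the routine has to be adjusted, but that should not be too hard.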
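On the tie case: as far as I can tell from the collections documentation, most_common() lists values with equal counts in the order they were first encountered (Python 3.7+), so the value that appears first in the list would be picked as the "longest". A quick check with the second input from the question:

```python
import collections

tied = [0, 0, 0, 0, 6, 6, 6, 6, 6, 6, 0, 0]
counts = collections.Counter(tied)

# Both runs of 0 are counted together, so 0 and 6 tie at six occurrences each.
print(counts.most_common())  # [(0, 6), (6, 6)]
```

Note that this input also shows the limit of counting by value: the two separate runs of 0 are indistinguishable to Counter.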

Update

Suppose the order of the sequences matters. Then we can do the following:

sequences_to_be_merged = [number for number, count in collected_sequences.items()
                          if count <= threshold]

final_result = [most_common_value if number in sequences_to_be_merged else number
                for number in sequences]

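However, the sequences are not necessarily merged this way. Only one sequence is the longest, and all the smaller sequences are converted but stay in the same position. The OP's definition of what should happen here is not very precise.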
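If the intent is to compare consecutive runs rather than total counts per value, itertools.groupby is one way to get the runs. The following is only a sketch under my reading of the question: merge_short_runs and its ratio parameter are names I made up, a run is merged when it is strictly shorter than ratio times the longest run, and the first longest run wins a tie.

```python
from itertools import groupby

def merge_short_runs(seq, ratio=0.5):
    """Replace every run shorter than ratio * longest run with the longest run's value."""
    # groupby yields one (value, iterator) pair per consecutive run
    runs = [(value, len(list(group))) for value, group in groupby(seq)]
    if not runs:
        return []
    # max() returns the first longest run when several runs tie
    top_value, top_length = max(runs, key=lambda run: run[1])
    result = []
    for value, length in runs:
        keep = length >= ratio * top_length
        result.extend([value if keep else top_value] * length)
    return result

print(merge_short_runs([0, 0, 0, 0, 6, 6, 6, 6, 6, 6, 5, 5]))
# [0, 0, 0, 0, 6, 6, 6, 6, 6, 6, 6, 6]
print(merge_short_runs([0, 0, 0, 0, 6, 6, 6, 6, 6, 6, 0, 0]))
# [0, 0, 0, 0, 6, 6, 6, 6, 6, 6, 6, 6]
```

With runs instead of counts, the trailing [0, 0] is treated separately from the leading run of four 0s, which seems to be what the question's second input needs.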

Answer 2

Here is what I would suggest so far:

import operator
from functools import reduce  # reduce is not a builtin in Python 3

def compress(sequence):
    """Packs source sequence into sub-sequences and finds the longest"""
    if not sequence:
        return [], 0, None  # same three-value shape the caller unpacks
    max_length = 0
    max_item = None
    cur_length = 0
    cur_item = sequence[0]
    compressed = []
    for item in sequence:
        if item == cur_item:
            cur_length += 1
        else:
            # A run just ended: record it and start counting the next one
            compressed.append((cur_item, cur_length))
            if cur_length >= max_length:
                max_length, max_item = cur_length, cur_item
            cur_item, cur_length = item, 1
    # The loop never appends the last run, so handle it here
    compressed.append((cur_item, cur_length))
    if cur_length >= max_length:
        max_length, max_item = cur_length, cur_item
    return compressed, max_length, max_item

if __name__ == '__main__':
    sequence = [1,1,1,1,2,2,3,3,6,6,6,6,6,6,6,6,4,4,4,4,4,3,3,7,7,7,1,1]
    compressed, max_length, max_item = compress(sequence)

    print('original:', sequence)
    print('compressed:', compressed)
    print('max_length:', max_length, 'max_item:', max_item)
    print('uncompressed:', [[item] * length for item, length in compressed]) # This is for example only

    # Runs at most half as long as the longest run take the longest run's value
    pre_result = [[max_item if length * 2 <= max_length else item] * length for item, length in compressed]
    print('uncompressed with replace:', pre_result)
    result = reduce(operator.add, pre_result) # Join all sequences into one list again
    print('result:', result)

I am dumping a lot of intermediate information to demonstrate the process. The resulting output:

original: [1, 1, 1, 1, 2, 2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 4, 4, 4, 4, 4, 3, 3, 7, 7, 7, 1, 1]
compressed: [(1, 4), (2, 2), (3, 2), (6, 8), (4, 5), (3, 2), (7, 3), (1, 2)]
max_length: 8 max_item: 6
uncompressed: [[1, 1, 1, 1], [2, 2], [3, 3], [6, 6, 6, 6, 6, 6, 6, 6], [4, 4, 4, 4, 4], [3, 3], [7, 7, 7], [1, 1]]
uncompressed with replace: [[6, 6, 6, 6], [6, 6], [6, 6], [6, 6, 6, 6, 6, 6, 6, 6], [4, 4, 4, 4, 4], [6, 6], [6, 6, 6], [6, 6]]
result: [6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 4, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 6]

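I am not sure about your memory constraints, but for now I assume you can afford the extra lists. If you cannot, the solution will involve more crawling over the list by index and fewer list comprehensions.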
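For what it is worth, an index-based variant could look roughly like the sketch below. It is an illustration rather than a tuned implementation: merge_in_place is a made-up name, it modifies the list it is given, and it uses the same rule as above (a run is replaced when twice its length is at most the longest length, and the last longest run wins a tie).

```python
def merge_in_place(seq):
    """Overwrite short runs in seq with the value of the longest run; returns None."""
    n = len(seq)

    # Pass 1: find the longest run without building any helper list.
    max_length, max_item = 0, None
    start = 0
    for i in range(1, n + 1):
        if i == n or seq[i] != seq[start]:   # the run seq[start:i] just ended
            length = i - start
            if length >= max_length:
                max_length, max_item = length, seq[start]
            start = i

    # Pass 2: overwrite each run that is at most half as long as the longest.
    # A run is only overwritten after its end was detected, so run boundaries
    # are always found on the original values.
    start = 0
    for i in range(1, n + 1):
        if i == n or seq[i] != seq[start]:
            if (i - start) * 2 <= max_length:
                for j in range(start, i):
                    seq[j] = max_item
            start = i

data = [1,1,1,1,2,2,3,3,6,6,6,6,6,6,6,6,4,4,4,4,4,3,3,7,7,7,1,1]
merge_in_place(data)
print(data)  # 16 sixes, then 5 fours, then 7 sixes
```

This trades the intermediate lists for two passes over the input, so the extra memory stays constant.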
