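I need to implement an algorithm that takes a list as its argument, e.g.
[0, 0, 0, 0, 6, 6, 6, 6, 6, 6, 5, 5]
and does the following:
Check whether the ratio of the length of each small sequence (i.e. the ones starting with 0 or 5) to the length of the large sequence is below 0.5. For example, the sequence [5,5] has length 2, which is 0.33 of the length of the large sequence of 6s (length 6).
Then merge the small sequence into the large one. The output for the example above should be
[0,0,0,0,6,6,6,6,6,6,6,6]
I have the following code, which seems rather complicated, and I'm sure it could be redesigned more efficiently. It also has a bug when the input contains a repeated sequence, e.g.
[0, 0, 0, 0, 6, 6, 6, 6, 6, 6, 0, 0]
Code: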
def __radiate(self,threegroupslist,gr1,gr2,gr3):
    ### group by and count
    ### if chronologically the 2nd centroid seq. is below a threshold (function of the largest sequence)
    ### merge this sequence into the largest sequence
    inputlist = threegroupslist
    df = pd.DataFrame(threegroupslist)
    df.columns = ['cluster']
    groups = df.groupby('cluster').groups
    len1=len(groups.get(gr1))
    len2=len(groups.get(gr2))
    len3=len(groups.get(gr3))
    min(len1,len2,len3)/max(len1,len2,len3)
    minseq = min(len1,len2,len3)
    maxseq = max(len1,len2,len3)
    #map(lambda x: min(x), [i for i, x in enumerate(groups)])
    #if (minseq/maxseq < 0.5):
    #    indices = [i for i, x in enumerate(threegroupslist) if x == gr2]
    #    threegroupslist[indices] = gr1
    if (len1 > len3 and len2/len1 < 0.5):
        # merge sublist2 into sublist 1
        indices = [i for i, x in enumerate(threegroupslist) if x == gr2]
        threegroupslist[indices] = gr1
    elif (len1 <= len3 and len2/len3 < 0.5):
        # merge sublist2 into sublist 3
        indices = [i for i, x in enumerate(threegroupslist) if x == gr2]
        threegroupslist[indices] = gr3
    elif (len3/len2 < 0.5):# len2 is the biggest and len3 is the smallest
        # merge sublist3 into sublist 2
        indices = [i for i, x in enumerate(threegroupslist) if x == gr3]
        threegroupslist[indices] = gr2
    elif (len1/len2 < 0.5):# len2 is the biggest and len1 is the smallest:
        # merge sublist2 into sublist 1
        indices = [i for i, x in enumerate(threegroupslist) if x == gr1]
        threegroupslist[indices] = gr2
    else:
        pass
    return threegroupslist
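I'm new to Python, so apologies if I've made any silly coding mistakes.
Answer 0 (score: 1)
Does it have to be pandas? The first thing that comes to mind is collections.Counter, which is part of the standard library.
We also need the math module, and then we can define the sequence.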
import math
import collections as cl
sequences = [0,0,0,0,0,6,6,6,6,6,6,5,5]
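Use Counter to get the length of each subsequence. The subsequences are then sorted by length. Note that Counter counts every occurrence of a value, so the count equals the run length only when each value appears in a single contiguous run.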
collected_sequences = cl.Counter(sequences)
collected_sequences_ordered = collected_sequences.most_common()
most_common_value, most_common_count = collected_sequences_ordered[0]
threshold = math.floor(most_common_count * 0.5)
A list comprehension comes in handy once again :)
# [fixed comparison]
final_result = [value if times > threshold else most_common_value
                for value, times in collected_sequences.items()
                for i in range(times)]
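I haven't tested this on other sequences. I also didn't bother with the case where the two longest sequences have the same length. If you want to handle that case in a particular way, the routine will need adjusting, but that shouldn't be too hard.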
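To try the approach on other inputs, the steps above can be wrapped in a function. This is only a sketch: the name `merge_by_count` and the `ratio` parameter are made up for illustration and are not part of the original answer. It also shows the limitation on the question's second input, where the two runs of 0 are counted together.

```python
import collections
import math


def merge_by_count(sequence, ratio=0.5):
    """Replace values whose total count is at or below ratio * (largest count).

    Hypothetical helper: the output is grouped by value in first-seen order,
    so the original positions are not preserved.
    """
    if not sequence:
        return []
    counts = collections.Counter(sequence)
    most_common_value, most_common_count = counts.most_common(1)[0]
    threshold = math.floor(most_common_count * ratio)
    return [value if times > threshold else most_common_value
            for value, times in counts.items()
            for _ in range(times)]


# The two 5s are replaced by 6s.
print(merge_by_count([0, 0, 0, 0, 0, 6, 6, 6, 6, 6, 6, 5, 5]))
# Counter sees six 0s and six 6s here, so nothing is merged and the
# trailing 0s end up grouped with the leading ones.
print(merge_by_count([0, 0, 0, 0, 6, 6, 6, 6, 6, 6, 0, 0]))
```

Because the counts are per value rather than per run, this variant probably does not do what the question wants when a value appears in more than one run.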
Update
Suppose the order of the sequences matters. Then we can do the following:
sequences_to_be_merged = [number for number, count in collected_sequences.items()
                          if count <= threshold]
final_result = [most_common_value if number in sequences_to_be_merged else number
                for number in sequences]
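However, the sequences are not really merged this way. Only one sequence counts as the longest, and all smaller sequences are converted to its value but stay in the same positions. The OP's definition of what should happen here is not very precise.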
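If "sequence" is read as a contiguous run rather than as all occurrences of a value, one possible alternative is to split the list into runs with `itertools.groupby`. The sketch below rests on two assumptions of mine: a short run is overwritten with the value of the longest run (not of its neighbour), and ties for the longest run go to the first one. The name `merge_short_runs` is hypothetical.

```python
from itertools import groupby


def merge_short_runs(sequence, ratio=0.5):
    """Overwrite every run of length <= ratio * (longest run) with the longest run's value."""
    # Run-length encode: [(value, run_length), ...] in original order.
    runs = [(value, len(list(group))) for value, group in groupby(sequence)]
    if not runs:
        return []
    # max() returns the first maximal element, so ties go to the earliest run.
    longest_value, longest_length = max(runs, key=lambda run: run[1])
    result = []
    for value, length in runs:
        merged = longest_value if length <= ratio * longest_length else value
        result.extend([merged] * length)
    return result


print(merge_short_runs([0, 0, 0, 0, 6, 6, 6, 6, 6, 6, 5, 5]))
print(merge_short_runs([0, 0, 0, 0, 6, 6, 6, 6, 6, 6, 0, 0]))
```

Since each run is judged on its own length, the trailing `[0, 0]` in the second input is treated separately from the leading run of four 0s.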
Answer 1 (score: 0)
Here is what I would suggest so far:
import operator
from functools import reduce  # reduce is no longer a builtin in Python 3

def compress(sequence):
    """Packs source sequence into sub-sequences and finds the longest"""
    if not sequence:
        return [], 0, None  # same shape as the normal return value
    max_length = 0
    max_item = None
    cur_length = 0
    cur_item = sequence[0]
    compressed = []
    for item in sequence:
        if item == cur_item:
            cur_length += 1
        else:
            compressed.append((cur_item, cur_length))
            if cur_length >= max_length:
                max_length, max_item = cur_length, cur_item
            cur_item, cur_length = item, 1
    # Forgot to handle the last sequence
    compressed.append((cur_item, cur_length))
    if cur_length >= max_length:
        max_length, max_item = cur_length, cur_item
    # End of fixes
    return compressed, max_length, max_item

if __name__ == '__main__':
    sequence = [1,1,1,1,2,2,3,3,6,6,6,6,6,6,6,6,4,4,4,4,4,3,3,7,7,7,1,1]
    compressed, max_length, max_item = compress(sequence)
    print('original:', sequence)
    print('compressed:', compressed)
    print('max_length:', max_length, 'max_item:', max_item)
    print('uncompressed:', [[item] * length for item, length in compressed])  # This is for example only
    # Replace every run that is at most half as long as the longest run
    pre_result = [[max_item if length * 2 <= max_length else item] * length for item, length in compressed]
    print('uncompressed with replace:', pre_result)
    result = reduce(operator.add, pre_result)  # Join all sequences into one list again
    print('result:', result)
I'm dumping a lot of intermediate information to demonstrate the process. Resulting output:
original: [1, 1, 1, 1, 2, 2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 4, 4, 4, 4, 4, 3, 3, 7, 7, 7, 1, 1]
compressed: [(1, 4), (2, 2), (3, 2), (6, 8), (4, 5), (3, 2), (7, 3), (1, 2)]
max_length: 8 max_item: 6
uncompressed: [[1, 1, 1, 1], [2, 2], [3, 3], [6, 6, 6, 6, 6, 6, 6, 6], [4, 4, 4, 4, 4], [3, 3], [7, 7, 7], [1, 1]]
uncompressed with replace: [[6, 6, 6, 6], [6, 6], [6, 6], [6, 6, 6, 6, 6, 6, 6, 6], [4, 4, 4, 4, 4], [6, 6], [6, 6, 6], [6, 6]]
result: [6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 4, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 6]
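I'm not sure about your memory constraints, but for now I'm assuming you can afford the extra lists. If you can't, the solution will involve more index-based walking over the list and fewer list comprehensions.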
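For the memory-constrained case, a rough sketch of the index-walking variant might look like the following. It makes two passes over the list and overwrites short runs in place instead of building intermediate lists. The function name `merge_short_runs_in_place` and the `ratio` parameter are hypothetical; the comparison rules (`>=` for the longest run, "at most half" for replacement) are chosen to mirror the code above.

```python
def merge_short_runs_in_place(sequence, ratio=0.5):
    """Overwrite short runs inside `sequence` itself; no per-run lists are built."""
    n = len(sequence)
    if n == 0:
        return sequence

    # Pass 1: find the value and length of the longest run
    # (>= means the last of several equally long runs wins).
    max_item, max_length = sequence[0], 0
    start = 0
    for i in range(1, n + 1):
        if i == n or sequence[i] != sequence[start]:
            if i - start >= max_length:
                max_item, max_length = sequence[start], i - start
            start = i

    # Pass 2: overwrite every run that is at most ratio * max_length long.
    # Only indices below i are modified, so run detection is unaffected.
    start = 0
    for i in range(1, n + 1):
        if i == n or sequence[i] != sequence[start]:
            if i - start <= ratio * max_length:
                for j in range(start, i):
                    sequence[j] = max_item
            start = i
    return sequence


data = [1,1,1,1,2,2,3,3,6,6,6,6,6,6,6,6,4,4,4,4,4,3,3,7,7,7,1,1]
merge_short_runs_in_place(data)
print(data)
```

The trade-off is that the input list is mutated, so callers who still need the original would have to copy it first.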