Efficiently encoding (RLE) an array of generally consecutive line numbers

Date: 2017-04-27 19:52:55

Tags: python arrays encoding compression difference

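I have some text transformation code (a DSL engine that translates DSL code into native code), and I am trying to make errors easier to diagnose by mapping line numbers in the generated code back to the DSL source lines they originate from, rather than reporting lines of the generated source.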

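To do this, during transformation I track the original line number for every line of native code that is translated or emitted. In development, I emit the line numbers at the end as a plain list; this makes them easy for humans to use, e.g. __mapping__[line] immediately retrieves the original line number. In production, however, keeping that many (potentially large) lists of arbitrary numbers around purely for very exceptional events is suboptimal. Instead, I would like to encode this list in a sensible way, with little regard for encoding cost (there is all the time in the world when generating code) and a preference for simple, fast decoding.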

Experimenting, I came up with:

from base64 import b64encode
from zlib import compress

def redelta_encode(numbers):
    """Encode a series of line numbers as the difference from line to line (deltas) to reduce entropy.

    The delta is stored as a single signed byte per line, meaning no two consecutive lines may vary by more than +/-
    127 lines within the original source material. The resulting bytestring is then zlib compressed and b64 encoded.

    Lines without line numbers (none or unexpected zeros) will inherit the last known line number after decoding.
    """

    def inner():
        lines = iter(numbers)
        prev = next(lines)

        for line in lines:
            delta = (line or prev) - prev  # Handle the "no line number given" case.
            if delta < 0: delta = -1 * delta + 127  # Store "signed" values.
            prev = (line or prev)  # Track our line number, or the last known good one.
            yield delta  # Store the delta.

    return b64encode(compress(bytes(bytearray(inner())))).decode('latin1')
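
The question asks for simple, fast decoding but shows only the encoder. A minimal sketch of the inverse might look like the following, assuming the byte layout produced by redelta_encode above (byte values above 127 stand for negative deltas stored as 127 + abs(delta)). The name redelta_decode and its first parameter are hypothetical; first is needed because the encoder never stores the initial line number.

```python
from base64 import b64decode, b64encode
from zlib import compress, decompress


def redelta_decode(encoded, first=0):
    """Hypothetical inverse of redelta_encode: yield absolute line numbers.

    `first` is the initial line number, which the encoder does not store.
    """
    line = first
    yield line

    for byte in bytearray(decompress(b64decode(encoded))):
        # Values above 127 are negative deltas stored as 127 + abs(delta).
        delta = byte if byte <= 127 else -(byte - 127)
        line += delta
        yield line


# Usage: decode a small hand-built payload of deltas +1, +1, 0, -3, +2.
payload = b64encode(compress(bytes(bytearray([1, 1, 0, 130, 2])))).decode('latin1')
print(list(redelta_decode(payload, first=10)))  # [10, 11, 12, 12, 9, 11]
```

Because this is a generator, a caller that needs only one entry can stop iterating early, although the zlib payload is still decompressed in full first.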

In testing, this actually seems to be a pathological case for zlib:

In [1]: redelta_encode(list(range(10000)))
Out[1]: 'eJztwQEJAAAAwyDWv/RzHNQCAAAAAAAAAACAewMwvCcQ'

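The larger the range, the more runs of A appear in the output, leaving the compressed result ironically highly compressible itself. Is there an algorithmically optimal, simpler, or otherwise preferable way to store such generally uniform or monotonic lists of integers? Obviously, in the perfect case I would store nothing, since no line number translation would be needed, but most DSLs transform or generate lines, which adds irregularity to the overall pattern.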

Note that this is for the FOSS marrow/dsl (DSL engine) and cinje (a template engine that uses it) projects. Thanks in advance!

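Edit, a year after this was closed as inactive: I finally worked out the proper name for what I was looking for; this is a form of run-length encoding (RLE) tuned for encoding range(). I believe the pre-processing technique (converting actual line numbers into line-to-line deltas) helps with this. I have prepared an example translation of the cinje.std.html module, which shows the line number mapping on line 231 of 02-generated.py.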

1 answer:

Answer 0 (score: 0)

Indeed, run-length encoding appears to be the solution to my problem. Given a __mapping__ like the following (from my example translation):

__mapping__ = [0,2,3,4,5,6,6,6,6,6,6,6,6,6,7,8,9,10,11,12,12,12,12,13,14,15,16,17,19,20,21,22,23,23,23,24,25,25,25,28,29,29,29,31,32,32,32,34,35,35,35,35,38,39,40,40,40,40,41,42,43,44,44,44,46,47,47,47,47,50,51,52,52,52,52,53,54,55,56,58,59,60,60,60,61,62,63,65,66,67,67,67,68,68,68,68,68,68,69,70,71,73,74,74,76,77,78,79,79,79,79,80,81,81,81,81,81,82,82,82,82,82,82,83,83,85,86,87,87,87,87,88,89,89,89,90,90,90,90,90,90,91,91,93,94,95,95,95,95,96,97,97,97,98,99,100,101,103,103,105,106,107,107,107,107,108,109,109,109,109,110,111,112,113,115,115,115,115,117,118,119,119,119,119,120,121,121,121,121,121,121,123,124,125,125,125,125,126,127,128,129,130,131,132,133,134,137,138,138,138,138,139,140,141,141,141,142,142,142,144,145,146,146,146,149,149,149,149,151,151,151]

After conversion to inter-element deltas, the shape of the repeating patterns should become immediately apparent:

deltas = [2,1,1,1,1,0,0,0,0,0,0,0,0,1,1,1,1,1,1,0,0,0,1,1,1,1,1,2,1,1,1,1,0,0,1,1,0,0,3,1,0,0,2,1,0,0,2,1,0,0,0,3,1,1,0,0,0,1,1,1,1,0,0,2,1,0,0,0,3,1,1,0,0,0,1,1,1,1,2,1,1,0,0,1,1,1,2,1,1,0,0,1,0,0,0,0,0,1,1,1,2,1,0,2,1,1,1,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,1,0,2,1,1,0,0,0,1,1,0,0,1,0,0,0,0,0,1,0,2,1,1,0,0,0,1,1,0,0,1,1,1,1,2,0,2,1,1,0,0,0,1,1,0,0,0,1,1,1,1,2,0,0,0,2,1,1,0,0,0,1,1,0,0,0,0,0,2,1,1,0,0,0,1,1,1,1,1,1,1,1,1,3,1,0,0,0,1,1,1,0,0,1,0,0,2,1,1,0,0,3,0,0,0,2,0,0]
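
As an illustration of that conversion, here is a minimal sketch on the first ten entries of the mapping. The helper name to_deltas is hypothetical, and unlike the question's encoder it does not treat a missing (None or zero) line number as "same as the previous line".

```python
def to_deltas(numbers):
    """Differences between neighbouring entries; the first entry itself is not stored."""
    return [b - a for a, b in zip(numbers, numbers[1:])]


sample = [0, 2, 3, 4, 5, 6, 6, 6, 6, 6]
print(to_deltas(sample))  # [2, 1, 1, 1, 1, 0, 0, 0, 0]
```

The long runs of 0 and 1 are what make the delta form a good fit for run-length encoding.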

The RLE form can be generated with the following (quite simple, all things considered) use of groupby (from this answer):

>>> from itertools import groupby
>>> rle = [(k, sum(1 for i in g)) for k,g in groupby(deltas)]; rle
[(2, 1), (1, 4), (0, 8), (1, 6), (0, 3), (1, 5), (2, 1), (1, 4), (0, 2), (1, 2), (0, 2), (3, 1), (1, 1), (0, 2), (2, 1), (1, 1), (0, 2), (2, 1), (1, 1), (0, 3), (3, 1), (1, 2), (0, 3), (1, 4), (0, 2), (2, 1), (1, 1), (0, 3), (3, 1), (1, 2), (0, 3), (1, 4), (2, 1), (1, 2), (0, 2), (1, 3), (2, 1), (1, 2), (0, 2), (1, 1), (0, 5), (1, 3), (2, 1), (1, 1), (0, 1), (2, 1), (1, 3), (0, 3), (1, 2), (0, 4), (1, 1), (0, 5), (1, 1), (0, 1), (2, 1), (1, 2), (0, 3), (1, 2), (0, 2), (1, 1), (0, 5), (1, 1), (0, 1), (2, 1), (1, 2), (0, 3), (1, 2), (0, 2), (1, 4), (2, 1), (0, 1), (2, 1), (1, 2), (0, 3), (1, 2), (0, 3), (1, 4), (2, 1), (0, 3), (2, 1), (1, 2), (0, 3), (1, 2), (0, 5), (2, 1), (1, 2), (0, 3), (1, 9), (3, 1), (1, 1), (0, 3), (1, 3), (0, 2), (1, 1), (0, 2), (2, 1), (1, 2), (0, 2), (3, 1), (0, 3), (2, 1), (0, 2)]

Even before any string serialization, here are some naive statistics for these structures:

>>> import sys
>>> sys.getsizeof(__mapping__)
1912
>>> sys.getsizeof(deltas)
2072
>>> sys.getsizeof(rle)
912
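
One caveat when reading these figures: sys.getsizeof on a list counts only the list object and its array of references, not the integers or tuples it refers to. The sketch below uses a hypothetical helper, deep_sizeof, that also adds the size of each distinct element (and of the members of tuple elements) to give a rough fuller picture. Exact numbers depend on the Python version and platform, so none are shown.

```python
import sys


def deep_sizeof(items):
    """Rough total size of a list: the container plus each distinct object it references."""
    seen = set()
    total = sys.getsizeof(items)
    stack = list(items)

    while stack:
        obj = stack.pop()
        if id(obj) in seen:
            continue  # Shared objects (e.g. cached small ints) are counted once.
        seen.add(id(obj))
        total += sys.getsizeof(obj)
        if isinstance(obj, tuple):
            stack.extend(obj)  # Also count the members of (delta, count) pairs.

    return total


pairs = [(2, 1), (1, 4), (0, 8)]
print(sys.getsizeof(pairs), deep_sizeof(pairs))
```

For a list of tuples the gap between the two numbers can be noticeable; it matters less once the pairs are serialized to a string, which is the eventual goal here.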

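At 912 bytes against 1912, the RLE list is a little under half the original size by this measure, which is not an unreasonable saving. To decode ("decompress") just a single value without expanding the whole list, iterate over the runs on the fly, counting lines as you go and stopping once the target generated line number is reached: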

from itertools import chain, groupby, repeat


__mapping__ = [0, 2, 3, 4, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 8, 9, 10, 11, 12, 12, 12, 12, 13, 14, 15, 16, 17, 19, 20, 21, 22, 23, 23, 23, 24, 25, 25, 25, 28, 29, 29, 29, 31, 32, 32, 32, 34, 35, 35, 35, 35, 38, 39, 40, 40, 40, 40, 41, 42, 43, 44, 44, 44, 46, 47, 47, 47, 47, 50, 51, 52, 52, 52, 52, 53, 54, 55, 56, 58, 59, 60, 60, 60, 61, 62, 63, 65, 66, 67, 67, 67, 68, 68, 68, 68, 68, 68, 69, 70, 71, 73, 74, 74, 76, 77, 78, 79, 79, 79, 79, 80, 81, 81, 81, 81, 81, 82, 82, 82, 82, 82, 82, 83, 83, 85, 86, 87, 87, 87, 87, 88, 89, 89, 89, 90, 90, 90, 90, 90, 90, 91, 91, 93, 94, 95, 95, 95, 95, 96, 97, 97, 97, 98, 99, 100, 101, 103, 103, 105, 106, 107, 107, 107, 107, 108, 109, 109, 109, 109, 110, 111, 112, 113, 115, 115, 115, 115, 117, 118, 119, 119, 119, 119, 120, 121, 121, 121, 121, 121, 121, 123, 124, 125, 125, 125, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 137, 138, 138, 138, 138, 139, 140, 141, 141, 141, 142, 142, 142, 144, 145, 146, 146, 146, 149, 149, 149, 149, 151, 151, 151]


def delta_encode(numbers):
    lines = iter(numbers)
    prev = next(lines)

    for line in lines:
        delta = (line or prev) - prev
        yield delta
        prev = (line or prev)


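def delta_decode(encoded):
    """Expand (delta, count) pairs into absolute line numbers, assuming the first line number is 0."""
    line = 0

    for delta in chain.from_iterable(repeat(k, n) for k, n in encoded):
        yield line
        line += delta

    yield line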


rle = [(k, sum(1 for i in g)) for k, g in groupby(delta_encode(__mapping__))]


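def line_for(target, rle):
    """Return (expected, decoded) one-based source lines for a one-based generated line number."""
    # The array of line numbers is zero-based, but humans don't think about code that way.
    target -= 1

    for line, result in enumerate(delta_decode(rle)):
        if line == target:
            # __mapping__ is consulted only to cross-check the RLE-decoded result.
            return __mapping__[target] + 1, result + 1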


print(line_for(1, rle))
print(line_for(20, rle))
print(line_for(200, rle))

This gives the correct answers (source mapping result on the left, RLE-decoded result on the right):

(1, 1)
(13, 13)
(129, 129)
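
The lookup above still expands every delta up to the target. A possible refinement, sketched here under the same (delta, count) layout, is to skip whole runs arithmetically so that the cost grows with the number of runs rather than the number of lines. The function name rle_lookup and its first parameter are hypothetical, and indices are zero-based.

```python
from itertools import groupby


def rle_lookup(index, rle, first=0):
    """Return the source line for a zero-based generated line index.

    Walks (delta, count) runs and skips whole runs instead of expanding each delta.
    """
    if index < 0:
        raise IndexError("index must be non-negative")

    line = first
    remaining = index  # Number of deltas that still have to be applied.

    for delta, count in rle:
        if remaining <= count:
            return line + delta * remaining
        line += delta * count
        remaining -= count

    if remaining:
        raise IndexError("index beyond the encoded mapping")
    return line


# Usage on a short sample resembling the start of the mapping.
sample = [0, 2, 3, 4, 5, 6, 6, 6, 6, 7]
deltas = [b - a for a, b in zip(sample, sample[1:])]
rle = [(k, sum(1 for _ in g)) for k, g in groupby(deltas)]
print(rle)  # [(2, 1), (1, 4), (0, 3), (1, 1)]
print([rle_lookup(i, rle, first=sample[0]) for i in range(len(sample))] == sample)  # True
```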