Algorithm challenge: Arbitrary in-place base conversion for lossless string compression

Time: 2015-08-20 13:27:06

Tags: algorithm math encoding cryptography compression

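It might help to start with a real-world example. Say I'm writing a web app backed by MongoDB, so my records have a long hex primary key, making the URL to view a record look like /widget/55c460d8e2d6e59da89d08d0. That seems excessively long. URLs can use many more characters than that. While there are just under 8 x 10^28 (16^24) possible values in a 24-digit hex number, limiting yourself to just the characters matched by a [a-zA-Z0-9] regex class (a YouTube video ID uses more), 62 characters, you can get past 8 x 10^28 in only 17 characters.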
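The digit count claimed above can be checked with a few lines of JavaScript. This is only a sanity check of the arithmetic, and `minLength` is a hypothetical helper name. BigInt is used here purely to verify the numbers, not as a proposed conversion method.

```javascript
// Smallest number of base-`toBase` digits that can represent every
// `digits`-digit number written in base `fromBase`.
function minLength(digits, fromBase, toBase) {
  const total = BigInt(fromBase) ** BigInt(digits); // count of distinct values
  let capacity = 1n; // how many values `length` destination digits can hold
  let length = 0;
  while (capacity < total) {
    capacity *= BigInt(toBase);
    length++;
  }
  return length;
}

console.log(minLength(24, 16, 62)); // 17
```

So 16 characters from a 62-character alphabet are not quite enough for a 24-digit hex key, and 17 are.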

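I want an algorithm that will convert any string limited to a specific alphabet of characters into a string over any other alphabet of characters, where the value of each character c can be thought of as alphabet.indexOf(c).

Something of the form:

convert(value, sourceAlphabet, destinationAlphabet)

Assumptions

  • all parameters are strings
  • every character in value exists in sourceAlphabet
  • every character in both sourceAlphabet and destinationAlphabet is unique

Simplest example

var hex = "0123456789abcdef";
var base10 = "0123456789";
var result = convert("12245589", base10, hex); // result is "bada55";

But I also want it to work for converting War & Peace from the Russian alphabet plus some punctuation to the entire Unicode character set and back again losslessly.

Is this possible?

The only way I was ever taught to do base conversions in Comp Sci 101 was to first convert to a base-ten integer by summing digit * base^position and then do the reverse to convert to the target base. Such a method is insufficient for converting very long strings, because the integers get too big.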
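For reference, the Comp Sci 101 method described above might look like the following sketch. `convertNaive` is a hypothetical name, and the code assumes both alphabets contain only single-code-unit (BMP) characters. It shows why the approach breaks down: the whole value has to fit in one Number.

```javascript
// Naive conversion: accumulate the whole value in one Number, then peel off
// destination digits. Only valid while the value stays a safe integer.
function convertNaive(value, sourceAlphabet, destinationAlphabet) {
  const srcBase = sourceAlphabet.length;
  const dstBase = destinationAlphabet.length;

  // Sum digit * base^position (Horner's rule, most significant digit first).
  let n = 0;
  for (const ch of value) {
    n = n * srcBase + sourceAlphabet.indexOf(ch);
    if (!Number.isSafeInteger(n)) {
      throw new RangeError("value too large for a single Number");
    }
  }

  // Reverse step: repeated division by the destination base.
  let out = "";
  do {
    out = destinationAlphabet[n % dstBase] + out;
    n = Math.floor(n / dstBase);
  } while (n > 0);
  return out;
}

console.log(convertNaive("12245589", "0123456789", "0123456789abcdef")); // "bada55"
```

Passing the 24-digit key 55c460d8e2d6e59da89d08d0 as a hex value makes this sketch throw, since the accumulated value exceeds Number.MAX_SAFE_INTEGER after 14 digits.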

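It certainly feels intuitive that a base conversion could be done in place as you step through the string (probably backwards, to maintain standard significant-digit order), keeping track of a remainder somehow, but I'm not smart enough to work out how.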
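One way to make that intuition concrete is schoolbook long division: repeatedly divide the whole digit array by the destination base, carrying a remainder from digit to digit. The sketch below is an illustration under that assumption, not a method taken from the question or the answers. It reuses the `convert` signature proposed above, needs no big-integer library, and runs in quadratic time. Leading zero digits of the input are dropped, so a true round trip of arbitrary text would need the original length stored separately.

```javascript
// Base conversion without big integers: each pass divides the source digit
// array by the destination base and yields one destination digit (the remainder).
function convert(value, sourceAlphabet, destinationAlphabet) {
  const src = Array.from(sourceAlphabet);      // split by code point
  const dst = Array.from(destinationAlphabet);
  const srcBase = src.length;
  const dstBase = dst.length;

  // Map each character to its digit value, most significant first.
  let digits = Array.from(value, (ch) => src.indexOf(ch));
  const out = [];

  while (digits.length > 0) {
    const quotient = [];
    let remainder = 0;
    for (const d of digits) {
      const acc = remainder * srcBase + d;     // always below srcBase * dstBase
      const q = Math.floor(acc / dstBase);
      remainder = acc % dstBase;
      if (quotient.length > 0 || q > 0) quotient.push(q); // skip leading zeros
    }
    out.push(dst[remainder]);                  // least significant digit first
    digits = quotient;
  }
  return out.reverse().join("");
}

console.log(convert("12245589", "0123456789", "0123456789abcdef")); // "bada55"
```

The largest intermediate value is smaller than the product of the two alphabet sizes, so ordinary Numbers suffice even for very long inputs.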

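That's where you come in, StackOverflow. Are you smart enough?

Perhaps this is a solved problem, done on paper by some 18th-century mathematician, implemented in LISP on punch cards in 1970, and the first homework assignment in Cryptography 101, but my searches have borne no fruit.

I'd prefer a solution in JavaScript with a functional style, but any language or style will do, as long as you're not cheating with some big-integer library. Bonus points for efficiency, of course.

Please refrain from criticizing the original example. The general nerd cred of solving the problem is more important than any application of the solution.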

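3 answers:

Answer 0 (score: 2)

Here is a solution in C that is very fast, using bit-shift operations. It assumes that you know what the length of the decoded string should be. The strings are vectors of integers in the range 0..maximum for each alphabet. It is up to the user to convert to and from strings with restricted ranges of characters. As for the "in-place" in the question title, the source and destination vectors can overlap, but only if the source alphabet is not larger than the destination alphabet.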

/*
  recode version 1.0, 22 August 2015

  Copyright (C) 2015 Mark Adler

  This software is provided 'as-is', without any express or implied
  warranty.  In no event will the authors be held liable for any damages
  arising from the use of this software.

  Permission is granted to anyone to use this software for any purpose,
  including commercial applications, and to alter it and redistribute it
  freely, subject to the following restrictions:

  1. The origin of this software must not be misrepresented; you must not
     claim that you wrote the original software. If you use this software
     in a product, an acknowledgment in the product documentation would be
     appreciated but is not required.
  2. Altered source versions must be plainly marked as such, and must not be
     misrepresented as being the original software.
  3. This notice may not be removed or altered from any source distribution.

  Mark Adler
  madler@alumni.caltech.edu
*/

/* Recode a vector from one alphabet to another using intermediate
   variable-length bit codes. */

/* The approach is to use a Huffman code over equiprobable alphabets in two
   directions.  First to encode the source alphabet to a string of bits, and
   second to encode the string of bits to the destination alphabet. This will
   be reasonably close to the efficiency of base-encoding with arbitrary
   precision arithmetic. */

#include <stddef.h>     // size_t
#include <limits.h>     // UINT_MAX, ULLONG_MAX

#if UINT_MAX == ULLONG_MAX
#  error recode() assumes that long long has more bits than int
#endif

/* Take a list of integers source[0..slen-1], all in the range 0..smax, and
   code them into dest[0..*dlen-1], where each value is in the range 0..dmax.
   *dlen returns the length of the result, which will not exceed the value of
   *dlen when called.  If the original *dlen is not large enough to hold the
   full result, then recode() will return non-zero to indicate failure.
   Otherwise recode() will return 0.  recode() will also return non-zero if
   either of the smax or dmax parameters are less than one.  The non-zero
   return codes are 1 if *dlen is not long enough, 2 for invalid parameters,
   and 3 if any of the elements of source are greater than smax.

   Using this same operation on the result with smax and dmax reversed reverses
   the operation, restoring the original vector.  However there may be more
   symbols returned than the original, so the number of symbols expected needs
   to be known for decoding.  (An end symbol could be appended to the source
   alphabet to include the length in the coding, but then encoding and decoding
   would no longer be symmetric, and the coding efficiency would be reduced.
   This is left as an exercise for the reader if that is desired.) */
int recode(unsigned *dest, size_t *dlen, unsigned dmax,
           const unsigned *source, size_t slen, unsigned smax)
{
    // compute sbits and scut, with which we will recode the source with
    // sbits-1 bits for symbols < scut, otherwise with sbits bits (adding scut)
    if (smax < 1)
        return 2;
    unsigned sbits = 0;
    unsigned scut = 1;          // 2**sbits
    while (scut && scut <= smax) {
        scut <<= 1;
        sbits++;
    }
    scut -= smax + 1;

    // same thing for dbits and dcut
    if (dmax < 1)
        return 2;
    unsigned dbits = 0;
    unsigned dcut = 1;          // 2**dbits
    while (dcut && dcut <= dmax) {
        dcut <<= 1;
        dbits++;
    }
    dcut -= dmax + 1;

    // recode a base smax+1 vector to a base dmax+1 vector using an
    // intermediate bit vector (a sliding window of that bit vector is kept in
    // a bit buffer)
    unsigned long long buf = 0;     // bit buffer
    unsigned have = 0;              // number of bits in bit buffer
    size_t i = 0, n = 0;            // source and dest indices
    unsigned sym;                   // symbol being encoded
    for (;;) {
        // encode enough of source into bits to encode that to dest
        while (have < dbits && i < slen) {
            sym = source[i++];
            if (sym > smax) {
                *dlen = n;
                return 3;
            }
            if (sym < scut) {
                buf = (buf << (sbits - 1)) + sym;
                have += sbits - 1;
            }
            else {
                buf = (buf << sbits) + sym + scut;
                have += sbits;
            }
        }

        // if not enough bits to assure one symbol, then break out to a special
        // case for coding the final symbol
        if (have < dbits)
            break;

        // encode one symbol to dest
        if (n == *dlen)
            return 1;
        sym = buf >> (have - dbits + 1);
        if (sym < dcut) {
            dest[n++] = sym;
            have -= dbits - 1;
        }
        else {
            sym = buf >> (have - dbits);
            dest[n++] = sym - dcut;
            have -= dbits;
        }
        buf &= ((unsigned long long)1 << have) - 1;
    }

    // if any bits are left in the bit buffer, encode one last symbol to dest
    if (have) {
        if (n == *dlen)
            return 1;
        sym = buf;
        sym <<= dbits - 1 - have;
        if (sym >= dcut)
            sym = (sym << 1) - dcut;
        dest[n++] = sym;
    }

    // return recoded vector
    *dlen = n;
    return 0;
}

/* Test recode(). */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <assert.h>

// Return a random vector of len unsigned values in the range 0..max.
static void ranvec(unsigned *vec, size_t len, unsigned max) {
    unsigned bits = 0;
    unsigned long long mask = 1;
    while (mask <= max) {
        mask <<= 1;
        bits++;
    }
    mask--;
    unsigned long long ran = 0;
    unsigned have = 0;
    size_t n = 0;
    while (n < len) {
        while (have < bits) {
            ran = (ran << 31) + random();
            have += 31;
        }
        if ((ran & mask) <= max)
            vec[n++] = ran & mask;
        ran >>= bits;
        have -= bits;
    }
}

// Get a valid number from str and assign it to var
#define NUM(var, str) \
    do { \
        char *end; \
        unsigned long val = strtoul(str, &end, 0); \
        var = val; \
        if (*end || var != val) { \
            fprintf(stderr, \
                    "invalid or out of range numeric argument: %s\n", str); \
            return 1; \
        } \
    } while (0)

/* "bet n m len count" generates count test vectors of length len, where each
   entry is in the range 0..n.  Each vector is recoded to another vector using
   only symbols in the range 0..m.  That vector is recoded back to a vector
   using only symbols in 0..n, and that result is compared with the original
   random vector.  Report on the average ratio of input and output symbols, as
   compared to the optimal ratio for arbitrary precision base encoding. */
int main(int argc, char **argv)
{
    // get sizes of alphabets and length of test vector, compute maximum sizes
    // of recoded vectors
    unsigned smax, dmax, runs;
    size_t slen, dsize, bsize;
    if (argc != 5) { fputs("need four arguments\n", stderr); return 1; }
    NUM(smax, argv[1]);
    NUM(dmax, argv[2]);
    NUM(slen, argv[3]);
    NUM(runs, argv[4]);
    dsize = ceil(slen * ceil(log2(smax + 1.)) / floor(log2(dmax + 1.)));
    bsize = ceil(dsize * ceil(log2(dmax + 1.)) / floor(log2(smax + 1.)));

    // generate random test vectors, encode, decode, and compare
    srandomdev();
    unsigned source[slen], dest[dsize], back[bsize];
    unsigned mis = 0, i;
    unsigned long long dtot = 0;
    int ret;
    for (i = 0; i < runs; i++) {
        ranvec(source, slen, smax);
        size_t dlen = dsize;
        ret = recode(dest, &dlen, dmax, source, slen, smax);
        if (ret) {
            fprintf(stderr, "encode error %d\n", ret);
            break;
        }
        dtot += dlen;
        size_t blen = bsize;
        ret = recode(back, &blen, smax, dest, dlen, dmax);
        if (ret) {
            fprintf(stderr, "decode error %d\n", ret);
            break;
        }
        if (blen < slen || memcmp(source, back, slen))  // blen > slen is ok
            mis++;
    }
    if (mis)
        fprintf(stderr, "%u/%u mismatches!\n", mis, i);
    if (ret == 0)
        printf("mean dest/source symbols = %.4f (optimal = %.4f)\n",
               dtot / (i * (double)slen), log(smax + 1.) / log(dmax + 1.));
    return 0;
}
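To illustrate the variable-length bit codes that recode() derives from smax and dmax (sbits/scut and dbits/dcut), here is a small JavaScript sketch of the per-symbol code, a scheme commonly called truncated binary coding. `codeParams` and `encodeSymbol` are hypothetical names that are not part of the C code above, and the sketch assumes max >= 1, as recode() does.

```javascript
// Compute the code parameters for an alphabet 0..max the same way recode()
// does: bits is the longer code length, cut is the number of short codes.
function codeParams(max) {
  let bits = 0;
  let pow = 1;                 // 2**bits
  while (pow <= max) {
    pow *= 2;
    bits++;
  }
  return { bits, cut: pow - (max + 1) };
}

// Symbols below cut get bits-1 bits; the rest get bits bits after adding cut.
function encodeSymbol(sym, max) {
  const { bits, cut } = codeParams(max);
  return sym < cut
    ? sym.toString(2).padStart(bits - 1, "0")
    : (sym + cut).toString(2).padStart(bits, "0");
}

// Print the code table for a base-10 alphabet (max = 9).
for (let sym = 0; sym <= 9; sym++) {
  console.log(sym, encodeSymbol(sym, 9));
}
```

For max = 9, symbols 0..5 take 3 bits and symbols 6..9 take 4 bits, an average of 3.4 bits per symbol against log2(10) ≈ 3.32. That gap is the efficiency loss relative to arbitrary-precision base conversion that the test program reports.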

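Answer 1 (score: 1)

As pointed out in other StackOverflow answers, try not to think of summing digit * base^position as converting to base ten. Rather, think of it as directing the computer to generate a representation, in its own terms, of the quantity represented by the number (for most computers, probably something closer to our concept of base 2). Once the computer has its own representation of the quantity, we can direct it to output the number in any way we like.

By rejecting "big integer" implementations and asking for letter-by-letter conversion, you are at the same time arguing that the numerical/alphabetical representation of a quantity is not actually what it is, namely that each position represents a quantity of digit * base^position. If the nine-millionth character of War and Peace does represent what you are asking to convert it from, then the computer will at some point need to generate a representation for digit * base^position.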

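Answer 2 (score: 0)

I don't think any solution can work in general. If there is no integer e such that n^e == m (taking n and m to be the source and target bases), then for any given MAX_INT there is no way to compute the value of a target-base digit at some position p once n^p > MAX_INT.

You can get away with this in the case where n^e == m for some e, because the problem is then recursively feasible: the first e digits in base n can be summed and converted into the first digit in base m, then chopped off, and the process repeated.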
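The n^e == m special case described above can be sketched as follows. `convertGrouped` is a hypothetical name, the caller is assumed to pass an e for which the destination alphabet has exactly n^e characters, and both alphabets are assumed to consist of single-code-unit characters.

```javascript
// Special case m = n^e: every group of e source digits maps to exactly one
// destination digit, so only small integers are ever needed.
function convertGrouped(value, sourceAlphabet, destinationAlphabet, e) {
  const n = sourceAlphabet.length;
  // Left-pad with the source zero digit so the length is a multiple of e.
  const padLen = (e - (value.length % e)) % e;
  const padded = sourceAlphabet[0].repeat(padLen) + value;

  let out = "";
  for (let i = 0; i < padded.length; i += e) {
    let group = 0;
    for (const ch of padded.slice(i, i + e)) {
      group = group * n + sourceAlphabet.indexOf(ch); // group < n^e = m
    }
    out += destinationAlphabet[group];
  }
  return out;
}

// Binary to hex: n = 2, e = 4, m = 16.
console.log(convertGrouped("101110101101101001010101", "01", "0123456789abcdef", 4)); // "bada55"
```

Each group is converted independently, so the work is linear in the input length and no intermediate value ever reaches m.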

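If you don't have this useful property, then eventually you will have to take some part of the number in the original base and perform a modulus involving n^p, and n^p is going to be greater than MAX_INT, which means it is impossible.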