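The standard string representation of a GUID takes 36 characters (32 hex digits plus 4 hyphens), which is very readable but also really wasteful. I am wondering how to encode it in the shortest possible way using all the ASCII characters in the range 33-127. The naive implementation produces 22 characters, simply because 128 bits / 6 bits per character is 21.33, which rounds up to 22.
Huffman encoding is my second-best idea; the only question is how to choose the codes...
The encoding must be lossless, of course.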
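The character counts discussed in this thread all come from comparing a power of the alphabet size against 2^128. As a quick check, here is a minimal Python sketch; the helper name `min_length` is made up for illustration and is not part of the question.

```python
def min_length(alphabet_size: int, bits: int = 128) -> int:
    """Smallest n such that alphabet_size ** n >= 2 ** bits (exact integer math)."""
    if alphabet_size < 2:
        raise ValueError("alphabet_size must be at least 2")
    length, capacity = 0, 1
    while capacity < 2 ** bits:
        capacity *= alphabet_size
        length += 1
    return length


if __name__ == "__main__":
    # Hex, 6-bit characters, base 85 and the full 95-character range.
    for size in (16, 64, 85, 95):
        print(size, min_length(size))
```

Under this count, 64 symbols need 22 characters, while both 85 and 95 symbols need 20, so the extra ten symbols between 85 and 95 do not buy a shorter string.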
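Answer 0 (score: 31)
This is an old question, but I had to solve it so that a system I was working on could stay backward compatible.
The exact requirement was for a client-generated identifier that would be written to the database and stored in a 20-character unique column. It was never shown to the user and was not indexed in any way.
Since I couldn't eliminate the requirement, I really wanted to use a Guid (which is statistically unique), and if I could encode it losslessly into 20 characters, that would be a good solution given the constraints.
Ascii-85 allows you to encode 4 bytes of binary data into 5 bytes of Ascii data. So a 16-byte guid will just fit into 20 Ascii characters using this encoding scheme. A Guid can have 2^128 (about 3.4028236692093846346e+38) discrete values, whereas 20 characters of Ascii-85 can have 85^20 (about 3.8759531084514355873123178482056e+38) discrete values.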
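The "4 bytes into 5 characters" claim, and the fact that 20 characters cover a whole Guid, can be checked with exact integer arithmetic. The following is only a sketch; `fits` is a hypothetical helper name, not something from the answer.

```python
def fits(base: int, digits: int, num_bytes: int) -> bool:
    """True if `digits` digits in `base` can represent every `num_bytes`-byte value."""
    return base ** digits >= 256 ** num_bytes


if __name__ == "__main__":
    print(fits(85, 5, 4))    # 4 bytes in 5 base-85 characters
    print(fits(84, 5, 4))    # the same with one symbol fewer
    print(fits(85, 10, 8))   # one 64-bit half in 10 characters
    print(fits(85, 20, 16))  # a whole 16-byte Guid in 20 characters
    print(f"{2 ** 128:.4e} < {85 ** 20:.4e}")
```

The second check is the interesting one: with 84 symbols, five characters no longer cover four bytes, which is why 85 is the usual choice for this family of encodings.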
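When writing to the database, I had some concerns about truncation, so no whitespace characters are included in the encoding. I also had issues with collation, which I addressed by excluding lowercase characters from the encoding. Also, the value would only ever be passed through a parameterized command, so any special SQL characters are handled safely without manual escaping.
I've included the C# code to perform Ascii-85 encoding and decoding in case it helps anyone out there. Obviously, depending on your usage, you might need to choose a different character set, as my constraints made me choose some unusual characters like 'ß' and 'Ø' - but that's the easy part: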
    using System;
    using System.Collections.Generic;
    using System.Text;

    /// <summary>
    /// This code implements an encoding scheme that uses 85 printable characters
    /// to encode the same volume of information as contained in a Guid.
    ///
    /// Ascii-85 can represent 4 binary bytes as 5 Ascii bytes. So a 16 byte Guid can be
    /// represented in 20 Ascii bytes. A Guid can have 2^128
    /// (about 3.4028236692093846346e+38) discrete values whereas 20 characters of
    /// Ascii-85 can have 3.8759531084514355873123178482056e+38 discrete values.
    ///
    /// Lower-case characters are not included in this encoding to avoid collation
    /// issues.
    /// This is a departure from standard Ascii-85 which does include lower case
    /// characters.
    /// In addition, no whitespace characters are included as these may be truncated in
    /// the database depending on the storage mechanism - ie VARCHAR vs CHAR.
    /// </summary>
    internal static class Ascii85
    {
        /// <summary>
        /// 85 printable characters with no lower case ones, so database
        /// collation can't bite us. No ' ' character either so database can't
        /// truncate it!
        /// Unfortunately, these limitations mean resorting to some strange
        /// characters like 'Æ' but we won't ever have to type these, so it's ok.
        /// </summary>
        private static readonly char[] kEncodeMap = new[]
        {
            '0','1','2','3','4','5','6','7','8','9',  // 10
            'A','B','C','D','E','F','G','H','I','J',  // 20
            'K','L','M','N','O','P','Q','R','S','T',  // 30
            'U','V','W','X','Y','Z','|','}','~','{',  // 40
            '!','"','#','$','%','&','\'','(',')','`', // 50
            '*','+',',','-','.','/','[','\\',']','^', // 60
            ':',';','<','=','>','?','@','_','¼','½',  // 70
            '¾','ß','Ç','Ð','€','«','»','¿','•','Ø',  // 80
            '£','†','‡','§','¥'                       // 85
        };

        /// <summary>
        /// A reverse mapping of the <see cref="kEncodeMap"/> array for decoding
        /// purposes.
        /// </summary>
        private static readonly IDictionary<char, byte> kDecodeMap;

        /// <summary>
        /// Initialises the <see cref="kDecodeMap"/>.
        /// </summary>
        static Ascii85()
        {
            kDecodeMap = new Dictionary<char, byte>();
            for (byte i = 0; i < kEncodeMap.Length; i++)
            {
                kDecodeMap.Add(kEncodeMap[i], i);
            }
        }

        /// <summary>
        /// Decodes an Ascii-85 encoded Guid.
        /// </summary>
        /// <param name="ascii85Encoding">The Guid encoded using Ascii-85.</param>
        /// <returns>A Guid decoded from the parameter.</returns>
        public static Guid Decode(string ascii85Encoding)
        {
            // Ascii-85 can encode 4 bytes of binary data into 5 bytes of Ascii.
            // Since a Guid is 16 bytes long, the Ascii-85 encoding should be 20
            // characters long.
            if (ascii85Encoding.Length != 20)
            {
                throw new ArgumentException(
                    "An encoded Guid should be 20 characters long.",
                    "ascii85Encoding");
            }

            // We only support upper case characters. The invariant culture is used
            // so the result does not depend on the current thread's culture.
            ascii85Encoding = ascii85Encoding.ToUpperInvariant();

            // Split the string in half and decode each substring separately.
            var higher = ascii85Encoding.Substring(0, 10).AsciiDecode();
            var lower = ascii85Encoding.Substring(10, 10).AsciiDecode();

            // Convert the decoded substrings into an array of 16-bytes.
            var byteArray = new[]
            {
                (byte)((higher & 0xFF00000000000000) >> 56),
                (byte)((higher & 0x00FF000000000000) >> 48),
                (byte)((higher & 0x0000FF0000000000) >> 40),
                (byte)((higher & 0x000000FF00000000) >> 32),
                (byte)((higher & 0x00000000FF000000) >> 24),
                (byte)((higher & 0x0000000000FF0000) >> 16),
                (byte)((higher & 0x000000000000FF00) >> 8),
                (byte)((higher & 0x00000000000000FF)),
                (byte)((lower & 0xFF00000000000000) >> 56),
                (byte)((lower & 0x00FF000000000000) >> 48),
                (byte)((lower & 0x0000FF0000000000) >> 40),
                (byte)((lower & 0x000000FF00000000) >> 32),
                (byte)((lower & 0x00000000FF000000) >> 24),
                (byte)((lower & 0x0000000000FF0000) >> 16),
                (byte)((lower & 0x000000000000FF00) >> 8),
                (byte)((lower & 0x00000000000000FF)),
            };
            return new Guid(byteArray);
        }

        /// <summary>
        /// Encodes binary data into a plaintext Ascii-85 format string.
        /// </summary>
        /// <param name="guid">The Guid to encode.</param>
        /// <returns>Ascii-85 encoded string</returns>
        public static string Encode(Guid guid)
        {
            // Convert the 128-bit Guid into two 64-bit parts.
            var byteArray = guid.ToByteArray();
            var higher =
                ((UInt64)byteArray[0] << 56) | ((UInt64)byteArray[1] << 48) |
                ((UInt64)byteArray[2] << 40) | ((UInt64)byteArray[3] << 32) |
                ((UInt64)byteArray[4] << 24) | ((UInt64)byteArray[5] << 16) |
                ((UInt64)byteArray[6] << 8) | byteArray[7];
            var lower =
                ((UInt64)byteArray[ 8] << 56) | ((UInt64)byteArray[ 9] << 48) |
                ((UInt64)byteArray[10] << 40) | ((UInt64)byteArray[11] << 32) |
                ((UInt64)byteArray[12] << 24) | ((UInt64)byteArray[13] << 16) |
                ((UInt64)byteArray[14] << 8) | byteArray[15];

            var encodedStringBuilder = new StringBuilder();

            // Encode each part into an ascii-85 encoded string.
            encodedStringBuilder.AsciiEncode(higher);
            encodedStringBuilder.AsciiEncode(lower);
            return encodedStringBuilder.ToString();
        }

        /// <summary>
        /// Encodes the given integer using Ascii-85.
        /// </summary>
        /// <param name="encodedStringBuilder">The <see cref="StringBuilder"/> to
        /// append the results to.</param>
        /// <param name="part">The integer to encode.</param>
        private static void AsciiEncode(
            this StringBuilder encodedStringBuilder, UInt64 part)
        {
            // Nb, the most significant digits in our encoded character will
            // be the right-most characters.
            var charCount = (UInt32)kEncodeMap.Length;

            // Ascii-85 can encode 4 bytes of binary data into 5 bytes of Ascii.
            // Since a UInt64 is 8 bytes long, the Ascii-85 encoding should be
            // 10 characters long.
            for (var i = 0; i < 10; i++)
            {
                // Get the remainder when dividing by the base.
                var remainder = part % charCount;
                // Divide by the base.
                part /= charCount;
                // Add the appropriate character for the current value (0-84).
                encodedStringBuilder.Append(kEncodeMap[remainder]);
            }
        }

        /// <summary>
        /// Decodes the given string from Ascii-85 to an integer.
        /// </summary>
        /// <param name="ascii85EncodedString">Decodes a 10 character Ascii-85
        /// encoded string.</param>
        /// <returns>The integer representation of the parameter.</returns>
        private static UInt64 AsciiDecode(this string ascii85EncodedString)
        {
            if (ascii85EncodedString.Length != 10)
            {
                throw new ArgumentException(
                    "An Ascii-85 encoded Uint64 should be 10 characters long.",
                    "ascii85EncodedString");
            }

            // Nb, the most significant digits in our encoded character
            // will be the right-most characters.
            var charCount = (UInt32)kEncodeMap.Length;
            UInt64 result = 0;

            // Starting with the right-most (most-significant) character,
            // iterate through the encoded string and decode.
            for (var i = ascii85EncodedString.Length - 1; i >= 0; i--)
            {
                // Multiply the current decoded value by the base.
                result *= charCount;
                // Add the integer value for that encoded character.
                result += kDecodeMap[ascii85EncodedString[i]];
            }
            return result;
        }
    }
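The class above hard-codes its 85-character map, several entries of which lie outside 7-bit ASCII. To experiment with a different character set, a base conversion parameterised by the alphabet is enough. The Python sketch below is an illustration, not a port of the C# code: it treats the Guid as a single 128-bit integer instead of two 64-bit halves, and the names `encode`, `decode` and `ASCII_NO_LOWER` are made up for the example.

```python
import string
import uuid


def encode(value: int, alphabet: str, length: int) -> str:
    """Write `value` as `length` digits in base len(alphabet), least significant first."""
    base = len(alphabet)
    if not 0 <= value < base ** length:
        raise ValueError("value does not fit in the requested length")
    digits = []
    for _ in range(length):
        value, remainder = divmod(value, base)
        digits.append(alphabet[remainder])
    return "".join(digits)


def decode(text: str, alphabet: str) -> int:
    """Inverse of encode(): Horner evaluation starting at the most significant digit."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    value = 0
    for ch in reversed(text):
        value = value * len(alphabet) + index[ch]
    return value


# Hypothetical ASCII-only alphabet: digits, upper-case letters and punctuation,
# with no space and no lower-case letters (10 + 26 + 32 = 68 symbols).
ASCII_NO_LOWER = string.digits + string.ascii_uppercase + string.punctuation

if __name__ == "__main__":
    guid = uuid.uuid4()
    text = encode(guid.int, ASCII_NO_LOWER, 22)
    assert decode(text, ASCII_NO_LOWER) == guid.int
    print(text)
```

With only those 68 ASCII symbols, 21 characters fall just short of 2^128, so 22 are needed. That may be why the answer reaches for extra symbols to get the alphabet up to 85 and the length down to 20.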
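Also, here are the unit tests. They aren't as thorough as I'd like, and I don't like the non-determinism where Guid.NewGuid() is used, but they should get you started: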
    using System;
    using JetBrains.Annotations;                        // provides [UsedImplicitly]
    using Microsoft.VisualStudio.TestTools.UnitTesting; // MSTest attributes and Assert

    /// <summary>
    /// Tests to verify that the Ascii-85 encoding is functioning as expected.
    /// </summary>
    [TestClass]
    [UsedImplicitly]
    public class Ascii85Tests
    {
        [TestMethod]
        [Description("Ensure that the Ascii-85 encoding is correct.")]
        [UsedImplicitly]
        public void CanEncodeAndDecodeAGuidUsingAscii85()
        {
            var guidStrings = new[]
            {
                "00000000-0000-0000-0000-000000000000",
                "00000000-0000-0000-0000-0000000000FF",
                "00000000-0000-0000-0000-00000000FF00",
                "00000000-0000-0000-0000-000000FF0000",
                "00000000-0000-0000-0000-0000FF000000",
                "00000000-0000-0000-0000-00FF00000000",
                "00000000-0000-0000-0000-FF0000000000",
                "00000000-0000-0000-00FF-000000000000",
                "00000000-0000-0000-FF00-000000000000",
                "00000000-0000-00FF-0000-000000000000",
                "00000000-0000-FF00-0000-000000000000",
                "00000000-00FF-0000-0000-000000000000",
                "00000000-FF00-0000-0000-000000000000",
                "000000FF-0000-0000-0000-000000000000",
                "0000FF00-0000-0000-0000-000000000000",
                "00FF0000-0000-0000-0000-000000000000",
                "FF000000-0000-0000-0000-000000000000",
                "FF000000-0000-0000-0000-00000000FFFF",
                "00000000-0000-0000-0000-0000FFFF0000",
                "00000000-0000-0000-0000-FFFF00000000",
                "00000000-0000-0000-FFFF-000000000000",
                "00000000-0000-FFFF-0000-000000000000",
                "00000000-FFFF-0000-0000-000000000000",
                "0000FFFF-0000-0000-0000-000000000000",
                "FFFF0000-0000-0000-0000-000000000000",
                "00000000-0000-0000-0000-0000FFFFFFFF",
                "00000000-0000-0000-FFFF-FFFF00000000",
                "00000000-FFFF-FFFF-0000-000000000000",
                "FFFFFFFF-0000-0000-0000-000000000000",
                "00000000-0000-0000-FFFF-FFFFFFFFFFFF",
                "FFFFFFFF-FFFF-FFFF-0000-000000000000",
                "FFFFFFFF-FFFF-FFFF-FFFF-FFFFFFFFFFFF",
                "1000000F-100F-100F-100F-10000000000F"
            };

            foreach (var guidString in guidStrings)
            {
                var guid = new Guid(guidString);
                var encoded = Ascii85.Encode(guid);
                Assert.AreEqual(
                    20,
                    encoded.Length,
                    "A guid encoding should be exactly 20 characters long.");
                var decoded = Ascii85.Decode(encoded);
                Assert.AreEqual(
                    guid,
                    decoded,
                    "The guids are different after being encoded and decoded.");
            }
        }

        [TestMethod]
        [Description(
            "The Ascii-85 encoding is not susceptible to changes in character case.")]
        [UsedImplicitly]
        public void Ascii85IsCaseInsensitive()
        {
            const int kCount = 50;
            for (var i = 0; i < kCount; i++)
            {
                var guid = Guid.NewGuid();
                // The encoding should be all upper case. A reliance
                // on mixed case will make the generated string
                // vulnerable to sql collation.
                var encoded = Ascii85.Encode(guid);
                Assert.AreEqual(
                    encoded,
                    encoded.ToUpper(),
                    "The Ascii-85 encoding should produce only uppercase characters.");
            }
        }
    }
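I hope this saves somebody some trouble.
Also, if you find any bugs, let me know ;-)
Answer 1 (score: 19)
Use Base 85. See section 4.1, "Why 85?", of A Compact Representation of IPv6 Addresses (RFC 1924).
An IPv6 address, like a GUID, is 128 bits long; the RFC describes it as eight 16-bit pieces.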
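For readers who just want to try a base-85 encoding, Python's standard library ships one as `base64.b85encode`. The sketch below is an illustration under two caveats: that function encodes each 4-byte group into 5 characters rather than converting the whole 128-bit number at once as the RFC describes, so its output should not be expected to match RFC 1924 strings, and its alphabet contains lower-case letters, so it would not satisfy the collation constraint from Answer 0. The wrapper names are made up.

```python
import base64
import uuid


def guid_to_b85(guid: uuid.UUID) -> str:
    """Encode the 16 raw bytes of a UUID as 20 base-85 characters."""
    return base64.b85encode(guid.bytes).decode("ascii")


def b85_to_guid(text: str) -> uuid.UUID:
    """Decode a 20-character base-85 string back into a UUID."""
    return uuid.UUID(bytes=base64.b85decode(text))


if __name__ == "__main__":
    guid = uuid.uuid4()
    text = guid_to_b85(guid)
    print(guid, "->", text, f"({len(text)} characters)")
    assert b85_to_guid(text) == guid
```

Sixteen bytes are exactly four 4-byte groups, so no padding is involved and the result is always 20 characters.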
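Answer 2 (score: 14)
You have 95 characters available, so more than 6 bits per character but not quite 7 (about 6.57, actually). You need 128/log2(95) = about 19.48 characters, which means encoding into 20 characters. If saving 2 characters in the encoded form is worth the loss of readability to you, use something like this (pseudocode):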
    char encoded[21];
    long long guid;  // stands in for a 128-bit unsigned integer (a real long long is too small)
    for(int i=0; i<20; ++i) {
        encoded[i] = chr(guid % 95 + 33);  // least significant base-95 digit first
        guid /= 95;
    }
    encoded[20] = chr(0);
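This is basically the generic "encode a number in some base" code, except that there is no need to reverse the "digits", since the order is arbitrary anyway (and little-endian is more direct and natural). Getting the guid back from the encoded string is, in a very similar way, a polynomial evaluation in base 95 (after subtracting 33 from each digit, of course), starting from the most significant digit, which is the last character: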
    guid = 0;
    for(int i=19; i>=0; --i) {  // encoded[19] is the most significant digit
        guid *= 95;
        guid += ord(encoded[i]) - 33;
    }
This is essentially Horner's method for polynomial evaluation.
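The pseudocode above needs a 128-bit integer type, which Python integers provide for free. Here is a runnable sketch of the same scheme; the function names are made up, and the offset of 33 is taken from the pseudocode.

```python
import uuid

BASE = 95
OFFSET = 33   # first character code, as in the pseudocode
LENGTH = 20


def encode95(guid: uuid.UUID) -> str:
    """Base-95 digits of the 128-bit value, least significant digit first."""
    value = guid.int
    chars = []
    for _ in range(LENGTH):
        value, digit = divmod(value, BASE)
        chars.append(chr(digit + OFFSET))
    return "".join(chars)


def decode95(text: str) -> uuid.UUID:
    """Horner evaluation in base 95, starting at the most significant (last) digit."""
    if len(text) != LENGTH:
        raise ValueError(f"expected {LENGTH} characters")
    value = 0
    for ch in reversed(text):
        digit = ord(ch) - OFFSET
        if not 0 <= digit < BASE:
            raise ValueError(f"character out of range: {ch!r}")
        value = value * BASE + digit
    return uuid.UUID(int=value)


if __name__ == "__main__":
    guid = uuid.uuid4()
    text = encode95(guid)
    assert decode95(text) == guid
    print(repr(text))
```

One thing to watch: with an offset of 33 the largest digit maps to code 127, which is the non-printable DEL character. Setting `OFFSET = 32` instead uses the printable range from space to `~`; either way the result is 20 characters.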
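Answer 3 (score: 4)
Simply go Base64.
Answer 4 (score: 3)
Using the full range from 33 (what's wrong with space, incidentally?) to 127 gives you 95 possible characters. Expressing the 2^128 possible values of a guid in base 95 will use 20 characters. This (modulo things like dropping nybbles that will be constant) is the best you can do. Save yourself the trouble - use base 64.
Answer 5 (score: 0)
Assuming that all your GUIDs are generated by the same algorithm, you can save 4 bits by not encoding the algorithm nibble, before applying any other encoding :-|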
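To make the nibble idea concrete, here is a minimal Python sketch. It assumes RFC 4122 style UUIDs, where the version ("algorithm") nibble occupies bits 76-79 of the 128-bit value, and it assumes every identifier in the system shares one known version; the function names are made up.

```python
import uuid

VERSION_SHIFT = 76                    # the version nibble occupies bits 76-79
LOW_MASK = (1 << VERSION_SHIFT) - 1   # the 76 bits below the nibble


def strip_version(guid: uuid.UUID) -> int:
    """Drop the 4 version bits, leaving a 124-bit integer."""
    high = guid.int >> (VERSION_SHIFT + 4)   # the 48 bits above the nibble
    low = guid.int & LOW_MASK
    return (high << VERSION_SHIFT) | low


def restore_version(packed: int, version: int) -> uuid.UUID:
    """Re-insert a known version nibble into a 124-bit packed value."""
    high = packed >> VERSION_SHIFT
    low = packed & LOW_MASK
    return uuid.UUID(int=(high << (VERSION_SHIFT + 4)) | (version << VERSION_SHIFT) | low)


if __name__ == "__main__":
    guid = uuid.uuid4()
    packed = strip_version(guid)
    assert packed < 2 ** 124
    assert restore_version(packed, 4) == guid
```

Whether the 4 bits shorten the string depends on the base: 124 bits fit in 19 characters of base 95, but still need 20 in base 85 and 21 in base 64.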
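Answer 6 (score: 0)
An arbitrary GUID? The "naive" algorithm will produce optimal results. The only way to compress a GUID further is to make use of patterns in the data, which your "arbitrary" constraint rules out.
Answer 7 (score: 0)
I agree with the Base64 approach. It cuts a 32-character hex UUID down to 22 Base64 characters.
Here are simple hex <-> Base64 conversion functions for PHP: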
    function hex_to_base64($hex){
        $return = '';
        foreach(str_split($hex, 2) as $pair){
            $return .= chr(hexdec($pair));
        }
        // Remove the trailing = signs; they are not needed for decoding in PHP.
        return preg_replace("/=+$/", "", base64_encode($return));
    }

    function base64_to_hex($base64) {
        $return = '';
        foreach (str_split(base64_decode($base64), 1) as $char) {
            $return .= str_pad(dechex(ord($char)), 2, "0", STR_PAD_LEFT);
        }
        return $return;
    }
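For comparison, the same 22-character Base64 form can be sketched in Python with the standard `base64` and `uuid` modules. The choice of the URL-safe alphabet (`-` and `_` instead of `+` and `/`) is an assumption made for this example; the PHP functions above use the standard alphabet, so the two are not interchangeable as written.

```python
import base64
import uuid


def guid_to_base64(guid: uuid.UUID) -> str:
    """16 bytes encode to 24 Base64 characters, the last two being '=' padding."""
    return base64.urlsafe_b64encode(guid.bytes).rstrip(b"=").decode("ascii")


def base64_to_guid(text: str) -> uuid.UUID:
    """Restore the two stripped padding characters before decoding."""
    return uuid.UUID(bytes=base64.urlsafe_b64decode(text + "=="))


if __name__ == "__main__":
    guid = uuid.uuid4()
    text = guid_to_base64(guid)
    print(guid, "->", text, f"({len(text)} characters)")
    assert base64_to_guid(text) == guid
```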