Question

I need to build a Python encoder so that I can reformat strings like this:

import codecs
codecs.encode("Random  UTF-8 String ☑⚠⚡", 'name_of_my_encoder')

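The only reason I am even asking this on Stack Overflow is that the encoded string needs to pass this validation function. This is a hard constraint with no flexibility, because of how the string has to be stored.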

from string import ascii_letters
from string import digits

valid_characters = set(ascii_letters + digits + '_')  # str + str; adding a list would raise TypeError

def validation_function(characters):
    for char in characters:
        if char not in valid_characters:
            raise Exception

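Writing the encoder seems easy, but I am not sure whether this particular encoder makes building the decoder harder. Here is the encoder I have written.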

from codecs import encode
from string import ascii_letters
from string import digits

ALPHANUMERIC_SET = set(ascii_letters + digits)

def underscore_encode(chars_in):
    chars_out = list()
    for char in chars_in:
        if char not in ALPHANUMERIC_SET:
            chars_out.append('_{}_'.format(encode(char.encode(), 'hex').decode('ascii')))
        else:
            chars_out.append(char)
    return ''.join(chars_out)

That is the encoder I wrote. I am including it only as an example; there is probably a better way to do this.

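Edit 1 - Someone wisely suggested using base32 on the whole string, which I can definitely use. However, it would be preferable to have something with some readability, so an escaping scheme like https://en.wikipedia.org/wiki/Quoted-printable or https://en.wikipedia.org/wiki/Percent-encoding is preferred.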
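For comparison, the whole-string base32 route mentioned in this edit could look like the sketch below. It assumes the standard `base64` module; the helper names `base32_encode` and `base32_decode` are illustrative, and the `=` padding is stripped because the validation function would reject it.

```python
from base64 import b32decode, b32encode


def base32_encode(text):
    # base32 uses only A-Z and 2-7, plus '=' padding, which is stripped here
    return b32encode(text.encode('utf-8')).decode('ascii').rstrip('=')


def base32_decode(encoded):
    # restore the padding so the length is a multiple of 8 again
    padding = '=' * (-len(encoded) % 8)
    return b32decode(encoded + padding).decode('utf-8')


encoded = base32_encode('Random UTF-8 String ☑⚠⚡')
print(encoded)
print(base32_decode(encoded))
```

The result passes the validation function, but none of the original text stays readable, which is the trade-off this edit describes.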

Edit 2 - Proposed solutions must work on Python 3.4 or newer; working on Python 2.7 as well is nice but not required. I have added the python-3.x tag to help clarify this.

Answer 1

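This seems to do the trick. Basically, alphanumeric characters are left alone. Any non-alphanumeric character in the ASCII set is encoded as a \xXX escape code. All other Unicode characters are encoded using \uXXXX escape codes (the codec writes \xXX for code points up to U+00FF and \UXXXXXXXX for code points above U+FFFF). However, you have said you cannot use \, but you can use _, so all escape sequences are converted to start with _ instead. This makes decoding very simple: just replace _ with \ and then use the unicode-escape codec. Encoding is slightly harder, because the unicode-escape codec leaves ASCII characters alone. So first you have to escape the relevant ASCII characters, then run the string through the unicode-escape codec, and finally convert every \ to _.

Code: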

from string import ascii_letters, digits

# non-translating characters
ALPHANUMERIC_SET = set(ascii_letters + digits)    
# mapping all bytes to themselves, except '_' maps to '\'
ESCAPE_CHAR_DECODE_TABLE = bytes(bytearray(range(256)).replace(b"_", b"\\"))
# reverse mapping -- maps `\` back to `_`
ESCAPE_CHAR_ENCODE_TABLE = bytes(bytearray(range(256)).replace(b"\\", b"_"))
# encoding table for ASCII characters not in ALPHANUMERIC_SET
# two hex digits are required: "\xa" is not a valid escape, "\x0a" is
ASCII_ENCODE_TABLE = {i: u"_x{:02x}".format(i) for i in set(range(128)) ^ set(map(ord, ALPHANUMERIC_SET))}



def encode(s):
    s = s.translate(ASCII_ENCODE_TABLE) # translate ascii chars not in your set
    bytes_ = s.encode("unicode-escape")
    bytes_ = bytes_.translate(ESCAPE_CHAR_ENCODE_TABLE)
    return bytes_

def decode(s):
    s = s.translate(ESCAPE_CHAR_DECODE_TABLE)
    return s.decode("unicode-escape")

s = u"Random UTF-8 String ☑⚠⚡"
#s = '北亰'
print(s)
b = encode(s)
print(b)
new_s = decode(b)
print(new_s)

Which outputs:

Random UTF-8 String ☑⚠⚡
b'Random_x20UTF_x2d8_x20String_x20_u2611_u26a0_u26a1'
Random UTF-8 String ☑⚠⚡

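This works on both Python 3.4 and Python 2.7, which is why the ESCAPE_CHAR_{DE,EN}CODE_TABLE definitions are a little messy: bytes is an alias of str on Python 2.7, and it behaves differently from bytes on Python 3.4. That is why the tables are built using bytearray. On Python 2.7, the encode function expects a unicode object, not a str.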
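A quick way to gain confidence in this scheme is a round-trip check on awkward inputs such as `_`, `\`, a newline, and characters outside the BMP. The sketch below is a Python 3 only simplification (it uses `bytes.replace` instead of the translation tables), and the `SAMPLES` list is an assumption chosen for illustration.

```python
from string import ascii_letters, digits

ALPHANUMERIC_SET = set(ascii_letters + digits)
VALID_BYTES = set((ascii_letters + digits + '_').encode('ascii'))

# Every ASCII code point outside the alphanumeric set maps to "_xNN"
# (two hex digits, so that "\xNN" is a complete escape when decoding).
ASCII_ENCODE_TABLE = {
    i: '_x{:02x}'.format(i) for i in range(128) if chr(i) not in ALPHANUMERIC_SET
}


def encode(s):
    s = s.translate(ASCII_ENCODE_TABLE)
    # after translate() no literal backslash is left, so every "\" is an escape
    return s.encode('unicode-escape').replace(b'\\', b'_')


def decode(b):
    return b.replace(b'_', b'\\').decode('unicode-escape')


# hypothetical test inputs: underscore, backslash, newline, Latin-1, non-BMP
SAMPLES = ['Random UTF-8 String ☑⚠⚡', 'a_b\\c\n', 'é\U0001F40D', '']

for sample in SAMPLES:
    encoded = encode(sample)
    assert set(encoded) <= VALID_BYTES  # iterating bytes yields ints on Python 3
    assert decode(encoded) == sample
```

Because the literal `_` and `\` are themselves escaped (as `_x5f` and `_x5c`), every underscore in the output marks the start of an escape, which is what keeps decoding unambiguous.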

Answer 2

You can abuse URL quoting to get a format that is readable, easy to decode in other languages, and passes your validation function. Encoding and then decoding the sample string gives this output:

Random_20_F0_9F_90_8D_20UTF_2d8_20String_20_E2_98_91_E2_9A_A0_E2_9A_A1
Random 🐍 UTF-8 String ☑⚠⚡

Here is an implementation using bytearray() (to move it to C later if necessary):

#!/usr/bin/env python3.5
from string import ascii_letters, digits

def alnum_encode(text, alnum=bytearray(ascii_letters+digits, 'ascii')):
    result = bytearray()
    for byte in bytearray(text, 'utf-8'):
        if byte in alnum:
            result.append(byte)
        else:
            result += b'_%02x' % byte
    return result.decode('ascii')
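The URL-quoting idea from the start of this answer can be sketched as follows. This is a minimal version assuming Python 3's `urllib.parse.quote` and `unquote`; the names `alnum_encode` and `alnum_decode` and the exact replacement steps are illustrative, not necessarily the answer's own code.

```python
from urllib.parse import quote, unquote


def alnum_encode(text):
    # quote() with safe='' percent-encodes everything except letters,
    # digits and a few unreserved punctuation characters
    quoted = quote(text, safe='')
    # escape the unreserved punctuation by hand so only alphanumerics remain
    for ch in '-._~':
        quoted = quoted.replace(ch, '%{:02x}'.format(ord(ch)))
    # finally swap the escape character for one the validator accepts
    return quoted.replace('%', '_')


def alnum_decode(encoded):
    return unquote(encoded.replace('_', '%'), errors='strict')


s = alnum_encode('Random \U0001F40D UTF-8 String ☑⚠⚡')
print(s)
print(alnum_decode(s))
```

Since each escape is just `%XX` with a different prefix character, any language with a percent-decoding routine can decode the result after replacing `_` with `%`.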


Answer 3

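Despite several good answers, I ended up with a solution that I find cleaner and easier to understand. So I am posting the code of my final solution to answer my own question.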

from string import ascii_letters
from string import digits
from base64 import b16decode
from base64 import b16encode


ALPHANUMERIC_SET = set(ascii_letters + digits)


def utf8_string_to_hex_string(s):
    return ''.join(chr(i) for i in b16encode(s.encode('utf-8')))


def hex_string_to_utf8_string(s):
    return b16decode(bytes(list((ord(i) for i in s)))).decode('utf-8')


def underscore_encode(chars_in):
    chars_out = list()
    for char in chars_in:
        if char not in ALPHANUMERIC_SET:
            chars_out.append('_{}_'.format(utf8_string_to_hex_string(char)))
        else:
            chars_out.append(char)
    return ''.join(chars_out)


def underscore_decode(chars_in):
    chars_out = list()
    decoding = False
    for char in chars_in:
        if char == '_':
            if not decoding:
                hex_chars = list()
                decoding = True
            elif decoding:
                decoding = False
                chars_out.append(hex_string_to_utf8_string(hex_chars))
        else:
            if not decoding:
                chars_out.append(char)
            elif decoding:
                hex_chars.append(char)
    return ''.join(chars_out)
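To call such an encoder through `codecs.encode(..., 'name_of_my_encoder')` as in the question, the functions can be registered as a custom codec. The sketch below is one possible way, assuming Python 3: the codec name `underscore_hex` and the helper names are hypothetical, and the encoder is a compact lower-case-hex variant of the `_XX_` scheme rather than the exact code above.

```python
import codecs
from binascii import hexlify
from string import ascii_letters, digits

ALPHANUMERIC_SET = set(ascii_letters + digits)
CODEC_NAME = 'underscore_hex'  # hypothetical name; any lower-case name works


def _encode(text, errors='strict'):
    # codec functions return (output, number_of_input_items_consumed)
    out = []
    for char in text:
        if char in ALPHANUMERIC_SET:
            out.append(char)
        else:
            out.append('_' + hexlify(char.encode('utf-8')).decode('ascii') + '_')
    return ''.join(out), len(text)


def _decode(text, errors='strict'):
    # splitting on '_' alternates literal runs (even index) and hex runs (odd index)
    parts = text.split('_')
    if len(parts) % 2 == 0:
        raise ValueError('unbalanced underscore in encoded text')
    out = []
    for index, part in enumerate(parts):
        out.append(bytes.fromhex(part).decode('utf-8') if index % 2 else part)
    return ''.join(out), len(text)


def _search(name):
    # called by the codecs machinery with the normalized codec name
    if name == CODEC_NAME:
        return codecs.CodecInfo(_encode, _decode, name=CODEC_NAME)
    return None


codecs.register(_search)

encoded = codecs.encode('Random UTF-8 String ☑⚠⚡', CODEC_NAME)
decoded = codecs.decode(encoded, CODEC_NAME)
print(encoded)
print(decoded)
```

Note that this is a str-to-str codec, so it is reachable through `codecs.encode` and `codecs.decode` but not through `str.encode`, which only accepts codecs that produce bytes.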

How to encode a UTF-8 string using only "A-Z", "a-z", "0-9", and "_" in Python
