我目前正在处理一个问题,但似乎似乎找不到合适的解决方案。也许你们可以帮我,谢谢!
BASE64
进行编码,该值在ùtf-8
之前)
requests.get(url, stream=True
iter_content(chunk_size=1020, decode_unicode=False)
chunk%4!=0
Base64
块
base64.b64decode(lines_prepared_after_stream).decode('utf-8')
utf-8
,以至于我无法正确地将其写入文件。我的想法出了错的任何想法?
示例数据(我的情况是该数据,但只是txt文件大小的倍数): JSON base64:
{
"blob":"bmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXM="
}
TXT文件:
nam
tom
sarah
tim
nim
bim
sven
monika
chris
bla
blub
bom
sdfsdfsddasdasdas
def stream_decode_write(document_ids, client: requests.Session, chunk_size=1024):
f = open(os.path.join(constants.DEFAULT_DATA_UPLOAD_DIR, document_id),
"wb")
r = client.get(document_url, stream=True, verify=False)
for lines in r.iter_content(chunk_size, decode_unicode=True):
# filter keep-alive
if lines:
lines = replace_json_input(lines)
missing_padding = len(lines) % 4
if missing_padding:
lines += '=' * (4 - missing_padding)
decoded = base64.b64decode(lines.strip()).decode('utf-8)
f.write(decoded)
我用来编码的代码
def encode_to_base64(file_to_encode) -> str:
byte_coding = file_to_encode.encode()
data = base64.b64encode(byte_coding)
return data.decode('ascii')
def read_from_file(file_path):
f = open(file_path, "r", encoding='utf-8')
return f.read()
我猜想某些块需要一些字节作为前缀/之后才是正确的utf-8
。也许我需要某种方式,但是我不确定。
还是Iam将信息划分到错误的地方,而这些地方实际上是需要信息连续的。
我调查了这个:Validate that a stream of bytes is valid UTF-8 (or other encoding) without copy 没有成功。
可能有用: 我确实有关于文件编码的能力。 目前,我正在将文件加载到内存中,并对其进行编码,放入utf8,然后根据json发送到上游。
基本编码:
"aGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRh"
无法解码的部分:
ib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRh
即使我添加了填充
日志输出:
2019-04-03 10:51:26,549 - chunk_size: 1024
2019-04-03 10:51:26,549 - before replace: {"blob":"aGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJ
2019-04-03 10:51:26,549 - replaced chunk : aGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJ
2019-04-03 10:51:26,549 - chunk_size after change: 1015
答案 0 :(得分:1)
保留base64解码数据二进制,因为这些块可能会在多字节UTF-8序列之间中断。
这是一个人为的例子。原始数据经过UTF-8分解,并作为'blob'
键的base64编码值放入字典中。请注意,以UTF-8编码时,使用的中文输入为每个字符3个字节,因此块大小5将在块之间将一个字符的字节分解。然后读取JSON数据,对Blob进行解码,并将二进制块写入文件中。完成后,文件将解码为UTF-8。
#!coding=utf8
import json
import base64
def test_input():
'''Test function that returns JSON base64 blobs from
encoded from five bytes at a time.
'''
chunk = 5
data = '我去过上海。'.encode('utf8')
for i in range(0,len(data),chunk):
yield json.dumps({'blob':base64.b64encode(data[i:i+chunk]).decode('ascii')})
with open('output.txt','wb') as f:
for chunk in test_input():
print(chunk)
# decode and add each binary result to a file.
parsed = json.loads(chunk)
decoded = base64.b64decode(parsed['blob'])
f.write(decoded)
# Read in the resulting file as UTF-8 and display
with open('output.txt',encoding='utf8') as f:
print(f.read())
输出:
{"blob": "5oiR5Y4="}
{"blob": "u+i/h+Q="}
{"blob": "uIrmtbc="}
{"blob": "44CC"}
我去过上海。
答案 1 :(得分:0)
因为我正在流式传输,所以不能确保收到的每个字节都是正确的多字节Unicode。因此可能发生解码失败。正如@mark已经提到的那样。考虑到这一点,我研究了另一个解决方案。
看起来,解码器确保接收到的数据是否为多字节Unicode,如果块不完整,它将失败。
data = '我去过上海。'
让我们说这是5个字节,但我只收到4个字节。
decoded = base64.b64decode(data). Decode('utf-8')
将失败。因此,我可能需要下一个块中的一个字节。我可以将其包装在一个except块中,然后重试。 如果它不是多字节的,它可以解码它,那么该贪婪算法应该有望工作。如果它是多字节的,它将失败并贪婪直到它可以对其进行解码。
prev_chunk.append(current_chunk.pop(0))
我需要重复执行此操作,直到接收到正确的已解码UTF-8或我的prev_chunk
和current_chunk
导致按字节合并为零为止。