Question

tf.gfile.GFile() does not accept an 'encoding' argument. From here I gathered that gfile only returns a byte stream, but that now seems to have changed:

with tf.gfile.GFile("./data/squad/test1.txt", mode = "rb") as file1:
    print(file1.read(n = 2), type(file1.read(n = 2)))
with tf.gfile.GFile("./data/squad/test1.txt", mode = "r") as file1:
    print(file1.read(n = 2), type(file1.read(n = 2)))

Output:

b'as' <class 'bytes'>
as <class 'str'>

So what encoding does it actually use when reading these strings? Is it UTF-8, or is it platform-dependent, like Python's built-in open()?

Answer 1

As far as I understand the implementation, tf.io.gfile.GFile always uses UTF-8: https://github.com/tensorflow/tensorflow/blob/b3376f73ccfd6ae8721a946daf064675ee19b427/tensorflow/python/lib/io/file_io.py#L100

def write(self, file_content):
  """Writes file_content to the file. Appends to the end of the file."""
  self._prewrite_check()
  self._writable_file.append(compat.as_bytes(file_content))

It uses tf.compat.as_bytes to convert str to bytes, which encodes the text as UTF-8 by default.
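To make the contrast with Python's built-in open() concrete, here is a minimal sketch that uses only the standard library, so it does not exercise GFile itself. It writes UTF-8 bytes to a temporary file and decodes them explicitly, which does not depend on the platform, and then prints the locale default that open() would usually fall back to when no encoding is given. The helper name read_text_utf8 and the sample text are made up for illustration.

```python
import locale
import os
import tempfile


def read_text_utf8(path):
    """Read a file as raw bytes and decode them explicitly as UTF-8."""
    with open(path, "rb") as f:
        return f.read().decode("utf-8")


sample = "naïve café"  # hypothetical sample text with non-ASCII characters

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    # Write the bytes by hand so the on-disk encoding is known to be UTF-8.
    with open(path, "wb") as f:
        f.write(sample.encode("utf-8"))
    decoded = read_text_utf8(path)

# Explicit UTF-8 decoding round-trips regardless of the platform.
print(decoded == sample)

# The locale default that open() usually uses when no encoding is passed.
print(locale.getpreferredencoding(False))
```

If GFile behaves as described in this answer, reading the same file in "r" mode should give the same string as read_text_utf8, whatever the second print shows on your machine.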

Answer 2

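It is just a byte stream, so you can decide for yourself what the byte encoding of the text is.

You can use a library to detect the encoding and then use the result to decode the bytes. As of today (June 2020), one of the best Python encoding-detection libraries is chardet, from the brilliant people at Mozilla.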

pip install chardet

If you already know the encoding is 'utf-8', you can decode the bytes with that encoding directly.

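import chardet
import tensorflow as tf

# Read the raw bytes (same file as in the question, opened in "rb" mode).
with tf.gfile.GFile("./data/squad/test1.txt", mode = "rb") as file1:
    bstream = file1.read()

# detect() returns a dict that includes 'encoding' and 'confidence'.
info = chardet.detect(bstream)
enc = info['encoding']      # may be None if nothing could be detected
print(info['confidence'])   # how sure chardet is, between 0 and 1
text = bstream.decode(enc)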
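If you would rather not add a dependency, one alternative (not part of the answer above, just a sketch) is to try a short list of candidate encodings in order and keep the first one that decodes cleanly. The function name decode_with_fallback and the candidate list are assumptions chosen for illustration. Note that latin-1 accepts any byte sequence, so it only makes sense as the last resort.

```python
def decode_with_fallback(data, candidates=("utf-8", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes `data`.

    Raises ValueError if none of the candidates can decode the bytes.
    """
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            # This candidate does not fit the bytes; try the next one.
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Valid UTF-8 is decoded by the first candidate.
utf8_result = decode_with_fallback("é".encode("utf-8"))

# A lone 0xE9 byte is not valid UTF-8, so the latin-1 fallback is used.
latin1_result = decode_with_fallback(b"\xe9")

print(utf8_result, latin1_result)
```

Unlike chardet, this does not guess from byte statistics. It only reports which candidate succeeded first, so the order of the list encodes your own prior about the data.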

What encoding is used in tf.gfile.GFile()?
