I have a large dataset that throws a UnicodeDecodeError when I try to use it.
I have a big_file_format_x and want to reformat it into big_file_format_y:
with open(PATH, "r") as data:
    for index, line in enumerate(data.readlines()):
        # Formatting logic
        ...

with open(SAVE_PATH, "w") as new_data:
    new_data.write(formatted_data_string)
Then I split the formatted dataset into 3 datasets:
with open(PATH, "rb") as data:
    for line in data.readlines():
        # Splitting logic
        ...

with open('train_data.txt', 'wb') as file:
    file.write(train_data)
with open('vali_data.txt', 'wb') as file:
    file.write(vali_data)
with open('test_data.txt', 'wb') as file:
    file.write(test_data)
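The actual splitting logic is not shown in the question; a minimal self-contained sketch of a line-based split in binary mode, assuming a hypothetical 60/20/20 split by line count (the file name and ratios are placeholders):

```python
# Create a small sample file standing in for the formatted dataset.
with open("formatted.txt", "wb") as f:
    f.write(b"line1\nline2\nline3\nline4\nline5\n")

with open("formatted.txt", "rb") as data:
    lines = data.readlines()

# Hypothetical 60/20/20 split by line index.
n = len(lines)
train_data = b"".join(lines[: int(n * 0.6)])
vali_data = b"".join(lines[int(n * 0.6) : int(n * 0.8)])
test_data = b"".join(lines[int(n * 0.8) :])

with open("train_data.txt", "wb") as f:
    f.write(train_data)
with open("vali_data.txt", "wb") as f:
    f.write(vali_data)
with open("test_data.txt", "wb") as f:
    f.write(test_data)
```

Because the split is done in binary mode, the bytes pass through unchanged, so any encoding problem in the source file survives into the three output files.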
Now, when I try to process the resulting dataset, I get:
    kwargs[name] = annotation.from_params(params=subparams, **subextras)
  File "/home/public/s1234/.conda/envs/allennlp/lib/python3.6/site-packages/allennlp/common/from_params.py", line 274, in from_params
    return subclass.from_params(params=params, **extras)
  File "/home/public/s1234/.conda/envs/allennlp/lib/python3.6/site-packages/allennlp/modules/text_field_embedders/basic_text_field_embedder.py", line 136, in from_params
    for name, subparams in token_embedder_params.items()
  File "/home/public/s1234/.conda/envs/allennlp/lib/python3.6/site-packages/allennlp/modules/text_field_embedders/basic_text_field_embedder.py", line 136, in <dictcomp>
    for name, subparams in token_embedder_params.items()
  File "/home/public/s1234/.conda/envs/allennlp/lib/python3.6/site-packages/allennlp/common/from_params.py", line 274, in from_params
    return subclass.from_params(params=params, **extras)
  File "/home/public/s1234/.conda/envs/allennlp/lib/python3.6/site-packages/allennlp/modules/token_embedders/embedding.py", line 200, in from_params
    vocab_namespace)
  File "/home/public/s1234/.conda/envs/allennlp/lib/python3.6/site-packages/allennlp/modules/token_embedders/embedding.py", line 270, in _read_pretrained_embeddings_file
    vocab, namespace)
  File "/home/public/s1234/.conda/envs/allennlp/lib/python3.6/site-packages/allennlp/modules/token_embedders/embedding.py", line 293, in _read_embeddings_from_text_file
    with EmbeddingsTextFile(file_uri) as embeddings_file:
  File "/home/public/s1234/.conda/envs/allennlp/lib/python3.6/site-packages/allennlp/modules/token_embedders/embedding.py", line 450, in __init__
    first_line = next(self._handle)  # this moves the iterator forward
  File "/home/public/s1234/.conda/envs/allennlp/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xba in position 0: invalid start byte
How can I avoid this error?
Answer 0: (score: 1)
Specifying the encoding should solve the problem:
open(PATH, "r", encoding="UTF-8") # replace UTF-8 with the encoding of your file
You should also specify the encoding when saving the file; otherwise the environment's default encoding is used: python 3.0 open() default encoding
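A minimal round-trip sketch of the advice above. The file names are placeholders, and ISO-8859-1 is only an assumption about the source encoding (the traceback's 0xba byte is consistent with it); replace it with your file's actual encoding:

```python
# Create a sample file containing the byte 0xba, which is not valid UTF-8,
# to reproduce the situation from the question.
with open("sample.txt", "wb") as f:
    f.write(b"\xbaC temperature\n")

# Read with an explicit (assumed) source encoding instead of the default.
with open("sample.txt", "r", encoding="ISO-8859-1") as src:
    text = src.read()

# Re-save with an explicit UTF-8 encoding so downstream tools that expect
# UTF-8 (such as the embeddings reader in the traceback) can decode it.
with open("sample_utf8.txt", "w", encoding="UTF-8") as dst:
    dst.write(text)
```

After this round trip, every later step can rely on the file being UTF-8 instead of inheriting the platform default.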
Answer 1: (score: 1)
The file is probably in an encoding other than UTF-8. 0xba is the º character in the ISO-8859-1 encoding.
Try
data.encode("utf-8")
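A quick check of this diagnosis: the offending byte from the traceback fails to decode as UTF-8 but decodes cleanly under ISO-8859-1:

```python
raw = b"\xba"  # the byte reported in the UnicodeDecodeError

# Decoding as UTF-8 raises, because 0xba is not a valid UTF-8 start byte.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    print("utf-8 failed:", e.reason)  # prints "utf-8 failed: invalid start byte"

# Under ISO-8859-1 the same byte is the masculine ordinal indicator.
print(raw.decode("iso-8859-1"))  # prints "º"
```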