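I am trying to load and preprocess my text data files (which include some characters that are not necessarily ISO-8859-1 encoded) using the "data block API", as shown below.
After running the script below, I get this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 951: invalid continuation byte
Does anyone know how to fix this?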
data_lm = (TextList.from_folder(path)
           # Inputs: all the text files in path
           .filter_by_folder(include=['Jobs'])
           # We may have other temp folders that contain text files, so we only keep what's in Jobs
           .split_by_rand_pct(0.1)
           # We randomly split and keep 10% for validation
           .label_for_lm()
           # We want to do a language model, so we label accordingly
           .databunch(bs=bs))
data_lm.save('data_lm.pkl')
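One possible workaround, offered as a sketch rather than a confirmed fix: byte 0xe9 is "é" in ISO-8859-1, so at least one file is probably not stored as UTF-8, which is the codec named in the error. The helper below is hypothetical (`reencode_to_utf8` and its parameters are not part of fastai). It assumes that every file which fails to decode as UTF-8 is actually ISO-8859-1, and rewrites those files in place as UTF-8 so they can be read afterwards.

```python
from pathlib import Path


def reencode_to_utf8(folder, fallback="iso-8859-1", pattern="*.txt"):
    """Rewrite files that are not valid UTF-8 as UTF-8 and return their paths.

    Assumption: any file that fails to decode as UTF-8 is encoded in
    `fallback`. Files that are already valid UTF-8 are left untouched.
    """
    changed = []
    for file in sorted(Path(folder).rglob(pattern)):
        raw = file.read_bytes()
        try:
            raw.decode("utf-8")  # already valid UTF-8, nothing to do
        except UnicodeDecodeError:
            # Decode with the assumed legacy encoding, then store as UTF-8.
            file.write_bytes(raw.decode(fallback).encode("utf-8"))
            changed.append(file)
    return changed
```

It would be called once, before building the data bunch, for example `reencode_to_utf8(path)`. Since it overwrites files, it is safer to try it on a copy of the folder first. The `*.txt` pattern is also an assumption and may need adjusting to the actual file extensions.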