UnicodeDecodeError: invalid continuation byte when reading file names in Python

Time: 2018-05-15 11:41:29

Tags: python unicode utf-8 byte decode

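I have looked through the related questions already posted here and tried the solutions they suggest, but I still cannot get my code to work.

I am trying to use Python to read the files in a directory, but I keep running into the following error.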

  

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 907: invalid continuation byte
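
For illustration, the same message can be reproduced without any files. In UTF-8, the byte 0xe8 opens a three-byte sequence and must be followed by two continuation bytes in the range 0x80 to 0xBF; in a single-byte encoding such as Latin-1 the same byte is simply "è". The sketch below uses made-up sample bytes, and Latin-1 is only an assumption, since the actual encoding of the data is not stated in the question.

```python
# Hypothetical sample bytes, not taken from the question's data.
raw = b"caf\xe8 au lait"

try:
    raw.decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    # 0xe8 is followed by a space (0x20), which is not a continuation byte.
    error_message = str(exc)

print(error_message)

# Latin-1 maps every byte to a character, so this decode cannot fail.
decoded = raw.decode("latin-1")
print(decoded)
```

If the files were produced by a tool that writes Latin-1 or Windows-1252, this would be consistent with the error above, but that remains a guess until the source of the files is checked.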

Code

from __future__ import print_function

from os import listdir
from os.path import isfile, join

import nltk
import sklearn.datasets
from nltk.stem.snowball import SnowballStemmer

# shuffle expects a boolean; the string 'False' is truthy and would enable shuffling
dataset = sklearn.datasets.load_files('data/', shuffle=False)
categories = dataset.target_names

# These do not depend on the file being read, so build them once
stopwords = nltk.corpus.stopwords.words('english')
stemmer = SnowballStemmer("english")

for c in categories:
    directory_path = 'data/' + c
    onlyfiles = [f for f in listdir(directory_path) if isfile(join(directory_path, f))]
    print("Level 1 Intent : ", c)
    print("---------------------------------------")
    for file_name in onlyfiles:
        file_path = directory_path + '/' + file_name
        # No encoding is passed, so the platform default is used to decode the file
        with open(file_path) as f:
            first_line = f.readline()
            print("Level 2 Intent for ", c, " : ", first_line)

The output is printed as expected for the files that are read, but I still run into this error message.
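
One possible workaround, sketched under the assumption that the failure comes from `open(file_path)` decoding the file contents rather than from the file names themselves: read the first line as raw bytes, try UTF-8, and fall back to a single-byte encoding. The helper name `read_first_line`, the Latin-1 fallback, and the sample file contents are all hypothetical and chosen only for illustration.

```python
import os
import tempfile


def read_first_line(file_path, fallback_encoding="latin-1"):
    """Return the first line of a file as text, tolerating non-UTF-8 bytes."""
    # Binary mode: only the bytes of the first line are ever decoded.
    with open(file_path, "rb") as f:
        raw_line = f.readline()
    try:
        return raw_line.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is a single-byte encoding.
        return raw_line.decode(fallback_encoding)


# Hypothetical sample files: one with a bad byte after the first line,
# one with a bad byte inside the first line.
results = []
for content in (b"Gerald Ford\ncaf\xe8\n", b"Caf\xe8 notes\nmore text\n"):
    with tempfile.NamedTemporaryFile(suffix=".clean", delete=False) as tmp:
        tmp.write(content)
    results.append(read_first_line(tmp.name))
    os.remove(tmp.name)

print(results)
```

In the loop from the question, `first_line = read_first_line(file_path)` would take the place of the `with open(file_path) as f:` block. If the real encoding of the files is known, passing it directly, for example `io.open(file_path, encoding='latin-1')`, is simpler than guessing.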

Sample output

...
Level 2 Intent for  Presidents  :  Gerald Ford
Level 1 Intent :  Scientists
---------------------------------------
Level 2 Intent for  Scientists  :  Alessandro_Volta
Level 2 Intent for  Scientists  :  Michael_Faraday
Level 2 Intent for  Scientists  :  James_Watt
Level 2 Intent for  Scientists  :  Nikola_Tesla
...

Sample file names

1.clean
2.clean
S08_set3_a1.txt.clean
S08_set3_a8.txt.clean

Any suggestions on this would be highly appreciated.

0 answers:

No answers yet