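I have looked through the various questions already posted here and tried the solutions they mention, but I still cannot get my code to work.
I am trying to use Python to go through the files in a directory and read the first line of each one, but I keep running into the following error.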
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 907: invalid continuation byte
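For context, here is a minimal sketch of how this kind of error can arise. It assumes the offending byte comes from text saved in Latin-1 (ISO-8859-1), where "è" is the single byte 0xE8; that is only a guess, since the real encoding of the data files is not known. The sample string is made up for illustration.

```python
# Hypothetical sample text: "è" encoded as Latin-1 is the single byte 0xE8.
raw = "Amp\u00e8re was".encode("latin-1")

try:
    # In UTF-8, 0xE8 starts a three-byte sequence, so the next byte must be
    # a continuation byte (0x80-0xBF). Here it is the plain letter "r".
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    message = str(exc)
    print(message)

# Decoding with the encoding the bytes were written in succeeds.
text = raw.decode("latin-1")
print(text)
```

If the files really are in a different single-byte encoding, the same bytes would decode without an error but to different characters, so the encoding should be confirmed rather than assumed.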
Code
from __future__ import print_function
import sklearn.datasets
import nltk

# shuffle expects a boolean; the string 'False' is truthy and would enable shuffling
dataset = sklearn.datasets.load_files('data/', shuffle=False)
categories = dataset.target_names

from os import listdir
from os.path import isfile, join

for c in categories:
    directory_path = 'data/' + c
    onlyfiles = [f for f in listdir(directory_path) if isfile(join(directory_path, f))]
    print("Level 1 Intent : ", c)
    print("---------------------------------------")
    for file_name in onlyfiles:
        file_path = directory_path + '/' + file_name
        # No encoding is passed here, so the platform default is used
        with open(file_path) as f:
            first_line = f.readline()
            print("Level 2 Intent for ", c, " : ", first_line)

stopwords = nltk.corpus.stopwords.words('english')
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
The output is printed as expected for some of the files, but then I get the error message above.
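One possible direction, sketched under assumptions rather than offered as a confirmed fix: open each file with an explicit encoding and fall back to a second one when UTF-8 decoding fails. The helper name `read_first_line`, the `fallback_encoding` parameter, and the choice of Latin-1 as the fallback are all hypothetical; Latin-1 can decode any byte sequence, so it never raises, but it may produce the wrong characters if the file uses some other encoding. The sketch assumes Python 3, where `open` accepts an `encoding` argument.

```python
import os
import tempfile


def read_first_line(file_path, fallback_encoding="latin-1"):
    """Return the first line of a file, trying UTF-8 before a fallback.

    Hypothetical helper: the fallback encoding is an assumption, not
    something known about the actual data files.
    """
    try:
        with open(file_path, encoding="utf-8") as f:
            return f.readline().strip()
    except UnicodeDecodeError:
        # UTF-8 failed, so retry with the fallback encoding.
        with open(file_path, encoding=fallback_encoding) as f:
            return f.readline().strip()


# Demo with a made-up file whose content is Latin-1 encoded.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "1.clean")
    with open(sample_path, "wb") as out:
        out.write("Andr\u00e9-Marie_Amp\u00e8re\nbody text\n".encode("latin-1"))
    first_line = read_first_line(sample_path)
    print(first_line)
```

Note that the decode error can be raised by `readline()` even when the bad byte sits after the first line, because text-mode files are read and decoded in larger chunks. Passing `errors="replace"` to `open` is another option when losing a few characters is acceptable.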
Sample output
...
Level 2 Intent for Presidents : Gerald Ford
Level 1 Intent : Scientists
---------------------------------------
Level 2 Intent for Scientists : Alessandro_Volta
Level 2 Intent for Scientists : Michael_Faraday
Level 2 Intent for Scientists : James_Watt
Level 2 Intent for Scientists : Nikola_Tesla
...
Sample file names
1.clean
2.clean
S08_set3_a1.txt.clean
S08_set3_a8.txt.clean
Any suggestions on this would be highly appreciated.