Question

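I'm using Python's NLTK TaggedCorpusReader to build a corpus from a set of text files. However, one of the files is either not UTF-8 or contains an unsupported character. Is there a way to tell which file contains the problem?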

Here is my code:

import nltk
corpus=nltk.corpus.TaggedCorpusReader("filepath", '.*.txt', encoding='utf-8') #I added the encoding when I saw some answer about that, but it doesn't seem to help
words=corpus.words()
for w in words:
    print(w)

My error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 0: invalid start byte

Answer 1

You can identify the file by reading the corpus one file at a time, like this:

import nltk

corpus = nltk.corpus.TaggedCorpusReader("filepath", r'.*\.txt', encoding='utf-8')

for filename in corpus.fileids():
    try:
        # Corpus views are read lazily, so iterate to force the whole file to be decoded.
        for _ in corpus.words(filename):
            pass
    except UnicodeDecodeError:
        print("UnicodeDecodeError in", filename)

(Or you could print each filename before reading it, and not bother catching the error at all.)

Once you've found the file, you'll have to figure out the source of the problem. Is your corpus really encoded in UTF-8? Perhaps it uses another 8-bit encoding, such as Latin-1. Specifying an 8-bit encoding like Latin-1 will not give you an error (these formats have no error checking, so any byte sequence decodes to something), but you can have Python print a few lines and see whether the chosen encoding looks right.

If your corpus is almost entirely English, you can search the files for lines that contain non-ASCII characters and print just those lines.
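
Searching a suspect file for non-ASCII lines could look roughly like the sketch below. It reads the file in binary mode so that nothing is decoded until you choose to, and it does not use NLTK at all. The helper names non_ascii_lines and report, and the default guess of Latin-1, are assumptions made for illustration and are not part of the original answer.

```python
def non_ascii_lines(path):
    """Yield (line_number, raw_bytes) for every line containing a byte >= 0x80."""
    # Binary mode: nothing is decoded here, so no UnicodeDecodeError can occur.
    with open(path, "rb") as f:
        for lineno, line in enumerate(f, start=1):
            if any(b >= 0x80 for b in line):
                yield lineno, line.rstrip(b"\r\n")


def report(path, guess="latin-1"):
    """Print each suspicious line as raw bytes and as decoded under a guessed encoding."""
    for lineno, raw in non_ascii_lines(path):
        # errors="replace" keeps the report going even if the guess is wrong.
        decoded = raw.decode(guess, errors="replace")
        print(f"{path}:{lineno}: {raw!r} -> {decoded}")
```

Calling report() on the file named by the loop above prints only the lines worth inspecting. If the decoded text looks right under the guessed encoding, passing that encoding to TaggedCorpusReader may be enough. The 0xa0 byte in the error message is a no-break space in both Latin-1 and Windows-1252, so one of those is a plausible first guess.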

Finding the corrupt file in a Python corpus
