Python PyPDF2: 'utf-8' codec can't decode byte 0x80 in position 395: invalid start byte

Time: 2018-05-31 15:01:49

Tags: nlp nltk pypdf2

I followed a tutorial to build a corpus from a set of PDF files. I have the following code:

import nltk
import PyPDF2
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from PyPDF2 import PdfFileReader

def getTextPDF(pdfFileName):
    pdf_file = open(pdfFileName, 'rb')
    readpdf = PdfFileReader(pdf_file)
    text = []
    for i in range(0,readpdf.getNumPages()):
        text.append(readpdf.getPage(i).extractText())
    return '\n'.join(text)

corpusDir = 'reports/'

jun15 = getTextPDF('reports/June2015.pdf')
dec15 = getTextPDF('reports/December2015.pdf')
jun16 = getTextPDF('reports/June2016.pdf')
dec16 = getTextPDF('reports/December2016.pdf')
jun17 = getTextPDF('reports/June2017.pdf')
dec17 = getTextPDF('reports/December2017.pdf')

files = [jun15,dec15,jun16,dec16,jun17,dec17]
for idx, f in enumerate(files):
    with open (corpusDir+str(idx)+'.txt','w') as output:
        output.write(f)

corpus = PlaintextCorpusReader(corpusDir, '.*')

print (corpus.words())
        

UnicodeDecodeError                        Traceback (most recent call last)
 in ()
----> 1 print (corpus.words())

/anaconda3/lib/python3.6/site-packages/nltk/collections.py in __repr__(self)
    224         pieces = []
    225         length = 5
--> 226         for elt in self:
    227             pieces.append(repr(elt))
    228             length += len(pieces[-1]) + 2

/anaconda3/lib/python3.6/site-packages/nltk/corpus/reader/util.py in iterate_from(self, start_tok)
    400
    401             # Get everything we can from this piece.
--> 402             for tok in piece.iterate_from(max(0, start_tok-offset)):
    403                 yield tok
    404

/anaconda3/lib/python3.6/site-packages/nltk/corpus/reader/util.py in iterate_from(self, start_tok)
    294             self._current_toknum = toknum
    295             self._current_blocknum = block_index
--> 296             tokens = self.read_block(self._stream)
    297             assert isinstance(tokens, (tuple, list, AbstractLazySequence)), (
    298                 'block reader %s() should return list or tuple.' %

/anaconda3/lib/python3.6/site-packages/nltk/corpus/reader/plaintext.py in _read_word_block(self, stream)
    120         words = []
    121         for i in range(20): # Read 20 lines at a time.
--> 122             words.extend(self._word_tokenizer.tokenize(stream.readline()))
    123         return words
    124

/anaconda3/lib/python3.6/site-packages/nltk/data.py in readline(self, size)
   1166         while True:
   1167             startpos = self.stream.tell() - len(self.bytebuffer)
-> 1168             new_chars = self._read(readsize)
   1169
   1170             # If we're at a '\r', then read one extra character, since

/anaconda3/lib/python3.6/site-packages/nltk/data.py in _read(self, size)
   1398
   1399         # Decode the bytes into unicode characters
-> 1400         chars, bytes_decoded = self._incr_decode(bytes)
   1401
   1402         # If we got bytes but couldn't decode any, then read further.

/anaconda3/lib/python3.6/site-packages/nltk/data.py in _incr_decode(self, bytes)
   1429         while True:
   1430             try:
-> 1431                 return self.decode(bytes, 'strict')
   1432             except UnicodeDecodeError as exc:
   1433                 # If the exception occurs at the end of the string,

/anaconda3/lib/python3.6/encodings/utf_8.py in decode(input, errors)
     14
     15 def decode(input, errors='strict'):
---> 16     return codecs.utf_8_decode(input, errors, True)
     17
     18 class IncrementalEncoder(codecs.IncrementalEncoder):

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 395: invalid start byte

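I have been looking through different posts, but I still can't tell whether the problem is that I'm using the wrong approach or that I need to encode or decode something. If it is the latter, I don't know where. Any ideas would be appreciated.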

1 Answer:

Answer 0: (score: 0)

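It would be best to see the whole error message, but my guess is that you are using Python 2 and that your reports contain some UTF-8 text. First, try declaring the encoding at the top of the script and specifying it when you open the files: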

#!/usr/bin/python
#-*- coding:utf-8 -*- 
import nltk
import PyPDF2
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from PyPDF2 import PdfFileReader
import codecs
def getTextPDF(pdfFileName):
    # A PDF is binary data, so open it in plain binary mode; decoding it as utf8 would break PdfFileReader.
    pdf_file = open(pdfFileName, 'rb')
    readpdf = PdfFileReader(pdf_file)
    text = []
    for i in range(0,readpdf.getNumPages()):
        text.append(readpdf.getPage(i).extractText())
    return '\n'.join(text)

corpusDir = 'reports/'

jun15 = getTextPDF('reports/June2015.pdf')
dec15 = getTextPDF('reports/December2015.pdf')
jun16 = getTextPDF('reports/June2016.pdf')
dec16 = getTextPDF('reports/December2016.pdf')
jun17 = getTextPDF('reports/June2017.pdf')
dec17 = getTextPDF('reports/December2017.pdf')

files = [jun15,dec15,jun16,dec16,jun17,dec17]
for idx, f in enumerate(files):
    with codecs.open(corpusDir+str(idx)+'.txt','w', encoding='utf8') as output:
        output.write(f)

corpus = PlaintextCorpusReader(corpusDir, '.*')

print (corpus.words())
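
One thing worth ruling out, given the code above: the pattern '.*' matches every file in reports/, so the corpus reader will also try to decode the PDFs themselves (and any other binary file in that folder) as UTF-8. The sketch below uses only the standard library to imitate that filename matching on a made-up temporary directory; matching_files and undecodable are names I invented for illustration, not NLTK functions. If this turns out to be the cause, passing a narrower pattern such as r'.*\.txt' to PlaintextCorpusReader should have the same effect.

```python
import os
import re
import tempfile


def matching_files(root, pattern):
    """Return the sorted file names in root whose name matches pattern."""
    return sorted(name for name in os.listdir(root) if re.match(pattern, name))


def undecodable(root, names, encoding="utf-8"):
    """Return the names whose raw bytes cannot be decoded with encoding."""
    bad = []
    for name in names:
        with open(os.path.join(root, name), "rb") as handle:
            data = handle.read()
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            bad.append(name)
    return bad


# Hypothetical stand-in for reports/: one binary "PDF" and one extracted text file.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "June2015.pdf"), "wb") as handle:
    handle.write(b"%PDF-1.4\n\x80\x81\x82 binary stream")
with open(os.path.join(demo_dir, "0.txt"), "w", encoding="utf-8") as handle:
    handle.write("extracted report text")

everything = matching_files(demo_dir, r".*")       # what '.*' selects
only_text = matching_files(demo_dir, r".*\.txt")   # narrower pattern

print(everything, undecodable(demo_dir, everything))
print(only_text, undecodable(demo_dir, only_text))
```

Writing the .txt files to a separate directory that contains nothing else would be another way to get the same result.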

If that doesn't work, you can try converting your strings by hand, although this is not ideal:

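def toUtf8(stringOrUnicode):
    '''
    Returns the argument as a unicode object, decoding byte strings as utf-8.
    Python 2 only: the unicode type does not exist in Python 3.
    '''
    typeArg = type(stringOrUnicode)
    if typeArg is unicode:
        # Already unicode: the encode/decode round trip returns an equal string.
        return stringOrUnicode.encode('utf8').decode('utf8')
    elif typeArg is str:
        # Byte string: decode it as utf-8.
        return stringOrUnicode.decode('utf8')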
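
The helper above relies on the Python 2 unicode type. Since the paths in the traceback point to Python 3.6, a rough Python 3 counterpart might look like the sketch below. The name to_text and the choice of errors="replace" are my own assumptions; replacing bad bytes hides the problem rather than fixing it, so treat this as a stopgap.

```python
def to_text(value, encoding="utf-8"):
    """Return value as str, decoding bytes with the given encoding."""
    if isinstance(value, bytes):
        # errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
        return value.decode(encoding, errors="replace")
    if isinstance(value, str):
        # Already text in Python 3: nothing to do.
        return value
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)


print(to_text(b"caf\xc3\xa9"))
print(to_text("already text"))
```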

Otherwise, please show us the error message so we can try to pinpoint exactly where the problem is.
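
To help pin it down, you could check the raw bytes of each file the corpus reader picks up. Here is a minimal sketch, assuming the file contents are read in binary mode first; first_bad_byte is a hypothetical helper of mine, not part of NLTK or PyPDF2, and the sample bytes are made up to mirror the position in the error message.

```python
def first_bad_byte(data, encoding="utf-8"):
    """Return (position, byte value) of the first undecodable byte, or None."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the offending byte within data.
        return exc.start, data[exc.start]
    return None


# Made-up sample: 395 ASCII bytes followed by a stray 0x80.
sample = b"a" * 395 + b"\x80"
position, value = first_bad_byte(sample)
print(position, hex(value))
```

Running it over open(path, 'rb').read() for every file in the corpus directory should show which file contains the byte at position 395.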