How do I fix the error "cannot use a string pattern on a bytes-like object"?

Asked: 2019-09-25 02:30:42

Tags: python

I am following this tutorial to read a PDF file and convert it to text, but I keep running into an error. Here is my Python code:

import PyPDF2 
import textract
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

pdfFileObj = open(filename,'rb')
#The pdfReader variable is a readable object that will be parsed
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
#Discerning the number of pages will allow us to parse through all the pages
num_pages = pdfReader.numPages
count = 0
text = ""
#The while loop will read each page
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count +=1
    text += pageObj.extractText()

#If PyPDF2 extracted no text, fall back to OCR via textract
if text == "":
    text = textract.process(fileurl, method='tesseract', language='eng')


tokens = word_tokenize(text)

punctuations = ['(',')',';',':','[',']',',']

stop_words = stopwords.words('english')

keywords = [word for word in tokens if not word in stop_words and not word in punctuations]

The error I keep getting is:

tokens = word_tokenize(text)

TypeError: cannot use a string pattern on a bytes-like object

How can I fix this error?

1 Answer:

Answer 0: (score: 1)

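You are passing bytes data, but word_tokenize needs a string because it uses regular expressions under the hood. In your fallback branch, textract.process returns bytes rather than str, so text is a bytes object by the time it reaches word_tokenize.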
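To illustrate the cause, here is a minimal sketch that reproduces the same TypeError using only the standard re module. The pattern and the helper name try_tokenize are made up for this illustration and are not taken from NLTK; the point is only that a str pattern cannot be applied to a bytes object.

```python
import re

# A str pattern, standing in for the regexes a tokenizer might use internally
# (hypothetical pattern, not NLTK's actual one).
WORD_PATTERN = re.compile(r"\w+")

def try_tokenize(data):
    """Return the matched words, or the TypeError message if the types do not match."""
    try:
        return WORD_PATTERN.findall(data)
    except TypeError as exc:
        return str(exc)

# A str input works with a str pattern.
print(try_tokenize("read the pdf"))

# A bytes input triggers the TypeError from the question.
print(try_tokenize(b"read the pdf"))

# Decoding the bytes first makes the input a str again.
print(try_tokenize(b"read the pdf".decode("utf-8")))
```

The second call is the situation in the question: the pattern is a str, the input is bytes.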

Change the tokenizing line to:

tokens = word_tokenize(text.decode("utf-8"))
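One caveat: in Python 3, str has no decode method, so calling text.decode unconditionally would fail whenever PyPDF2 did extract text and text is already a str. A possible way to cover both branches is to decode only when the value is actually bytes. The helper name ensure_text below is hypothetical, and UTF-8 is assumed as the encoding, as in the line above.

```python
def ensure_text(data, encoding="utf-8"):
    """Return data as str, decoding it only if it is a bytes-like value.

    The encoding default is an assumption; adjust it if the OCR output
    uses a different encoding.
    """
    if isinstance(data, (bytes, bytearray)):
        return data.decode(encoding)
    return data

# str input (text extracted by PyPDF2) passes through unchanged.
print(ensure_text("already text"))

# bytes input (such as the textract fallback result) is decoded to str.
print(ensure_text(b"ocr output"))

# In the question's script this would be used as:
#     tokens = word_tokenize(ensure_text(text))
```

Alternatively, the decode could be applied directly where the bytes originate, on the result of textract.process, so that text is always a str afterwards.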