I am trying to extract keywords from a PDF file in Python using the libraries below.
Here is what I am doing:
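import PyPDF2
import textract
from flask import request  # assumed: `request` is used below but was never imported
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

def parse_pdf(file_name):
    # Look up the PDF path from the query-string parameter named by file_name
    filename = request.args.get(file_name)
    # Read the text layer of every page with PyPDF2
    text = ""
    with open(filename, 'rb') as pdfFileObj:
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        num_pages = pdfReader.numPages
        count = 0
        while count < num_pages:
            pageObj = pdfReader.getPage(count)
            count += 1
            text += pageObj.extractText()
    # If no text was extracted (e.g. a scanned PDF), fall back to OCR
    if text == "":
        # textract.process returns bytes, so decode before tokenizing
        text = textract.process(filename, method='tesseract', language='eng').decode('utf-8')
    tokens = word_tokenize(text)
    punctuations = ['(', ')', ';', ':', '[', ']', ',']
    stop_words = stopwords.words('english')
    # Drop stop words and punctuation; whatever remains is treated as a keyword
    keywords = [word for word in tokens if word not in stop_words and word not in punctuations]
    return keywords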
This parses the PDF, but the keywords it returns are run together, like this:
"defectswillbesendintotrainedneuralnetworkmodeltogetresults"
"anewtechniquethatistoclassifythedefectsusingneural"
"technologybasedonneuralnetwork"
"convolutionalneuralnetworkhighlightsoutstand-"
How can I make sure the individual words are separated by spaces?
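
For reference, one workaround I could imagine is splitting the run-together strings after extraction, using a list of known words. The following is only a minimal sketch under that assumption, not a fix for the extraction itself: `segment` and the toy `vocabulary` are hypothetical names made up for illustration, and a real word list would have to be much larger.

```python
def segment(text, vocabulary):
    """Split a run-together string into known words.

    Dynamic programming: best[i] holds the segmentation of text[:i]
    that uses the fewest words, or None if text[:i] cannot be split.
    """
    max_len = max(map(len, vocabulary))
    best = [None] * (len(text) + 1)
    best[0] = []  # the empty prefix needs no words
    for end in range(1, len(text) + 1):
        # Only look back as far as the longest word in the vocabulary
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if best[start] is not None and piece in vocabulary:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(text)]


# Hypothetical toy vocabulary, just enough for one of the strings above
vocabulary = {"technology", "based", "on", "neural", "network",
              "a", "new", "technique"}

words = segment("technologybasedonneuralnetwork", vocabulary)
print(" ".join(words))  # technology based on neural network
```

This returns None when a string contains anything outside the vocabulary, so it would need a fallback for unknown words, and it does nothing about hyphenated line breaks such as "outstand-".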