Parsing a PDF file with PyPDF2 and textract; the returned keywords are run together

Time: 2019-11-11 21:49:33

Tags: python nltk text-extraction pypdf2

I want to extract keywords from a PDF file in Python using the following libraries:

  • PyPDF2
  • textract
  • nltk

Here is what I have so far:

import PyPDF2
import textract
from flask import request  # assumption: parse_pdf runs inside a Flask request handler
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

def parse_pdf(file_name):
    # file_name is the query-string key whose value is the path to the PDF
    filename = request.args.get(file_name)
    text = ""
    with open(filename, 'rb') as pdfFileObj:
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        # Concatenate the text layer of every page
        for count in range(pdfReader.numPages):
            text += pdfReader.getPage(count).extractText()
    if text == "":
        # No text layer found: fall back to OCR (textract returns bytes, so decode)
        text = textract.process(filename, method='tesseract', language='eng').decode('utf-8')
    tokens = word_tokenize(text)
    punctuations = ['(', ')', ';', ':', '[', ']', ',']
    stop_words = stopwords.words('english')
    # Drop stop words and the listed punctuation marks
    keywords = [word for word in tokens if word not in stop_words and word not in punctuations]
    return keywords

The PDF is parsed, but the returned keywords are run together, like this:

"defectswillbesendintotrainedneuralnetworkmodeltogetresults"
"anewtechniquethatistoclassifythedefectsusingneural"
"technologybasedonneuralnetwork"
"convolutionalneuralnetworkhighlightsoutstand-"

How can I make sure the individual words are separated by spaces?
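
To make the goal concrete, here is a minimal sketch of one possible post-processing step: splitting a run-together string back into words by dictionary lookup with dynamic programming. It is not part of the code above and does not fix the extraction itself. The function name `segment`, the parameter `max_word_len`, and the toy `VOCAB` set are all hypothetical; a real vocabulary would need to be a large word list (for example, the list returned by `nltk.corpus.words.words()` might serve), and results depend entirely on that list.

```python
def segment(text, vocabulary, max_word_len=20):
    """Split a run-together string into known words.

    Returns the segmentation with the fewest words, or None if the
    string cannot be covered entirely by words from `vocabulary`.
    """
    n = len(text)
    # best[i] is the fewest-words segmentation of text[:i], or None if none exists
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            piece = text[start:end]
            if best[start] is not None and piece in vocabulary:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]


# Hypothetical toy vocabulary, for illustration only
VOCAB = {"technology", "based", "on", "neural", "network", "net", "work"}

print(segment("technologybasedonneuralnetwork", VOCAB))
# prints ['technology', 'based', 'on', 'neural', 'network']
```

Preferring the fewest words is a simple heuristic that favors "network" over "net" + "work"; it can still choose wrong splits for ambiguous strings, and any token missing from the vocabulary makes the whole string unsegmentable (the function returns None).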

0 answers:

No answers yet