Question

I want to extract keywords from a PDF file in Python using the following libraries:

PyPDF2
textract
nltk

Here is what I am doing:

import PyPDF2
import textract
from flask import request  # assumption: `request` is Flask's request object
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

def parse_pdf(file_name):
    # Look up the PDF path in the query-string parameter named by file_name
    filename = request.args.get(file_name)
    text = ""
    with open(filename, 'rb') as pdfFileObj:
        pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        # Concatenate the text layer of every page
        for count in range(pdfReader.numPages):
            text += pdfReader.getPage(count).extractText()
    if text == "":
        # No text layer found: fall back to OCR (textract returns bytes)
        text = textract.process(filename, method='tesseract', language='eng').decode('utf-8')
    tokens = word_tokenize(text)
    punctuations = ['(', ')', ';', ':', '[', ']', ',']
    stop_words = stopwords.words('english')
    # Drop stop words and punctuation tokens
    keywords = [word for word in tokens if word not in stop_words and word not in punctuations]
    return keywords
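
To make the last steps of the function easier to reason about, here is a minimal sketch of the tokenize-and-filter stage on its own. It assumes a regex tokenizer as a rough stand-in for `word_tokenize`, and `STOP_WORDS`, `simple_tokenize` and `filter_keywords` are hypothetical names with a toy stop list, not part of NLTK. It suggests that this stage can only split on whitespace and punctuation already present in `text`, so text that arrives without spaces stays a single token.

```python
import re

# Hypothetical stand-ins so the sketch runs without NLTK data
STOP_WORDS = {"on", "the", "is", "to", "a"}
PUNCTUATIONS = {"(", ")", ";", ":", "[", "]", ","}

def simple_tokenize(text):
    # Runs of word characters, or single punctuation marks
    return re.findall(r"\w+|[^\w\s]", text)

def filter_keywords(text):
    # Same filtering idea as in parse_pdf: drop stop words and punctuation
    return [w for w in simple_tokenize(text)
            if w not in STOP_WORDS and w not in PUNCTUATIONS]

# Text with spaces is split and filtered as expected
print(filter_keywords("technology based on neural network"))
# Text without spaces comes back as one long token
print(filter_keywords("technologybasedonneuralnetwork"))
```

If this holds for `word_tokenize` as well, the missing spaces would have to be introduced before tokenizing, most likely at the text-extraction step.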

The PDF is parsed, but the returned keywords are run together, like this:

"defectswillbesendintotrainedneuralnetworkmodeltogetresults"
"anewtechniquethatistoclassifythedefectsusingneural"
"technologybasedonneuralnetwork"
"convolutionalneuralnetworkhighlightsoutstand-"

How can I make sure that the individual words are separated by spaces?
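
For illustration, one possible post-processing workaround (not a fix for the extraction itself) is dictionary-based word segmentation: split a fused string into known words with dynamic programming. The sketch below is only an assumption about what might help. `VOCAB` is a hypothetical toy word list and `segment` is a hypothetical helper, and a real document would need a much larger vocabulary.

```python
# Hypothetical toy vocabulary; real text needs a far larger word list
VOCAB = {
    "technology", "based", "on", "neural", "network", "a", "new",
    "technique", "that", "is", "to", "classify", "the", "defects", "using",
}

def segment(text, vocab=VOCAB):
    """Split text into vocabulary words, preferring fewer words.

    Returns None when no complete split exists.
    """
    n = len(text)
    max_len = max(map(len, vocab))
    # best[i] holds the fewest-words split of text[:i], or None if none exists
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if best[start] is not None and piece in vocab:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]

words = segment("technologybasedonneuralnetwork")
print(" ".join(words) if words else "no split found")
```

Strings containing words outside the vocabulary, or fragments such as the trailing "outstand-", return None in this sketch, so it would need a fallback in practice.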

Parsing a PDF file with PyPDF2 and textract; the returned keywords are run together

0 answers: