Question

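I am trying to extract text from some PDF files using Textract. However, when I print the text at the end of my code, it just prints a lot of whitespace. Can anyone point out what I'm doing wrong? (By the way, text is not equal to "".)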

import os
import codecs
import PyPDF2 
import textract
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

for filename in os.listdir('Harbour PDF'):
    if '.DS_Store' == filename:
        continue
    filename = 'Harbour PDF/' + filename
    print(filename)

    pdfFileObj = open(filename,'rb')

    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

    num_pages = pdfReader.numPages
    count = 0
    text = ""

    while count < num_pages:
        pageObj = pdfReader.getPage(count)
        count +=1
        text += pageObj.extractText()


    if text != "":
        text = text
    else:
        text = textract.process(pdfFileObj, method='tesseract', language='eng')

    print(text)

Answer 1

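Here are two functions I use from Python (the second one requires tesseract). I actually prefer tesseract over pdfminer, but they do essentially the same job. I'm not sure what is wrong with your code, but I believe these are equivalent alternatives.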
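A minimal sketch of what such a pair of functions might look like, assuming textract's `method='pdfminer'` and `method='tesseract'` backends are installed. The helper names (`is_blank`, `extract_with_pdfminer`, `extract_with_tesseract`, `extract_text`) are made up for illustration and are not part of any library. Two things are worth noting: `textract.process` expects a file path rather than an open file object, and a string that contains only spaces and newlines is still not equal to `""`, so the sketch checks `strip()` before deciding whether to fall back to OCR.

```python
def is_blank(text):
    """Return True when text is empty or contains only whitespace."""
    return not text.strip()


def extract_with_pdfminer(path):
    """Read the PDF's embedded text layer (hypothetical helper)."""
    import textract  # imported lazily so the module loads without it
    raw = textract.process(path, method='pdfminer')  # returns bytes
    return raw.decode('utf-8', errors='replace')


def extract_with_tesseract(path, language='eng'):
    """OCR the PDF pages; needs the tesseract binary (hypothetical helper)."""
    import textract
    raw = textract.process(path, method='tesseract', language=language)
    return raw.decode('utf-8', errors='replace')


def extract_text(path, primary=extract_with_pdfminer,
                 fallback=extract_with_tesseract):
    """Try the text layer first and fall back to OCR if it is blank."""
    text = primary(path)
    if is_blank(text):
        text = fallback(path)
    return text
```

With something like this, the loop in the question could call `extract_text(filename)` once per file instead of comparing the accumulated string to `""`.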

Extracting/scraping PDFs with Textract: text is not printed
