Question

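I am building an OCR project and I am using a .NET wrapper for Tesseract. The wrapper's samples do not show how to handle a PDF as input. Given a PDF as input, how can I generate a searchable PDF using C#?

I use the Ghostscript library to convert the PDF to images and then feed those images to Tesseract. That extracts the text well, but it does not preserve the layout of the original PDF; I only get the text.

How can I extract the text from the PDF and keep the layout of the original PDF?

This is a page from the PDF. I do not want only the text; I want the text laid out as it is in the original PDF. Sorry for my poor English.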

Answer 1

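Tesseract has supported creating sandwich PDFs (the page image with an invisible, searchable text layer) since version 3.0, but version 3.02 or 3.03 is recommended for this feature. Pdfsandwich is a script that does more or less what you need.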
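As a rough illustration of the sandwich idea, here is a minimal Python sketch rather than a C# one, since Python is what the other answers in this thread use. It assumes pdf2image (with Poppler) and pytesseract (with the tesseract binary on PATH) are installed. The function names `render_pages`, `ocr_page_to_pdf`, and `make_searchable_pages` are made up for this example, and the 300 dpi setting is only a guess at a reasonable value. It writes one searchable PDF per page; merging those pages back into a single file is left out.

```python
from pathlib import Path


def render_pages(pdf_path):
    """Rasterize each PDF page to a PIL image (needs pdf2image and Poppler)."""
    import pdf2image  # imported lazily so the module loads without it
    return pdf2image.convert_from_path(pdf_path, dpi=300)


def ocr_page_to_pdf(image):
    """Return the bytes of a one-page searchable PDF for a single image."""
    import pytesseract  # needs the tesseract binary on PATH
    return pytesseract.image_to_pdf_or_hocr(image, extension="pdf")


def make_searchable_pages(pdf_path, out_dir, render=render_pages, ocr=ocr_page_to_pdf):
    """Write one searchable PDF per input page and return the written paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(pdf_path).stem
    written = []
    for number, image in enumerate(render(pdf_path), start=1):
        # Hypothetical naming scheme: <input name>-<4-digit page number>.pdf
        target = out_dir / f"{stem}-{number:04d}.pdf"
        target.write_bytes(ocr(image))
        written.append(target)
    return written
```

Calling `make_searchable_pages("scan.pdf", "out")` would then leave files such as `out/scan-0001.pdf`, each keeping the page image with the recognized text underneath it. The `render` and `ocr` parameters exist only so the two steps can be swapped out, for example for a Ghostscript-based renderer.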

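There is an online service, www.sandwichpdf.com, that uses Tesseract to create searchable PDFs. You may want to run some tests there before you start implementing a solution with Tesseract. The results are acceptable, but some commercial products give better results. Disclosure: I am the creator of www.sandwichpdf.com.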

Answer 2

Just for the sake of documentation, here is an example that uses tesseract and pdf2image to extract text from an image-only PDF with OCR.

import pdf2image
import pytesseract


def pdf_to_img(pdf_file):
    # Render every page of the PDF to a PIL image (pdf2image requires Poppler).
    return pdf2image.convert_from_path(pdf_file)


def ocr_core(image):
    # Run Tesseract on one page image and return the recognized text.
    return pytesseract.image_to_string(image)


def print_pages(pdf_file):
    # OCR each page in order and print its text.
    for img in pdf_to_img(pdf_file):
        print(ocr_core(img))


print_pages('sample.pdf')

Answer 3

Go to pdf2png.com and upload the PDF. It produces a .zip file containing one PNG per page, named <pdf_name>-<page_number>.png.

Then you can write simple Python code such as:

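#!/usr/bin/python3
# coding: utf-8
import os

pdf_name = 'pdf_name'               # placeholder: base name of the PNG files
language = 'language of tesseract'  # placeholder: Tesseract language code
# Placeholder: replace the string with the actual number of pages.
for x in range(int('number of pdf_pages')):
    # Tesseract writes the recognized text for this page to {x}.txt
    cmd = f'tesseract {pdf_name}-{x}.png {x} -l {language}'
    os.system(cmd)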

Then read all the output files in order, starting from e.g. 1.txt, and append them to a single file. It is that simple.
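The merge step could look roughly like the sketch below. It assumes the per-page outputs are named `<page_number>.txt` in one directory, as in the loop above; the function name `merge_text_files` and the default output name `combined.txt` are hypothetical. The page numbers are passed in explicitly because the numbering may start at 0 or 1 depending on how the PNG files were named.

```python
from pathlib import Path


def merge_text_files(directory, page_numbers, output_name="combined.txt"):
    """Concatenate <n>.txt for each n in page_numbers into a single file."""
    directory = Path(directory)
    parts = []
    for page in page_numbers:
        # Read the pages in the order given so the text keeps document order.
        parts.append((directory / f"{page}.txt").read_text(encoding="utf-8"))
    output = directory / output_name
    output.write_text("\n".join(parts), encoding="utf-8")
    return output
```

For a three-page document numbered from 1, `merge_text_files(".", range(1, 4))` would join 1.txt, 2.txt, and 3.txt into combined.txt.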

Tesseract OCR with a PDF as input
