
Question

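Does anyone know a way to vectorize the text in a PDF document? That is, I want each letter to be a shape/outline, with no text content at all. I'm on Linux, and open-source or non-Windows solutions are preferred.

Context: I'm trying to edit some old PDFs for which I no longer have the fonts. I wanted to do this in Inkscape, but it replaces all the fonts with generic ones, and the result is nearly unreadable. I have also converted back and forth with pdf2ps and ps2pdf, but the font information stays in the file, so when I load it into Inkscape it still looks terrible.

Any ideas? Thanks.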

Answer 1

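To achieve this, you have to:

1. Split the PDF into individual pages;
2. Convert the PDF pages to SVG;
3. Edit the pages you want;
4. Reassemble the pages.

This answer will omit step 3, since it cannot be done programmatically.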

Splitting the PDF

If you don't want to split the document programmatically, the modern way is to use stapler. In your favorite shell:

stapler burst file.pdf

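will produce {file_1.pdf,...,file_N.pdf}, where 1...N are the PDF pages. stapler itself uses PyPDF2, and the code for splitting a PDF file is not complicated. The following function splits a file and saves the individual pages in the current directory. (Shamelessly copied from the commands.py file.)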

import os
# Legacy PyPDF2 (< 3.0) names; newer releases use PdfReader/PdfWriter
from PyPDF2 import PdfFileWriter, PdfFileReader

def split(filename):
    # PDFs are binary files, so open the input in 'rb' mode
    with open(filename, 'rb') as inputfp:
        inputpdf = PdfFileReader(inputfp)
        numpages = inputpdf.getNumPages()

        base, ext = os.path.splitext(os.path.basename(filename))

        # Prefix the output template with zeros so that ordering is preserved
        # (page 10 after page 09). The width is the number of digits in the
        # page count; ceil(log10(N)) is one short when N is a power of ten.
        output_template = ''.join([
            base,
            '_',
            '%0',
            str(len(str(numpages))),
            'd',
            ext
        ])

        for page in range(numpages):
            # One writer per page, holding just that page
            outputpdf = PdfFileWriter()
            outputpdf.addPage(inputpdf.getPage(page))

            outputname = output_template % (page + 1)

            with open(outputname, 'wb') as fp:
                outputpdf.write(fp)
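
The zero-padded naming scheme can be tried on its own, without a PDF at hand. The sketch below uses a hypothetical helper, page_names (not part of stapler or PyPDF2), that takes the padding width from the number of digits in the page count. For comparison it also prints ceil(log10(N)), which comes out one digit short when N is an exact power of ten.

```python
import math


def page_names(base, n_pages, ext=".pdf"):
    """Return zero-padded file names for pages 1..n_pages (hypothetical helper)."""
    width = len(str(n_pages))  # number of decimal digits in the page count
    return [
        "{0}_{1:0{2}d}{3}".format(base, page, width, ext)
        for page in range(1, n_pages + 1)
    ]


names = page_names("file", 10)
print(names[0], names[-1])

# Width from ceil(log10(10)) versus width from the digit count
print(int(math.ceil(math.log10(10))), len(str(10)))
```

With a width that matches the digit count, a plain alphabetical sort of the file names is also the page order.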

Converting the individual pages to SVG

Now, to convert the PDF into an editable file, I would probably use pdf2svg.

pdf2svg input.pdf output.svg

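If we look at the pdf2svg.c file, we can see that the code is, in principle, not complicated. Below is a minimal working example in Python, with the input file name passed as inputname and the output file name as outputname. It requires the pycairo and pypoppler libraries: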

import os

import cairo
import poppler

def convert(inputname, outputname):
    # Convert the input file name to a URI to please poppler
    uri = 'file://' + os.path.abspath(inputname)

    pdffile = poppler.document_new_from_file(uri, None)

    # We only have one page, since we split prior to converting. Get the page
    page = pdffile.get_page(0)

    # Get the page dimensions
    width, height = page.get_size()

    # Open the SVG file to write on
    surface = cairo.SVGSurface(outputname, width, height)
    context = cairo.Context(surface)

    # Now we finally can render the PDF to SVG
    page.render_for_printing(context)
    context.show_page()

    # Flush the SVG to disk and release the surface
    surface.finish()

At this point you should have an SVG in which all text has been converted to paths, and which you can edit in Inkscape without rendering problems.
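
One way to check that claim on a converted page is to look for leftover text elements in the SVG. The following is a minimal sketch using only the standard library; count_text_elements is a hypothetical helper, and the two sample documents are made up for illustration. A count of zero only shows that no <text> or <tspan> elements remain, not that the page renders correctly.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"
TEXT_TAGS = (SVG_NS + "text", SVG_NS + "tspan")


def count_text_elements(svg_source):
    """Count <text> and <tspan> elements in an SVG document (str or bytes)."""
    root = ET.fromstring(svg_source)
    return sum(1 for element in root.iter() if element.tag in TEXT_TAGS)


# Made-up samples: one with outlines only, one that still carries text
outlined = '<svg xmlns="http://www.w3.org/2000/svg"><path d="M0 0L1 1"/></svg>'
with_text = ('<svg xmlns="http://www.w3.org/2000/svg">'
             '<text><tspan>Hi</tspan></text></svg>')

print(count_text_elements(outlined))
print(count_text_elements(with_text))
```

For a real file, the contents can be read in binary mode and passed in, for example count_text_elements(open("output.svg", "rb").read()).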

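Combining steps 1 and 2

You can do this by calling pdf2svg in a for loop, but you need to know the number of pages beforehand. The code below finds out the number of pages and does the conversion in a single step. It requires only pycairo and pypoppler: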

import os

import cairo
import poppler

def convert(inputname, base=None):
    '''Converts a multi-page PDF to multiple SVG files.

    :param inputname: Name of the PDF to be converted
    :param base: Base name for the SVG files (optional)
    '''
    if base is None:
        base, ext = os.path.splitext(os.path.basename(inputname))

    # Convert the input file name to a URI to please poppler
    uri = 'file://' + os.path.abspath(inputname)

    pdffile = poppler.document_new_from_file(uri, None)

    pages = pdffile.get_n_pages()

    # Prefix the output template with zeros so that ordering is preserved
    # (page 10 after page 09). The width is the number of digits in the
    # page count; ceil(log10(N)) is one short when N is a power of ten.
    output_template = ''.join([
        base,
        '_',
        '%0',
        str(len(str(pages))),
        'd',
        '.svg'
    ])

    # Iterate over all pages
    for nthpage in range(pages):
        page = pdffile.get_page(nthpage)

        # Output file name based on template
        outputname = output_template % (nthpage + 1)

        # Get the page dimensions
        width, height = page.get_size()

        # Open the SVG file to write on
        surface = cairo.SVGSurface(outputname, width, height)
        context = cairo.Context(surface)

        # Now we finally can render the PDF to SVG
        page.render_for_printing(context)
        context.show_page()

        # Free some memory
        surface.finish()
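
For the other route mentioned above, calling pdf2svg in a loop once the page count is known, a rough sketch follows. It assumes that pdf2svg accepts an optional third argument giving a 1-based page number; check the usage message of your pdf2svg build before relying on that. The helper names pdf2svg_commands and run_all are made up for this example.

```python
import subprocess


def pdf2svg_commands(inputname, base, n_pages):
    """Build one pdf2svg command line per page (page numbers assumed 1-based)."""
    width = len(str(n_pages))  # zero-pad so that file names sort in page order
    commands = []
    for page in range(1, n_pages + 1):
        outputname = "{0}_{1:0{2}d}.svg".format(base, page, width)
        commands.append(["pdf2svg", inputname, outputname, str(page)])
    return commands


def run_all(commands):
    """Run each command in turn; raises CalledProcessError if one fails."""
    for command in commands:
        subprocess.check_call(command)


# Only print the commands here; call run_all(...) to actually execute them
for command in pdf2svg_commands("input.pdf", "input", 3):
    print(" ".join(command))
```

Building the command lists separately from running them makes it easy to inspect what would be executed before touching any files.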

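Assembling the SVGs into a single PDF

To reassemble, you can convert the files by hand using the inkscape / stapler pair. But it is not hard to write code that does this. The code below uses rsvg and cairo to convert from SVG and merge everything into a single PDF: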

import rsvg
import cairo

def convert_merge(inputfiles, outputname):
    # We have to create a PDF surface and inform a size. The size is
    # irrelevant, though, as we will define the sizes of each page
    # individually.
    outputsurface = cairo.PDFSurface(outputname, 1, 1)
    outputcontext = cairo.Context(outputsurface)

    for inputfile in inputfiles:
        # Open the SVG
        svg = rsvg.Handle(file=inputfile)

        # Set the size of the page itself
        outputsurface.set_size(svg.props.width, svg.props.height)

        # Draw on the PDF
        svg.render_cairo(outputcontext)

        # Finish the page and start a new one
        outputcontext.show_page()

    # Free some memory
    outputsurface.finish()
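
The pages end up in the PDF in the order in which inputfiles lists them, so the list should be sorted by page number first. Below is a small sketch of one way to do that, assuming the files follow the base_<number>.svg naming used earlier; pages_in_order is a hypothetical helper, not part of rsvg or cairo.

```python
import os
import re


def pages_in_order(paths, base):
    """Keep the paths named <base>_<number>.svg and sort them by page number."""
    pattern = re.compile(r"^%s_(\d+)\.svg$" % re.escape(base))
    found = []
    for path in paths:
        match = pattern.match(os.path.basename(path))
        if match:
            found.append((int(match.group(1)), path))
    # Sorting on the integer page number puts page 10 after page 9
    return [path for _, path in sorted(found)]


listing = ["file_10.svg", "file_2.svg", "notes.txt", "file_1.svg"]
print(pages_in_order(listing, "file"))
```

Sorting numerically means the result is correct even when the file names were not zero-padded. The returned list could then be passed on, for example as convert_merge(pages_in_order(os.listdir("."), "file"), "merged.pdf").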

PS: It should be possible to use the pdftocairo command, but it does not seem to call render_for_printing(), which leaves the font information in the output SVG.

Answer 2

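What you really want is font substitution. You want some code or application that goes through the file and makes the appropriate changes to the embedded fonts.

This task is doable, and it ranges from easy to non-trivial. It is easy when your font matches the metrics of the font in the file and the encoding used for the font is sane. You could do it with iText or DotPdf (the latter is not free beyond evaluation, and is my company's product). If you modified pdf2ps, you might also manage to change the fonts on the way through.

If the font used in the file is a font subset with a creative re-encoding, then you are in hell, and making the changes can involve all sorts of pain. Here is why:

PostScript was designed when there was no Unicode. Adobe used a single byte per character, and whenever you rendered a string, the glyphs drawn were taken from a 256-entry table called the encoding vector. If the standard encodings did not have what you wanted, you were encouraged to make fonts on the fly, based on a standard font, that differed only in their encoding.

When Adobe created Acrobat, they wanted the transition from PostScript to be as easy as possible, so the font mechanism was modeled on PostScript's. When the ability to embed fonts in PDFs was added, it was clear that this would bloat the files, so PDF also included the ability to subset fonts. A font subset is made by taking an existing font, removing all the glyphs that are not used, and re-encoding it into the PDF. There may be no standard relationship between the encoding vector and the code points in the file; all of it can be changed. Instead, there may be an embedded PostScript function, /ToUnicode, that translates the encoded characters into a Unicode representation.

So yes, non-trivial.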
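
To make the re-encoding problem concrete, here is a toy model in Python. It is not real PDF parsing: the three tables are invented for illustration, and the names STANDARD, SUBSET and TO_UNICODE are hypothetical. It shows that the same stored bytes draw the right glyphs through the subset's own encoding, mean nothing under a standard encoding, and can only be turned back into text with a ToUnicode-style map.

```python
# Toy single-byte encodings: byte code -> glyph name (invented values)
STANDARD = {0x41: "A", 0x42: "B", 0x43: "C"}
SUBSET = {0x01: "C", 0x02: "A", 0x03: "B"}      # arbitrary codes in a subset
TO_UNICODE = {0x01: u"C", 0x02: u"A", 0x03: u"B"}  # optional map back to text


def glyphs(codes, encoding):
    """Glyph names a renderer would draw; None where the code is not encoded."""
    return [encoding.get(code) for code in bytearray(codes)]


def extract_text(codes, to_unicode=None):
    """Text recovered from the codes; without a map only raw bytes are left."""
    if to_unicode is None:
        return codes.decode("latin-1")
    return u"".join(to_unicode[code] for code in bytearray(codes))


content = b"\x02\x03\x01"  # the string as stored in the page content

print(glyphs(content, SUBSET))            # the subset font draws A, B, C
print(glyphs(content, STANDARD))          # a substitute font has no such codes
print(repr(extract_text(content)))        # raw codes are not readable text
print(extract_text(content, TO_UNICODE))  # the map recovers the text
```

In this model, swapping in a replacement font means rewriting either the encoding or every string that uses it, which is where the pain comes from.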

Answer 3

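I'm afraid that to vectorize the PDF you need the original fonts (or a lot of work).

Some possibilities that come to mind:

Dump the PDF uncompressed with pdftk and find out the font names, then look them up on FontMonster or other font services.
Use some online font recognition service to get a close match for your fonts, so that the kerning is preserved (I guess kerning and alignment are what make your text unreadable).
Try replacing the fonts manually (again, pdftk can convert the PDF into an uncompressed one that can be modified with sed; this modification will break the PDF, but pdftk will be able to recompress the broken PDF into a usable one).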
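
For the first suggestion, once the file has been uncompressed (for example with pdftk input.pdf output uncompressed.pdf uncompress), the font names can be pulled out with a short script. This is a rough sketch that scans the raw bytes for /BaseFont entries instead of parsing the PDF properly; font_names is a hypothetical helper and the sample bytes are made up. Subset fonts carry a tag of six uppercase letters and a plus sign in front of the name, which the sketch strips.

```python
import re

# A /BaseFont key followed by a PDF name object
BASEFONT = re.compile(br"/BaseFont\s*/([^\s/<>\[\]()]+)")
# Subset tag: six uppercase letters and a plus sign
SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def font_names(pdf_bytes):
    """Return the sorted, de-duplicated /BaseFont names found in raw PDF bytes."""
    names = set()
    for match in BASEFONT.finditer(pdf_bytes):
        name = match.group(1).decode("latin-1")
        names.add(SUBSET_TAG.sub("", name))
    return sorted(names)


# Made-up fragment of two font dictionaries
sample = (b"<< /Type /Font /BaseFont /ABCDEF+Garamond-Italic >> "
          b"<< /Type /Font /BaseFont /Helvetica >>")
print(font_names(sample))
```

On a real file this would be called as font_names(open("uncompressed.pdf", "rb").read()). Names that use #xx escapes are returned as written, without decoding.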

Answer 4

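For those who come after me: the best solution I found was to print to SVG using Evince, or to use the pdf2svg program, which is available through Synaptic on Mint. However, Inkscape could not handle the resulting SVG; it went into an infinite loop with this error message: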

File display/nr-arena-item.cpp line 323 (?): Assertion item->state & NR_ARENA_ITEM_STATE_BBOX failed

I have given up on this task for now, but maybe I will try again in a year or two. In the meantime, one of these solutions may work for you.

Converting PDF text to outlines?

4 answers:
