Question

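The documentation isn't very clear to me. So far I gather that I need to set up a CGPDFOperatorTable and then, for each PDF page, create a content stream with CGPDFContentStreamCreateWithPage and a scanner with CGPDFScannerCreate.

The documentation refers to setting up callbacks, but it isn't clear to me how. How do I actually get the content from a page?

This is my code so far.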

    let pdfURL = NSBundle.mainBundle().URLForResource("titleofdocument", withExtension: "pdf")

    // Create pdf document
    let pdfDoc = CGPDFDocumentCreateWithURL(pdfURL)

    // Number of pages in this PDF
    let numberOfPages = CGPDFDocumentGetNumberOfPages(pdfDoc) as Int

    if numberOfPages <= 0 {
        // The number of pages is zero
        return
    }

    let myTable = CGPDFOperatorTableCreate()

    // Let's go through every page
    for pageNr in 1...numberOfPages {

        let thisPage = CGPDFDocumentGetPage(pdfDoc, pageNr)
        let myContentStream = CGPDFContentStreamCreateWithPage(thisPage)
        let myScanner = CGPDFScannerCreate(myContentStream, myTable, nil)

        CGPDFScannerScan(myScanner)

        // Search for Content here?
        // ??

        CGPDFScannerRelease(myScanner)
        CGPDFContentStreamRelease(myContentStream)

    }

    // Release Table
    CGPDFOperatorTableRelease(myTable)

There is a similar question, "PDF Parsing with SWIFT", but it has no answers yet.

Answer 1

Here is an example of the callbacks implemented in Swift:

    let operatorTableRef = CGPDFOperatorTableCreate()

    CGPDFOperatorTableSetCallback(operatorTableRef, "BT") { (scanner, info) in
        print("Begin text object")
    }
    CGPDFOperatorTableSetCallback(operatorTableRef, "ET") { (scanner, info) in
        print("End text object")
    }
    CGPDFOperatorTableSetCallback(operatorTableRef, "Tf") { (scanner, info) in
        print("Select font")
    }
    CGPDFOperatorTableSetCallback(operatorTableRef, "Tj") { (scanner, info) in
        print("Show text")
    }
    CGPDFOperatorTableSetCallback(operatorTableRef, "TJ") { (scanner, info) in
        print("Show text, allowing individual glyph positioning")
    }

    let numPages = CGPDFDocumentGetNumberOfPages(pdfDocument)
    for pageNum in 1...numPages {
        let page = CGPDFDocumentGetPage(pdfDocument, pageNum)
        let stream = CGPDFContentStreamCreateWithPage(page)
        let scanner = CGPDFScannerCreate(stream, operatorTableRef, nil)
        CGPDFScannerScan(scanner)
        CGPDFScannerRelease(scanner)
        CGPDFContentStreamRelease(stream)
    }
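
The callbacks above only log which operator was hit. As a minimal sketch of getting at the text itself, a `Tj` callback can pop its string operand off the scanner. This assumes a recent Swift toolchain on an Apple platform (newer than the free-function style used above); `showText` and `makeTextTable` are names made up for this illustration. Note that the popped bytes are character codes in the current font's encoding, so the decoded string is not guaranteed to be readable text.

```swift
import CoreGraphics
import Foundation

// Callback for "Tj", whose single operand is a PDF string.
// Operator callbacks are C function pointers, so the closure must not capture anything.
let showText: CGPDFOperatorCallback = { scanner, _ in
    var pdfString: CGPDFStringRef?
    // Pop the operand; this fails if the top of the operand stack is not a string.
    guard CGPDFScannerPopString(scanner, &pdfString),
          let string = pdfString,
          let text = CGPDFStringCopyTextString(string) else { return }
    print("Tj:", text as String)
}

// Builds an operator table that reacts to "Tj" only (illustrative helper).
// The caller is responsible for releasing it with CGPDFOperatorTableRelease.
func makeTextTable() -> CGPDFOperatorTableRef? {
    guard let table = CGPDFOperatorTableCreate() else { return nil }
    CGPDFOperatorTableSetCallback(table, "Tj", showText)
    return table
}
```

The table returned by `makeTextTable()` would take the place of `operatorTableRef` in the page loop above.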

Answer 2

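You've actually described exactly how to do it; all that's left is to put the pieces together and experiment until it works.

First, you need to set up a table with callbacks, as you state yourself at the start of your question (all code here is Objective-C, not Swift):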

    CGPDFOperatorTableRef operatorTable = CGPDFOperatorTableCreate();
    CGPDFOperatorTableSetCallback(operatorTable, "q", &op_q);
    CGPDFOperatorTableSetCallback(operatorTable, "Q", &op_Q);

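This table holds the list of PDF operators you want to be called back for, and associates a callback with each of them. Those callbacks are just functions you define elsewhere: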

    static void op_q(CGPDFScannerRef s, void *info) {
        // Do whatever you have to do in here
        // info is whatever you passed to CGPDFScannerCreate
    }

    static void op_Q(CGPDFScannerRef s, void *info) {
        // Do whatever you have to do in here
        // info is whatever you passed to CGPDFScannerCreate
    }

Then you create the scanner and start it running, passing in the table you just defined.

    // Passing "self" is just an example, you can pass whatever you want and it will be provided to your callback whenever it is called by the scanner.
    CGPDFScannerRef contentStreamScanner = CGPDFScannerCreate(contentStream, operatorTable, self);

    CGPDFScannerScan(contentStreamScanner);

If you want to see a complete example, with source code, of how to find and process images, check this website.
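
Since the question asks about Swift, here is a rough Swift counterpart of the `info` mechanism, as a sketch only. It assumes a recent Swift toolchain on an Apple platform; `GraphicsStateDepth`, `depth(from:)` and `deepestGraphicsStateNesting(on:)` are hypothetical names, not Core Graphics API. Swift callbacks for the operator table cannot capture local variables, so the object travels through the raw `info` pointer via `Unmanaged`.

```swift
import CoreGraphics

// Hypothetical state object handed to the scanner as `info`.
final class GraphicsStateDepth {
    var current = 0
    var deepest = 0
}

// Turns the raw info pointer back into the object; no ownership is transferred.
func depth(from info: UnsafeMutableRawPointer?) -> GraphicsStateDepth? {
    guard let info = info else { return nil }
    return Unmanaged<GraphicsStateDepth>.fromOpaque(info).takeUnretainedValue()
}

// Scans one page and reports how deeply q/Q pairs are nested on it.
func deepestGraphicsStateNesting(on page: CGPDFPage) -> Int {
    guard let table = CGPDFOperatorTableCreate() else { return 0 }
    defer { CGPDFOperatorTableRelease(table) }

    // "q" saves the graphics state, "Q" restores it.
    CGPDFOperatorTableSetCallback(table, "q") { _, info in
        guard let state = depth(from: info) else { return }
        state.current += 1
        state.deepest = max(state.deepest, state.current)
    }
    CGPDFOperatorTableSetCallback(table, "Q") { _, info in
        if let state = depth(from: info) { state.current -= 1 }
    }

    let tracker = GraphicsStateDepth()
    let stream = CGPDFContentStreamCreateWithPage(page)
    defer { CGPDFContentStreamRelease(stream) }
    // The tracker must outlive the scan because the pointer is unretained.
    let scanner = CGPDFScannerCreate(stream, table,
                                     Unmanaged.passUnretained(tracker).toOpaque())
    defer { CGPDFScannerRelease(scanner) }
    CGPDFScannerScan(scanner)
    return tracker.deepest
}
```

Passing the pointer unretained keeps ownership simple here because the tracker lives for the whole function; an object that has to outlive the call would need `passRetained` and a matching release.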

Answer 3

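To understand why the parser works this way, you need to read the PDF specification more closely. A PDF file contains something close to printing instructions, for example "move to this coordinate, print this character, move over there, change the color, print character number 23 from font #23", and so on.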

The parser gives you a callback for each instruction and lets you retrieve that instruction's arguments. That's it.
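
As a small illustration of retrieving arguments, here is a sketch of a callback for the `Td` operator ("tx ty Td" moves the text position). It assumes a recent Swift toolchain on an Apple platform, and `moveText` is a made-up name. The operands sit on a stack, so they are popped in reverse order of how they appear in the content stream.

```swift
import CoreGraphics

// Callback for "Td". The last operand written (ty) is popped first.
let moveText: CGPDFOperatorCallback = { scanner, _ in
    var tx: CGPDFReal = 0
    var ty: CGPDFReal = 0
    // Each pop fails if the top of the stack is not a number.
    guard CGPDFScannerPopNumber(scanner, &ty),
          CGPDFScannerPopNumber(scanner, &tx) else { return }
    print("Td: move text position by (\(tx), \(ty))")
}

// Registration on an existing operator table would look like:
// CGPDFOperatorTableSetCallback(table, "Td", moveText)
```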

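So, to get the content out of a file, you need to rebuild its state by hand. That means recomputing the frames of all the characters and then trying to reverse-engineer the page layout. This is clearly not an easy task, which is why people have written libraries to do it.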
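
To give a feel for what "rebuilding the state" means, here is a deliberately tiny, self-contained sketch in plain Swift. The `Instruction` enum, `PlacedText` and `placeText` are invented for this illustration and stand in for what the scanner callbacks would report. It only tracks the start-of-line text position and ignores glyph advance, fonts, text scaling and the transformation matrix, all of which a real extractor has to handle.

```swift
// A toy model of a few text operators as the callbacks might report them.
enum Instruction {
    case beginText                         // BT
    case moveText(dx: Double, dy: Double)  // Td
    case showText(String)                  // Tj
    case endText                           // ET
}

struct PlacedText: Equatable {
    let x: Double
    let y: Double
    let text: String
}

// Replays the instructions while tracking the current text position,
// recording where each string would start.
func placeText(_ instructions: [Instruction]) -> [PlacedText] {
    var x = 0.0
    var y = 0.0
    var placed: [PlacedText] = []
    for instruction in instructions {
        switch instruction {
        case .beginText:
            x = 0; y = 0            // BT resets the text position
        case let .moveText(dx, dy):
            x += dx; y += dy        // Td offsets from the start of the current line
        case let .showText(text):
            placed.append(PlacedText(x: x, y: y, text: text))
        case .endText:
            break
        }
    }
    return placed
}

let sample: [Instruction] = [
    .beginText,
    .moveText(dx: 72, dy: 700), .showText("Hello"),
    .moveText(dx: 0, dy: -14), .showText("World"),
    .endText,
]
let result = placeText(sample)
// result[0] is "Hello" at (72, 700); result[1] is "World" at (72, 686)
```

Even in this toy form, the text only becomes meaningful once positions are attached to it; grouping the placed strings into lines and paragraphs is the layout reverse-engineering mentioned above.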

You might want to take a look at PDFKitten, or at PDFParser, which is a Swift port of it with some improvements that I made.

How to parse content from PDF pages with Swift
