Question

Hello there. After a long time researching on Google and trying various code of my own, I have ended up asking here on Stack Overflow.

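Here is what I am trying to do: I have a PDF file that contains annotations, along with suggestions for those annotations, which are shown when the mouse hovers over the annotated word. For example, consider the image above, in which a word is struck through (marking it as incorrect); hovering over it opens a popup that shows the correct word. The same applies to another annotation, a caret (insertion) symbol.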

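I want to extract two lists of words: one with the correct words and one with the incorrect words in the file. Does anyone know how to do this? Any suggestions would be greatly appreciated.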

Answer 1

I just built a simple proof of concept (POC) with the SetaPDF-Extractor component (our commercial product), and this is the result:

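Sadly, the annotation "tree" in a PDF is not that simple. The POC just iterates over the annotations and creates filters from them, which are then used by the extractor component. Here is another demo that extracts the comment tree, which could be the basis for a more sorted/logical result.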

This is the code I used for the given output:

<?php
// load and register the autoload function
require_once('library/SetaPDF/Autoload.php');

// create a document instance
$document = SetaPDF_Core_Document::loadByFilename('camtown/Terms-and-Conditions - revised.pdf');
// initiate an extractor instance
$extractor = new SetaPDF_Extractor($document);

// get the document's pages object
$pages = $document->getCatalog()->getPages();

// we are going to save the extracted text in this variable
$results = [];
// map pages and filternames to annotation instances
$annotationsByPageAndFilterName = [];

// iterate over all pages
for ($pageNo = 1, $pageCount = $pages->count(); $pageNo <= $pageCount; $pageNo++) {
    // get the page object
    $page = $pages->getPage($pageNo);
    // get the annotations
    $annotations = array_filter($page->getAnnotations()->getAll(), function(SetaPDF_Core_Document_Page_Annotation $annotation) {
        switch ($annotation->getType()) {
            case SetaPDF_Core_Document_Page_Annotation::TYPE_HIGHLIGHT:
            case SetaPDF_Core_Document_Page_Annotation::TYPE_STRIKE_OUT:
            case SetaPDF_Core_Document_Page_Annotation::TYPE_CARET:
            case SetaPDF_Core_Document_Page_Annotation::TYPE_UNDERLINE:
                return true;
        }

        return false;
    });

    // create a strategy instance
    $strategy = new SetaPDF_Extractor_Strategy_ExactPlain();
    // create a multi filter instance
    $filter = new SetaPDF_Extractor_Filter_Multi();
    // and pass it to the strategy
    $strategy->setFilter($filter);

    // iterate over all highlight annotations
    foreach ($annotations AS $tmpId => $annotation) {
        /**
         * @var SetaPDF_Core_Document_Page_Annotation_Highlight $annotation
         */
        $name = 'P#' . $pageNo . '/HA#' . $tmpId;
        if ($annotation->getName()) {
            $name .= ' (' . $annotation->getName() . ')';
        }

        if ($annotation instanceof SetaPDF_Core_Document_Page_Annotation_TextMarkup) {
            // iterate over the quad points to setup our filter instances
            $quadpoints = $annotation->getQuadPoints();
            for ($pos = 0, $c = count($quadpoints); $pos < $c; $pos += 8) {
                $llx = min($quadpoints[$pos + 0], $quadpoints[$pos + 2], $quadpoints[$pos + 4], $quadpoints[$pos + 6]) - 1;
                $urx = max($quadpoints[$pos + 0], $quadpoints[$pos + 2], $quadpoints[$pos + 4], $quadpoints[$pos + 6]) + 1;
                $lly = min($quadpoints[$pos + 1], $quadpoints[$pos + 3], $quadpoints[$pos + 5], $quadpoints[$pos + 7]) - 1;
                $ury = max($quadpoints[$pos + 1], $quadpoints[$pos + 3], $quadpoints[$pos + 5], $quadpoints[$pos + 7]) + 1;

                // reduce it to a small line
                $diff = ($ury - $lly) / 2;
                $lly = $lly + $diff - 1;
                $ury = $ury - $diff - 1;

                // Add a new rectangle filter to the multi filter instance
                $filter->addFilter(
                    new SetaPDF_Extractor_Filter_Rectangle(
                        new SetaPDF_Core_Geometry_Rectangle($llx, $lly, $urx, $ury),
                        SetaPDF_Extractor_Filter_Rectangle::MODE_CONTACT,
                        $name
                    )
                );
            }
        }

        $annotationsByPageAndFilterName[$pageNo][$name] = $annotation;
    }

    // if no filters for this page defined, ignore it
    if (count($filter->getFilters()) === 0) {
        continue;
    }

    // pass the strategy to the extractor instance
    $extractor->setStrategy($strategy);
    // and get the results by the current page number
    $result = $extractor->getResultByPageNumber($pageNo);
    if ($result === '')
        continue;

    $results[$pageNo] = $result;
}

// debug output
foreach ($annotationsByPageAndFilterName AS $pageNo => $annotations) {
    echo '<h1>Page No #' . $pageNo . '</h1>';
    echo '<table border="1"><tr><th>Name</th><th>Text</th><th>Subject</th><th>Comment</th></tr>';
    foreach ($annotations AS $name => $annotation) {
        echo '<tr>';
        echo '<td>' . $name . '</td>';
        echo '<td><pre>' . ($results[$pageNo][$name] ?? '') . '</pre></td>';
        echo '<td><pre>' . $annotation->getSubject() . '</pre></td>';
        echo '<td><pre>' . $annotation->getContents() . '</pre></td>';
        echo '</tr>';
    }

    echo '</table>';
}
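
The question asks for two word lists rather than a table. A minimal sketch of that last step, assuming the rows have already been collected in the shape of the debug table above (annotation type, text under the markup, comment contents): the `$rows` data, the `splitCorrections()` helper, and the rule that the comment of a strike-out or caret annotation holds the suggested word are illustrative assumptions, not part of SetaPDF.

```php
<?php
// Hypothetical rows shaped like the debug table above. The type strings are
// the PDF annotation subtype names; the words are made-up sample data.
$rows = [
    ['type' => 'StrikeOut', 'text' => 'recieve', 'comment' => 'receive'],
    ['type' => 'Caret',     'text' => '',        'comment' => 'the'],
    ['type' => 'Highlight', 'text' => 'Terms',   'comment' => ''],
];

/**
 * Split annotation rows into incorrect words and suggested corrections.
 * Assumption: struck-out text is the incorrect word, and the comment of a
 * strike-out or caret annotation is the suggested (correct) word.
 */
function splitCorrections(array $rows): array
{
    $incorrect = [];
    $correct = [];

    foreach ($rows as $row) {
        $text = trim($row['text']);
        $comment = trim($row['comment']);

        if ($row['type'] === 'StrikeOut' && $text !== '') {
            $incorrect[] = $text;
        }
        if (in_array($row['type'], ['StrikeOut', 'Caret'], true) && $comment !== '') {
            $correct[] = $comment;
        }
    }

    return ['incorrect' => $incorrect, 'correct' => $correct];
}

$lists = splitCorrections($rows);
print_r($lists);
```

In a real document the pairing may need the comment tree mentioned above, since a replacement can be stored as a caret annotation that a strike-out refers to.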

Answer 2

Have you tried this parser?

Features

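Loads and parses objects and headers
Extracts metadata (author, description, keywords, etc.)
Extracts text from ordered pages
Supports compressed PDFs
Supports charset encodings (WinAnsi, MacRoman)
Handles hexadecimal and octal content encoding
PSR-0 compliant (autoloader)
Compatible with Composer
PSR-1 compliant (code style)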

https://pdfparser.org/demo
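
A minimal usage sketch, assuming the parser behind the linked demo is the `smalot/pdfparser` Composer package. The autoload path, the `document.pdf` file name, and the `readPdfTextAndDetails()` helper are placeholders. It covers only the text and metadata features listed above; whether the library exposes annotations is not stated in this answer.

```php
<?php
// Assumes smalot/pdfparser was installed with Composer; the autoload path
// and 'document.pdf' are placeholders.
$autoload = __DIR__ . '/vendor/autoload.php';
if (is_file($autoload)) {
    require_once $autoload;
}

/**
 * Return the metadata and the plain text of a PDF file.
 */
function readPdfTextAndDetails(string $filename): array
{
    $parser = new \Smalot\PdfParser\Parser();
    $pdf = $parser->parseFile($filename);

    return [
        'details' => $pdf->getDetails(), // metadata such as Author or Title
        'text'    => $pdf->getText(),    // text of all pages
    ];
}

if (class_exists(\Smalot\PdfParser\Parser::class) && is_file('document.pdf')) {
    $data = readPdfTextAndDetails('document.pdf');
    print_r($data['details']);
    echo $data['text'], PHP_EOL;
} else {
    echo 'Install smalot/pdfparser and provide document.pdf to run this sketch.', PHP_EOL;
}
```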

Answer 3

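You need to extract information about the markup annotations present on the page and the contents of their associated child popup annotations (what you call "suggestions"). You can use the position of a markup annotation to match it against the text displayed at that position on the page. That gives you the two pieces of information you need.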
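
The position matching described here can be sketched as a rectangle overlap test. This is only an illustration under stated assumptions: the word boxes, the annotation rectangle, and the sample words are made-up data, and `rectsOverlap()` and `wordsUnderAnnotation()` are hypothetical helpers. A real implementation would take the word positions from a text extractor and the rectangle from the annotation's quad points.

```php
<?php
// Rectangles are [llx, lly, urx, ury] in PDF user space (made-up values).

/** Return true if the two rectangles share some area. */
function rectsOverlap(array $a, array $b): bool
{
    return $a[0] < $b[2] && $b[0] < $a[2]
        && $a[1] < $b[3] && $b[1] < $a[3];
}

/** Collect the words whose boxes overlap the annotation rectangle. */
function wordsUnderAnnotation(array $annotationRect, array $words): array
{
    $hits = [];
    foreach ($words as $word) {
        if (rectsOverlap($annotationRect, $word['rect'])) {
            $hits[] = $word['text'];
        }
    }

    return $hits;
}

// Hypothetical word boxes, as a text extractor might report them.
$words = [
    ['text' => 'You',    'rect' => [72.0, 700.0, 92.0, 712.0]],
    ['text' => 'expend', 'rect' => [95.0, 700.0, 130.0, 712.0]],
    ['text' => 'time',   'rect' => [133.0, 700.0, 155.0, 712.0]],
];

// Hypothetical strike-out annotation with its suggestion as contents.
$strikeOut = ['rect' => [96.0, 704.0, 129.0, 708.0], 'contents' => 'spend'];

$matched = wordsUnderAnnotation($strikeOut['rect'], $words);
echo implode(' ', $matched), ' -> ', $strikeOut['contents'], PHP_EOL;
```

The matched words form the "incorrect" side and the annotation contents the "correct" side of the pair.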

Reading annotations from a PDF file
