Copy only the necessary objects from a PDF file

Time: 2019-03-21 15:31:09

Tags: c++ pdf podofo

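I have a huge PDF file with more than 100 pages, and I want to split it into single-page PDF files (one page per file). The problem is that, because of references, PoDoFo does not copy just the page but the whole document, so each of the 100 PDF files ends up the same size as the 100-page PDF. There is a related mailing list thread, but unfortunately it offers no solution.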

The source code of the function InsertPages contains an explanation:

  

此功能的工作方式与预期的有所不同。         而不是一次复制一页-我们复制整个文档         然后删除我们不感兴趣的页面。

  We do this because 
  1) SIGNIFICANTLY simplifies the process
  2) Guarantees that shared objects aren't copied multiple times
  3) offers MUCH faster performance for the common cases
     

但是:因为PoDoFo当前不执行任何类型的“对象   垃圾收集”期间         一个Write()-我们将得到更大的文档,因为来自未使用页面的数据         也会在那里。

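I have tried several ways to copy only the relevant objects, but each of them failed:

  • Copying all pages and deleting the irrelevant ones
  • Wrapping the page in an XObject using FillXObjectFromDocumentPage and FillXObjectFromExistingPage
  • Copying the objects one by one
  • Using RenumberObjects with bDoGarbageCollection = true

None of them solved the problem. Does anyone have an idea or a working solution for this?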

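2 answers:

Answer 0 (score: 2)

The only solution is to use another PDF library, or to wait until garbage collection is implemented.

The problem is stated in the quote you mentioned: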

> during a Write() - we will end up with larger documents, since the
> data from unused pages will also be in there.

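This means that PoDoFo always writes the entire PDF content to the file, no matter what. The whole PDF is still in there; you just cannot see part of it.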
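
To make that effect concrete, here is a toy model. It is not PoDoFo code: `ToyDocument`, `makeDocument`, `keepOnlyPage` and `writtenObjectCount` are made-up names, and the model simply assumes what the quoted comment describes, namely that deleting a page only unlinks it from the page list and that a write emits every object still in the object table.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a PDF document: an object table plus a page list.
// This is NOT the PoDoFo API; every name here is hypothetical.
struct ToyDocument {
    std::vector<int> objects;  // object numbers stored in the file
    std::vector<int> pages;    // object numbers of pages still in the page list
};

// Build a document with pageCount pages; each page owns one content object.
ToyDocument makeDocument(int pageCount) {
    assert(pageCount >= 0);
    ToyDocument doc;
    for (int i = 0; i < pageCount; ++i) {
        int pageObj = 2 * i + 1;
        int contentObj = 2 * i + 2;
        doc.objects.push_back(pageObj);
        doc.objects.push_back(contentObj);
        doc.pages.push_back(pageObj);
    }
    return doc;
}

// "Delete" every page except one by unlinking the others from the page list.
// The object table is left untouched, mirroring the behaviour described above.
void keepOnlyPage(ToyDocument& doc, std::size_t index) {
    assert(index < doc.pages.size());
    int kept = doc.pages[index];
    doc.pages.assign(1, kept);
}

// Without garbage collection, a write emits the whole object table.
std::size_t writtenObjectCount(const ToyDocument& doc) {
    return doc.objects.size();
}
```

In this model, calling `makeDocument(100)` and then `keepOnlyPage(doc, 0)` shrinks the page list to a single entry, yet `writtenObjectCount(doc)` still reports all 200 toy objects. That is the same reason every split file comes out as large as the original.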

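Answer 1 (score: 0)

Dennis from support gave me the idea, so I am posting this as an answer even though I have not tested it yet. Once I am done, I will update the answer with the working solution.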

void PdfMemDocument::InsertPages2(const PdfMemDocument & rDoc, std::vector<int> pageNumbers)
{
    std::unordered_set<PdfObject*> totalSet;
    std::vector<pdf_objnum> oldObjNumPages;
    std::unordered_map<pdf_objnum, pdf_objnum> oldObjNumToNewObjNum;

    std::vector<PdfObject*> newPageObjects;

    // Collect all dependencies from all pages that are to be copied
    for (size_t i = 0; i < pageNumbers.size(); ++i) {
        PdfPage* page = rDoc.GetPage(pageNumbers[i]);
        if (page) {
            oldObjNumPages.push_back(page->GetObject()->Reference().ObjectNumber());
            std::unordered_set<PdfObject*> *set = page->GetPageDependencies();
            totalSet.insert(set->begin(), set->end());
            delete set;
        }
    }

    // Create a new page object for every copied page from the old document
    // Copy all objects the pages depend on to the new document
    for (auto it = totalSet.begin(); it != totalSet.end(); ++it) {
        unsigned int length = static_cast<unsigned int>(GetObjects().GetSize() + GetObjects().GetFreeObjects().size());
        PdfReference ref(static_cast<unsigned int>(length+1), 0);
        PdfObject* pObj = new PdfObject(ref, *(*it));
        pObj->SetOwner(&(GetObjects()));
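        if ((*it)->HasStream()) {
            // Copy the stream data as well; use a distinct name so the
            // outer `length` (the new object number base) is not shadowed
            PdfStream *stream = (*it)->GetStream();
            pdf_long streamLength;
            char* buf;
            stream->GetCopy(&buf, &streamLength);
            PdfMemoryInputStream inputStream(buf, streamLength);
            pObj->GetStream()->SetRawData(&inputStream, streamLength);
            free(buf);
        }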
        oldObjNumToNewObjNum.insert(std::pair<pdf_objnum, pdf_objnum>((*it)->Reference().ObjectNumber(), length+1));
        GetObjects().push_back(pObj);
        newPageObjects.push_back(pObj);
    }

    // In all copied objects, fix the object numbers so they are valid in the new document
    for (auto it = newPageObjects.begin(); it != newPageObjects.end(); ++it) {
        FixPageReferences(GetObjects(), *it, oldObjNumToNewObjNum);
    }

    // Insert the copied pages into the pages tree
    for (auto it = oldObjNumPages.begin(); it != oldObjNumPages.end(); ++it) {
        PdfObject* pageObject = GetObjects().GetObject(PdfReference(oldObjNumToNewObjNum[(*it)], 0));
        PdfPage *page = new PdfPage(pageObject, std::deque<PdfObject*>());
        GetPagesTree()->InsertPage(GetPageCount() - 1, page);
    }

}

std::unordered_set<PdfObject *>* PdfPage::GetPageDependencies() const
{
    std::unordered_set<PdfObject *> *set = new std::unordered_set<PdfObject *>();

    const PdfObject* pageObj = GetObject();
    if (pageObj) {
        PdfVecObjects* objects = pageObj->GetOwner();
        if (objects) {
            set->insert((PdfObject*)pageObj);
            objects->GetObjectDependencies2(pageObj, *set);
        }
    }

    return set;
}

// Optimized version of PdfVecObjects::GetObjectDependencies
void PdfVecObjects::GetObjectDependencies2(const PdfObject* pObj, std::unordered_set<PdfObject*> &refMap) const
{
    // Check objects referenced from this object
    if (pObj->IsReference())
    {
        PdfObject* referencedObject = GetObject(pObj->GetReference());
        if (referencedObject != NULL && refMap.count(referencedObject) < 1) {
            (refMap).insert((PdfObject *)referencedObject); // Insert referenced object
            GetObjectDependencies2((const PdfObject*)referencedObject, refMap);
        }
    }
    else {
        // Recursion
        if (pObj->IsArray())
        {
            PdfArray::const_iterator itArray = pObj->GetArray().begin();
            while (itArray != pObj->GetArray().end())
            {
                GetObjectDependencies2(&(*itArray), refMap);
                ++itArray;
            }
        }
        else if (pObj->IsDictionary())
        {
            TCIKeyMap itKeys = pObj->GetDictionary().GetKeys().begin();
            while (itKeys != pObj->GetDictionary().GetKeys().end())
            {
                if ((*itKeys).first != PdfName("Parent")) {
                    GetObjectDependencies2((*itKeys).second, refMap);
                }
                ++itKeys;
            }
        }
    }
}
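
The idea behind the code above can be tried out on a toy object graph without PoDoFo. The sketch below is only an illustration under assumptions: `ToyGraph`, `collectDependencies`, `renumber` and `makeExampleGraph` are hypothetical names, and the graph is a plain map from object number to (key, referenced object) pairs. It shows the two steps the answer relies on: walking all references from a page while never following "Parent" (which would lead back to the pages tree and from there to every other page), then mapping the collected old object numbers to fresh ones.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Toy object graph: object number -> list of (dictionary key, referenced object).
// NOT the PoDoFo data model; all names are hypothetical.
using Edge = std::pair<std::string, int>;
using ToyGraph = std::map<int, std::vector<Edge>>;

// Collect every object reachable from `start`, never following "Parent".
void collectDependencies(const ToyGraph& graph, int start, std::set<int>& seen) {
    auto it = graph.find(start);
    if (it == graph.end()) {
        return;  // dangling reference: nothing to copy
    }
    if (!seen.insert(start).second) {
        return;  // already visited, which also guards against reference cycles
    }
    for (const Edge& edge : it->second) {
        if (edge.first != "Parent") {
            collectDependencies(graph, edge.second, seen);
        }
    }
}

// Assign consecutive new object numbers, starting at firstFree.
std::map<int, int> renumber(const std::set<int>& objects, int firstFree) {
    assert(firstFree > 0);
    std::map<int, int> oldToNew;
    int next = firstFree;
    for (int oldNum : objects) {
        oldToNew[oldNum] = next++;
    }
    return oldToNew;
}

// Example: pages tree 1 has pages 2 and 5, which share font object 4.
ToyGraph makeExampleGraph() {
    ToyGraph g;
    g[1] = std::vector<Edge>{Edge("Kids", 2), Edge("Kids", 5)};
    g[2] = std::vector<Edge>{Edge("Parent", 1), Edge("Contents", 3), Edge("Font", 4)};
    g[3] = std::vector<Edge>();
    g[4] = std::vector<Edge>();
    g[5] = std::vector<Edge>{Edge("Parent", 1), Edge("Contents", 6), Edge("Font", 4)};
    g[6] = std::vector<Edge>();
    return g;
}
```

Collecting from page 2 in the example graph yields objects 2, 3 and 4 only; the pages tree and the second page stay out because "Parent" is skipped. `renumber` then maps them to 1, 2 and 3 when the target document is empty. References inside the copied objects still have to be rewritten with that map, which is what FixPageReferences is meant to do in the answer's code.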