自上周末以来,我一直在研究PDF解析。管理找到Java的Apache PDFBox库,我已经开始提取我正在开发的项目所需的页面,URL,图像和PDF元数据分隔的文本。现在我错过了一种从PDF中提取嵌入式Flash视频的方法。
我目前正在分析此解析器如何从PDF中提取富媒体,为测试目的使用可用的pdf文件here。此文件包含我想要获取的Flash视频。
我已经尝试使用这个approach来搜索PDF中的嵌入文件,但它目前不适合我,因为它找到并保存我创建的文件夹中的任何内容以存储此类文件。
我的代码目前的样子,改编自上述方法。
package myproject;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentNameDictionary;
import org.apache.pdfbox.pdmodel.PDEmbeddedFilesNameTreeNode;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDNameTreeNode;
import org.apache.pdfbox.pdmodel.common.filespecification.PDComplexFileSpecification;
import org.apache.pdfbox.pdmodel.common.filespecification.PDEmbeddedFile;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationFileAttachment;
/**
* This is an example on how to extract all embedded files from a PDF document.
*
*/
public final class ExtractEmbeddedFiles
{
private ExtractEmbeddedFiles()
{
}
/**
* This is the main method.
*
* @param args The command line arguments.
*
* @throws IOException If there is an error parsing the document.
*/
public static void main( String[] args ) throws IOException
{
PDDocument document = null;
try
{
File pdfFile = new File("/Users/henriqueferreira/Documents/PDFBoxDocuments/inOntario.pdf");
String filePath = pdfFile.getParent() + System.getProperty("file.separator");
document = PDDocument.load(new File("/Users/henriqueferreira/Documents/PDFBoxDocuments/inOntario.pdf"));
PDDocumentNameDictionary namesDictionary =
new PDDocumentNameDictionary( document.getDocumentCatalog() );
PDEmbeddedFilesNameTreeNode efTree = namesDictionary.getEmbeddedFiles();
if (efTree != null)
{
Map<String, PDComplexFileSpecification> names = efTree.getNames();
if (names != null)
{
extractFiles(names, filePath);
}
else
{
List<PDNameTreeNode<PDComplexFileSpecification>> kids = efTree.getKids();
for (PDNameTreeNode<PDComplexFileSpecification> node : kids)
{
names = node.getNames();
extractFiles(names, filePath);
}
}
}
// extract files from annotations
for (PDPage page : document.getPages())
{
for (PDAnnotation annotation : page.getAnnotations())
{
if (annotation instanceof PDAnnotationFileAttachment)
{
PDAnnotationFileAttachment annotationFileAttachment = (PDAnnotationFileAttachment) annotation;
PDComplexFileSpecification fileSpec = (PDComplexFileSpecification) annotationFileAttachment.getFile();
PDEmbeddedFile embeddedFile = getEmbeddedFile(fileSpec);
extractFile(filePath, fileSpec.getFilename(), embeddedFile);
}
}
}
}
finally
{
if( document != null )
{
document.close();
}
}
}
private static void extractFiles(Map<String, PDComplexFileSpecification> names, String filePath)
throws IOException
{
for (Entry<String, PDComplexFileSpecification> entry : names.entrySet())
{
String filename = entry.getKey();
PDComplexFileSpecification fileSpec = entry.getValue();
PDEmbeddedFile embeddedFile = getEmbeddedFile(fileSpec);
extractFile(filePath, filename, embeddedFile);
}
}
private static void extractFile(String filePath, String filename, PDEmbeddedFile embeddedFile)
throws IOException
{
String embeddedFilename = filePath + filename;
File file = new File("/Users/henriqueferreira/Documents/PDFBoxFiles/"+filename);
System.out.println("Writing " + embeddedFilename);
try (FileOutputStream fos = new FileOutputStream(file))
{
fos.write(embeddedFile.toByteArray());
}
}
private static PDEmbeddedFile getEmbeddedFile(PDComplexFileSpecification fileSpec )
{
// search for the first available alternative of the embedded file
PDEmbeddedFile embeddedFile = null;
if (fileSpec != null)
{
embeddedFile = fileSpec.getEmbeddedFileUnicode();
if (embeddedFile == null)
{
embeddedFile = fileSpec.getEmbeddedFileDos();
}
if (embeddedFile == null)
{
embeddedFile = fileSpec.getEmbeddedFileMac();
}
if (embeddedFile == null)
{
embeddedFile = fileSpec.getEmbeddedFileUnix();
}
if (embeddedFile == null)
{
embeddedFile = fileSpec.getEmbeddedFile();
}
}
return embeddedFile;
}
}
所以,我的问题是,从PDF文件中获取此类Flash视频的最合适的方法是什么?
答案 0 :(得分:1)
这里有一些基于我在PDFDebugger中看到的快速代码:
PDDocument doc = PDDocument.load(new File("Mississauga_Advantages.pdf"));
for (int p = 0; p < doc.getNumberOfPages(); ++p)
{
PDPage page = doc.getPage(p);
List<PDAnnotation> annotations = page.getAnnotations();
for (PDAnnotation ann : annotations)
{
if ("RichMedia".equals(ann.getSubtype()))
{
COSArray array = (COSArray) ann.getCOSObject().getObjectFromPath("RichMediaContent/Assets/Names/");
String name = array.getString(0);
COSDictionary filespec = (COSDictionary) array.getObject(1);
PDComplexFileSpecification cfs = new PDComplexFileSpecification(filespec);
PDEmbeddedFile embeddedFile = cfs.getEmbeddedFile();
System.out.println("page: " + (p+1) + ", name: " + name + ", size: " + embeddedFile.createInputStream().available());
}
}
}
您的富媒体正在注释中。所以我浏览了列表并查找了我看到的模式。我不知道这是否是标准的,我在PDF 32000规范中找不到它。 (更新:我在编写代码后发现here)