How to use PDFBox to read specific strings from the content of a PDF file

Time: 2017-11-30 03:34:15

Tags: java

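I want to write a program that extracts the subject, author, abstract, and other information from a paper. Is this possible? How should I go about it?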

1 answer:

Answer 0 (score: 0)

Assuming you have already added the PDFBox jar to your project, the code below retrieves some basic document properties of a PDF:

    import java.io.File;
    import java.io.IOException;

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDDocumentInformation;

    public class ReadPdf {
        public static void main(String[] args) throws IOException {
            File file = new File("C:/Users/user1/Desktop/test.pdf");

            // Load the existing document. try-with-resources closes it even if
            // an exception is thrown while reading the properties.
            // Note: PDDocument.load(File) is the PDFBox 2.x API; PDFBox 3.x
            // uses org.apache.pdfbox.Loader.loadPDF(file) instead.
            try (PDDocument document = PDDocument.load(file)) {
                // Get the PDDocumentInformation object (the document metadata)
                PDDocumentInformation pdd = document.getDocumentInformation();

                // Print the metadata fields; a getter returns null if the field is not set
                System.out.println("Author of the document is: " + pdd.getAuthor());
                System.out.println("Title of the document is: " + pdd.getTitle());
                System.out.println("Subject of the document is: " + pdd.getSubject());
                System.out.println("Creator of the document is: " + pdd.getCreator());
                System.out.println("Keywords of the document are: " + pdd.getKeywords());

                // The date getters return java.util.Calendar, so print the Date value
                if (pdd.getCreationDate() != null) {
                    System.out.println("Creation date of the document is: "
                            + pdd.getCreationDate().getTime());
                }
                if (pdd.getModificationDate() != null) {
                    System.out.println("Modification date of the document is: "
                            + pdd.getModificationDate().getTime());
                }
            }
        }
    }
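
Two things are worth handling when printing these properties: the string getters return null when a field is absent from the PDF, and getCreationDate() and getModificationDate() return a java.util.Calendar rather than a string. A minimal sketch of a formatting helper follows; MetadataFormat and its method names are hypothetical (they are not part of PDFBox), and the date pattern is just one reasonable choice.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Locale;

// Hypothetical helper (not part of PDFBox): renders metadata values for printing.
final class MetadataFormat {
    private MetadataFormat() {}

    // Returns a placeholder when a text field is missing or blank.
    static String text(String value) {
        if (value == null || value.trim().isEmpty()) {
            return "(not set)";
        }
        return value.trim();
    }

    // Formats a Calendar as yyyy-MM-dd HH:mm:ss in the calendar's own time zone.
    static String date(Calendar value) {
        if (value == null) {
            return "(not set)";
        }
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss", Locale.ROOT);
        fmt.setTimeZone(value.getTimeZone());
        return fmt.format(value.getTime());
    }
}
```

With this in place, a call such as MetadataFormat.text(pdd.getAuthor()) or MetadataFormat.date(pdd.getCreationDate()) can be passed to System.out.println without a separate null check.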

For more document properties, see here. HTH.
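
The document properties do not include the abstract, and many papers leave the author and subject fields empty, so the abstract usually has to be pulled from the page text. Assuming you have already obtained the text as a String (for example with PDFBox's PDFTextStripper.getText(document)), the following is a rough heuristic sketch, not a reliable parser: it looks for an "Abstract" heading and captures everything up to a following "Keywords", "Index Terms", or "Introduction" heading. AbstractFinder and the choice of headings are assumptions made for illustration; real papers vary and may need a different pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: a heuristic that locates the abstract in extracted text.
final class AbstractFinder {
    private AbstractFinder() {}

    // Case-insensitive, dot matches newlines. Captures lazily from just after
    // the word "abstract" until a line starting with a typical next heading,
    // or until the end of the text if no such heading is found.
    private static final Pattern ABSTRACT = Pattern.compile(
            "(?is)\\babstract\\b[\\s:.\\-]*(.+?)"
            + "(?=\\n\\s*(?:keywords?|index terms|(?:1\\.?|I\\.)?\\s*introduction)\\b|\\z)");

    // Returns the abstract with whitespace collapsed, or null if none is found.
    static String find(String text) {
        if (text == null) {
            return null;
        }
        Matcher m = ABSTRACT.matcher(text);
        if (!m.find()) {
            return null;
        }
        return m.group(1).replaceAll("\\s+", " ").trim();
    }
}
```

Because the match starts at the first occurrence of the word "abstract", a title that contains that word would produce a wrong result, so treat the output as a candidate to be checked rather than a guaranteed answer.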