public static void main(String[] args) throws IOException, TikaException, SAXException {
    // try-with-resources closes the stream even if parsing fails
    try (InputStream input = new FileInputStream("/home/alican/Downloads/solr-4.10.2/example/solr/senior/solr-word.pdf")) {
        ContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        new PDFParser().parse(input, handler, metadata, new ParseContext());
        System.out.println(handler.toString());   // extracted body text
        System.out.println(metadata.toString());  // every metadata key=value pair
    }
}
I can print both the content of the PDF and its metadata. When I print metadata.toString(),
the output looks like this:
access_permission:extract_for_accessibility=true meta:save-date=2008-11-13T13:35:51Z dc:subject=solr, word, pdf subject=solr word dcterms:created=2008-11-13T13:35:51Z Author=Grant Ingersoll date=2008-11-13T13:35:51Z
... (and so on)
How can I select only the author, the title, and the number of pages?
Edit: Solution:
String[] author = metadata.getValues(Metadata.AUTHOR);
System.out.println(Arrays.toString(author));
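The same idea can be extended to several fields at once. What follows is only a minimal sketch, not part of the original solution: the class `MetadataFields` and the helper `select` are hypothetical names, and a plain map stands in for Tika's `Metadata` so the example runs without Tika on the classpath. The `"Author"` key is taken from the output shown above; `"title"` and `"xmpTPg:NPages"` are assumed key names and may differ for your Tika version, and the page count value in the sample data is a placeholder.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

public class MetadataFields {

    // Looks up each requested key and keeps only the keys that have a value.
    // 'lookup' is any key -> value function; with Tika it could be metadata::get.
    public static Map<String, String> select(Function<String, String> lookup, String... keys) {
        Map<String, String> selected = new LinkedHashMap<>();
        for (String key : keys) {
            String value = lookup.apply(key);
            if (value != null) {
                selected.put(key, value);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Stand-in for Tika's Metadata object. "Author" and "date" come from the
        // output in the question; the page count entry is placeholder data.
        Map<String, String> sample = new LinkedHashMap<>();
        sample.put("Author", "Grant Ingersoll");
        sample.put("date", "2008-11-13T13:35:51Z");
        sample.put("xmpTPg:NPages", "1");

        // "title" is absent from the sample, so it is simply left out of the result.
        System.out.println(select(sample::get, "Author", "title", "xmpTPg:NPages"));
    }
}
```

With a real `Metadata` object, the call would presumably be `MetadataFields.select(metadata::get, "Author", "title", "xmpTPg:NPages")`. If a field comes back missing, printing `metadata.names()` shows which keys the parser actually produced for that file.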