How to get the header and footer from a PDF file using Apache Tika in Java

Date: 2013-08-12 12:02:34

Tags: java pdfbox apache-tika

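I am using Apache Tika to extract the content of PDF files. The extracted text also includes the page headers and footers. My requirement is to get the text without the headers and footers. Below is the sample code I use to extract the content. Sample code: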

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class test {

    public static void main(String[] args) throws Exception {

        String file = "C://Sample.pdf";
        File file1 = new File(file);
        Metadata metadata = new Metadata();
        // Raise the handler's write limit to 10 * 1024 * 1024 characters
        BodyContentHandler handler = new BodyContentHandler(10 * 1024 * 1024);
        AutoDetectParser parser = new AutoDetectParser();
        // try-with-resources closes the input stream even if parsing fails
        try (InputStream input = new FileInputStream(file1)) {
            parser.parse(input, handler, metadata);
        }
        String path = "C://AUG7th".concat("/").concat(file1.getName())
                .concat(".txt");
        // The extracted text still contains the page headers and footers
        String content = handler.toString();
        File file2 = new File(path);
        // Write the extracted text next to the source file name, as .txt
        try (BufferedWriter bw = new BufferedWriter(
                new FileWriter(file2.getAbsoluteFile()))) {
            bw.write(content);
        }
    }

}

Please suggest how I can do this. Thanks.

1 answer:

Answer 0: (score: 0)

I have not found a way to identify the headers or footers of a PDF with Tika. You will need another API for that, such as PDFTextStream.

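Edit: OK. Tika will (try to) extract the raw text and metadata from the PDF.
You need to parse and analyze the raw text yourself to remove the headers and footers. I would suggest using PDFTextStream instead of Tika, because it simplifies the task of implementing an algorithm for this purpose. When you parse a PDF with PDFTextStream, you can extract TextUnits, which are not simple characters but also "carry" other information. You can also select text regions, and in addition you can choose to preserve the visual layout of each page.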
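To illustrate what "analyze the raw text" could look like, here is a minimal sketch of one common heuristic: treat a line as a header or footer if it sits at the top or bottom edge of a page and recurs on most pages. It assumes you already have the text of each page as a separate string. `HeaderFooterStripper`, `edgeLines`, and `minShare` are hypothetical names made up for this example; they are not part of Tika or PDFTextStream.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper for illustration; not part of Tika or PDFTextStream. */
final class HeaderFooterStripper {

    /** Masks digit runs so that "Page 3" and "Page 4" count as the same line. */
    static String normalize(String line) {
        return line.trim().replaceAll("\\d+", "#");
    }

    /** Returns the non-blank lines of one page, in order. */
    static List<String> nonBlankLines(String page) {
        List<String> lines = new ArrayList<>();
        for (String line : page.split("\\R")) {
            if (!line.trim().isEmpty()) {
                lines.add(line);
            }
        }
        return lines;
    }

    /**
     * Drops lines that lie within the first or last edgeLines non-blank lines
     * of a page and recur (after normalization) on at least minShare of all
     * pages, with a minimum of two pages. Blank lines are dropped as well.
     */
    static List<String> strip(List<String> pages, int edgeLines, double minShare) {
        List<List<String>> split = new ArrayList<>();
        Map<String, Integer> pageCount = new HashMap<>();
        for (String page : pages) {
            List<String> lines = nonBlankLines(page);
            split.add(lines);
            // Count each candidate once per page, even if it occurs twice
            Set<String> seen = new HashSet<>();
            for (int i = 0; i < lines.size(); i++) {
                if (isEdge(i, lines.size(), edgeLines)) {
                    seen.add(normalize(lines.get(i)));
                }
            }
            for (String key : seen) {
                pageCount.merge(key, 1, Integer::sum);
            }
        }
        int needed = Math.max(2, (int) Math.ceil(minShare * pages.size()));
        List<String> result = new ArrayList<>();
        for (List<String> lines : split) {
            List<String> kept = new ArrayList<>();
            for (int i = 0; i < lines.size(); i++) {
                boolean repeated = isEdge(i, lines.size(), edgeLines)
                        && pageCount.get(normalize(lines.get(i))) >= needed;
                if (!repeated) {
                    kept.add(lines.get(i));
                }
            }
            result.add(String.join("\n", kept));
        }
        return result;
    }

    private static boolean isEdge(int index, int size, int edgeLines) {
        return index < edgeLines || index >= size - edgeLines;
    }
}
```

A call such as `HeaderFooterStripper.strip(pages, 1, 0.6)` would check only the first and last line of each page and drop those that recur on at least 60% of the pages. This is only a heuristic: it will miss headers that change from page to page (chapter titles, for example), and it can remove genuine body text that happens to repeat at a page edge.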

@Gagravarr: here is the XHTML output for the PDF

<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="dcterms:modified" content="2012-11-21T16:08:42Z"/>
<meta name="meta:creation-date" content="2010-06-22T07:00:09Z"/>
<meta name="meta:save-date" content="2012-11-21T16:08:42Z"/>
<meta name="Content-Length" content="702419"/>
<meta name="Last-Modified" content="2012-11-21T16:08:42Z"/>
<meta name="dcterms:created" content="2010-06-22T07:00:09Z"/>
<meta name="date" content="2012-11-21T16:08:42Z"/>
<meta name="modified" content="2012-11-21T16:08:42Z"/>
<meta name="xmpTPg:NPages" content="20"/>
<meta name="Creation-Date" content="2010-06-22T07:00:09Z"/>
<meta name="created" content="Tue Jun 22 09:00:09 CEST 2010"/>
<meta name="producer" content="Atypon Systems, Inc."/>
<meta name="Content-Type" content="application/pdf"/>
<meta name="xmp:CreatorTool" content="PDFplus"/>
<meta name="resourceName" content="Lessons from a High-Impact Observatory The Hubble Space Telescope.pdf"/>
<meta name="Last-Save-Date" content="2012-11-21T16:08:42Z"/>
<meta name="dc:title" content="Lessons from a High-Impact Observatory: The &lt;italic&gt;Hubble Space Telescopes&lt;/italic&gt; Science Productivity between 1998 and 2008"/>
<title>Lessons from a High-Impact Observatory: The &lt;italic&gt;Hubble Space Telescopes&lt;/italic&gt; Science Productivity between 1998 and 2008</title>
</head>
<body><div class="page"><p/>
<p>Lessons from a High-Impact Observatory: The Hubble Space Telescope’s Science Productivity
between 1998 and 2008
Author(s): Dániel Apai, Jill Lagerstrom, Iain Neill Reid, Karen L. Levay, Elizabeth Fraser,
Antonella Nota, and Edwin Henneken
Reviewed work(s):
Source: Publications of the Astronomical Society of the Pacific, Vol. 122, No. 893 (July 2010),
pp. 808-826
Published by: The University of Chicago Press on behalf of the Astronomical Society of the Pacific
Stable URL: http://www.jstor.org/stable/10.1086/654851 .
Accessed: 21/11/2012 11:08
</p>
<p>Your use of the JSTOR archive indicates your acceptance of the Terms &amp; Conditions of Use, available at .
http://www.jstor.org/page/info/about/policies/terms.jsp
</p>
<p> .
</p>
<p>JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide range of
content in a trusted digital archive. We use information technology and tools to increase productivity and facilitate new forms
of scholarship. For more information about JSTOR, please contact support@jstor.org.
</p>................</body>

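In the <head>, Tika gives us the metadata it found; in the <body>, it gives us the text split into paragraphs (which looks a bit clumsy), and it can also give us annotation links. So I don't think it is useful for this purpose.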
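For what it is worth, the `<div class="page">` wrapper visible in the output above does mark page boundaries, which is what any per-page analysis of repeated lines would need. Below is a minimal sketch of a SAX handler that collects the text of each page div separately. `PageTextCollector` is a hypothetical class written for this example using only the JDK's `org.xml.sax` types. I am assuming that every page is wrapped in such a div, as in the output shown; I have not verified this across Tika versions.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects the text of each <div class="page"> separately. */
final class PageTextCollector extends DefaultHandler {

    private final List<String> pages = new ArrayList<>();
    private StringBuilder current; // null while outside a page div
    private int divDepth;          // div nesting depth, counting the page div as 1

    /** Uses the local name when namespaces are on, otherwise the qualified name. */
    private static String nameOf(String localName, String qName) {
        return (localName == null || localName.isEmpty()) ? qName : localName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (!"div".equals(nameOf(localName, qName))) {
            return;
        }
        if (current == null) {
            if ("page".equals(atts.getValue("class"))) {
                current = new StringBuilder();
                divDepth = 1;
            }
        } else {
            divDepth++; // a nested div inside the page, e.g. for annotations
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current == null) {
            return;
        }
        String name = nameOf(localName, qName);
        if ("p".equals(name)) {
            current.append('\n'); // keep paragraph boundaries as line breaks
        } else if ("div".equals(name) && --divDepth == 0) {
            pages.add(current.toString());
            current = null;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    /** Returns the collected page texts in document order. */
    List<String> getPages() {
        return pages;
    }
}
```

Since Tika's `parse` method takes an `org.xml.sax.ContentHandler`, an instance of this class could presumably be passed in place of the `BodyContentHandler` from the question, after which `getPages()` would hold one string per page. Removing the headers and footers from those strings would still be up to your own code.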