Parsing text and images from a PDF

Time: 2017-01-01 20:46:27

Tags: java parsing pdf apache-tika

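I am trying to parse some PDFs for a home automation idea.

As a first step I want to see what data I can get out of a PDF. The PDF I am testing with is here: http://www.antrimandnewtownabbey.gov.uk/getmedia/ebfd33ba-d176-462b-99e3-9416b774f7bc/BIN-FLYER-THUR-CYC-B-FULL-YEAR-December-16-November-17.pdf.aspx

I am using the following code to download and parse the PDF: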

import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;

import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class WebPagePdfExtractor {

    /** Downloads the document at the given URL and extracts its text and basic metadata with Tika. */
    public Map<String, Object> processRecord(String url) {
        Map<String, Object> map = new HashMap<String, Object>();
        // try-with-resources closes the client and the response even if parsing fails
        try (CloseableHttpClient httpclient = HttpClients.createDefault();
             CloseableHttpResponse response = httpclient.execute(new HttpGet(url))) {
            HttpEntity entity = response.getEntity();
            if (entity != null) {
                try (InputStream input = entity.getContent()) {
                    // -1 lifts the handler's default 100,000-character write limit
                    BodyContentHandler handler = new BodyContentHandler(-1);
                    Metadata metadata = new Metadata();
                    AutoDetectParser parser = new AutoDetectParser();
                    ParseContext parseContext = new ParseContext();
                    parser.parse(input, handler, metadata, parseContext);
                    // Replace line breaks and tabs with spaces
                    map.put("text", handler.toString().replaceAll("\n|\r|\t", " "));
                    map.put("title", metadata.get(TikaCoreProperties.TITLE));
                    map.put("pageCount", metadata.get("xmpTPg:NPages"));
                    map.put("status_code", response.getStatusLine().getStatusCode() + "");
                }
            }
        } catch (Exception exception) {
            exception.printStackTrace();
        }
        return map;
    }

    public static void main(String[] args) {
        WebPagePdfExtractor webPagePdfExtractor = new WebPagePdfExtractor();
        Map<String, Object> extractedMap = webPagePdfExtractor.processRecord("http://www.antrimandnewtownabbey.gov.uk/getmedia/ebfd33ba-d176-462b-99e3-9416b774f7bc/BIN-FLYER-THUR-CYC-B-FULL-YEAR-December-16-November-17.pdf.aspx");
        System.out.println(extractedMap.get("text"));
    }
}

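This returns all of the text from the PDF perfectly. Taking this a step further, is it possible to get a description of some of the images in the PDF? For example, there is a coloured image next to each date; is there a way to get this information?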
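To make "a description of the image" concrete, here is a minimal sketch of one possible last step: naming the dominant colour of an image once it is available as a `BufferedImage`. It has not been tested against this flyer. It assumes the coloured marks are embedded raster images that can be pulled out of the PDF at all (Tika's `PDFParserConfig.setExtractInlineImages(true)` together with an `EmbeddedDocumentExtractor` is one route); if they are drawn as vector graphics, no image will be extracted. The class name `DominantColour`, its methods, and the palette of names and RGB values are all hypothetical and would have to be matched to the real document.

```java
import java.awt.Color;
import java.awt.image.BufferedImage;
import java.util.LinkedHashMap;
import java.util.Map;

public class DominantColour {

    // Hypothetical reference palette; the real values would need to be sampled from the flyer.
    private static final Map<String, Color> PALETTE = new LinkedHashMap<>();
    static {
        PALETTE.put("black", new Color(0, 0, 0));
        PALETTE.put("blue", new Color(0, 0, 255));
        PALETTE.put("brown", new Color(139, 69, 19));
        PALETTE.put("green", new Color(0, 128, 0));
    }

    /** Returns the palette name closest to the mean RGB value of the image. */
    public static String describe(BufferedImage image) {
        long r = 0, g = 0, b = 0;
        int width = image.getWidth();
        int height = image.getHeight();
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = image.getRGB(x, y);
                r += (rgb >> 16) & 0xFF;
                g += (rgb >> 8) & 0xFF;
                b += rgb & 0xFF;
            }
        }
        long n = (long) width * height;
        return nearest((int) (r / n), (int) (g / n), (int) (b / n));
    }

    /** Returns the palette name with the smallest squared RGB distance to (r, g, b). */
    static String nearest(int r, int g, int b) {
        String best = null;
        long bestDistance = Long.MAX_VALUE;
        for (Map.Entry<String, Color> entry : PALETTE.entrySet()) {
            Color c = entry.getValue();
            long dr = r - c.getRed();
            long dg = g - c.getGreen();
            long db = b - c.getBlue();
            long distance = dr * dr + dg * dg + db * db;
            if (distance < bestDistance) {
                bestDistance = distance;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Stand-in for an extracted image: a 4x4 block of a single greenish colour.
        BufferedImage sample = new BufferedImage(4, 4, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < 4; y++) {
            for (int x = 0; x < 4; x++) {
                sample.setRGB(x, y, new Color(10, 120, 20).getRGB());
            }
        }
        System.out.println(describe(sample)); // prints "green"
    }
}
```

Averaging all pixels is the simplest choice but is skewed by large white or transparent backgrounds; skipping near-white pixels, or counting the most frequent colour instead of the mean, may work better on real icons.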

0 answers:

No answers yet