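I'm trying to parse a PDF for some home automation ideas.
I want to see what data I can get out of a PDF. The PDF I'm testing with is here: http://www.antrimandnewtownabbey.gov.uk/getmedia/ebfd33ba-d176-462b-99e3-9416b774f7bc/BIN-FLYER-THUR-CYC-B-FULL-YEAR-December-16-November-17.pdf.aspx
So far I'm using the following code to parse the PDF: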
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;

import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class WebPagePdfExtractor {

    // Downloads the document at the given URL and returns its text and basic metadata.
    public Map<String, Object> processRecord(String url) {
        DefaultHttpClient httpclient = new DefaultHttpClient();
        Map<String, Object> map = new HashMap<String, Object>();
        try {
            HttpGet httpGet = new HttpGet(url);
            HttpResponse response = httpclient.execute(httpGet);
            HttpEntity entity = response.getEntity();
            if (entity != null) {
                // try-with-resources closes the response stream even if parsing fails
                try (InputStream input = entity.getContent()) {
                    BodyContentHandler handler = new BodyContentHandler();
                    Metadata metadata = new Metadata();
                    AutoDetectParser parser = new AutoDetectParser();
                    ParseContext parseContext = new ParseContext();
                    parser.parse(input, handler, metadata, parseContext);
                    // Flatten line breaks and tabs so the text comes back as a single line
                    map.put("text", handler.toString().replaceAll("\n|\r|\t", " "));
                    map.put("title", metadata.get(TikaCoreProperties.TITLE));
                    map.put("pageCount", metadata.get("xmpTPg:NPages"));
                    map.put("status_code", response.getStatusLine().getStatusCode() + "");
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        } catch (Exception exception) {
            exception.printStackTrace();
        }
        return map;
    }

    public static void main(String[] args) {
        WebPagePdfExtractor webPagePdfExtractor = new WebPagePdfExtractor();
        Map<String, Object> extractedMap = webPagePdfExtractor.processRecord("http://www.antrimandnewtownabbey.gov.uk/getmedia/ebfd33ba-d176-462b-99e3-9416b774f7bc/BIN-FLYER-THUR-CYC-B-FULL-YEAR-December-16-November-17.pdf.aspx");
        System.out.println(extractedMap.get("text"));
    }
}
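This returns all of the text from the PDF perfectly. I'm wondering whether this approach can be taken further to get a description of some of the images in the PDF. For example, next to each date there is a colored image; is there any way to get this information?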
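
To make the goal concrete, here is a rough sketch of the kind of "description" I have in mind. It assumes the colored markers can somehow be pulled out of the PDF as java.awt.image.BufferedImage objects, which is exactly the part I don't know how to do and which is not shown here (they may even turn out to be vector graphics rather than images). The class name ImageColorLabeler, the palette, and its color names are all made up for illustration and are not taken from the PDF or from Tika.

```java
import java.awt.Color;
import java.awt.image.BufferedImage;
import java.util.LinkedHashMap;
import java.util.Map;

public class ImageColorLabeler {

    // Hypothetical reference palette; the real colors in the PDF are unknown to me.
    private static final Map<String, Color> PALETTE = new LinkedHashMap<>();
    static {
        PALETTE.put("black", new Color(0, 0, 0));
        PALETTE.put("blue", new Color(0, 0, 255));
        PALETTE.put("brown", new Color(139, 69, 19));
        PALETTE.put("green", new Color(0, 128, 0));
    }

    // Averages the red, green and blue channels over every pixel of the image.
    static Color averageColor(BufferedImage image) {
        long r = 0, g = 0, b = 0;
        int width = image.getWidth();
        int height = image.getHeight();
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = image.getRGB(x, y);
                r += (rgb >> 16) & 0xFF;
                g += (rgb >> 8) & 0xFF;
                b += rgb & 0xFF;
            }
        }
        long pixels = (long) width * height;
        return new Color((int) (r / pixels), (int) (g / pixels), (int) (b / pixels));
    }

    // Returns the palette name with the smallest squared RGB distance to the given color.
    static String nearestLabel(Color color) {
        String best = null;
        long bestDistance = Long.MAX_VALUE;
        for (Map.Entry<String, Color> entry : PALETTE.entrySet()) {
            Color ref = entry.getValue();
            long dr = color.getRed() - ref.getRed();
            long dg = color.getGreen() - ref.getGreen();
            long db = color.getBlue() - ref.getBlue();
            long distance = dr * dr + dg * dg + db * db;
            if (distance < bestDistance) {
                bestDistance = distance;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Stand-in for an image extracted from the PDF: a 16x16 square of a bluish color.
        BufferedImage icon = new BufferedImage(16, 16, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < 16; y++) {
            for (int x = 0; x < 16; x++) {
                icon.setRGB(x, y, new Color(10, 20, 240).getRGB());
            }
        }
        System.out.println(nearestLabel(averageColor(icon)));
    }
}
```

If something like this were possible, I could pair each date in the extracted text with a label such as "blue" or "brown". The missing piece is getting at the images (and their positions relative to the dates) in the first place.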