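I have written Java code that uses the PDFBox API to extract data from a PDF at a URL, and I can successfully get the entire document as text. However, the PDF contains article-related information such as the title, author name, and embargo date, and that is what I want to extract, not the full text. Is there a way to get only selected data from a PDF using PDFBox?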
// Download the PDF over HTTP using Basic authentication.
URL url = new URL("http://www.example.com");
HttpURLConnection connection = (HttpURLConnection) url.openConnection();
connection.setRequestProperty("Authorization", "Basic " + encodedString);
connection.connect();
InputStream input = connection.getInputStream();
FileOutputStream fos1 = new FileOutputStream("download.pdf");
// (... perform writing operation ...)

// Parse the downloaded file and extract the text of the whole document.
File in = new File("download.pdf");
PDFParser parser = new PDFParser(new FileInputStream(in));
parser.parse();
COSDocument cosDoc = parser.getDocument();
PDFTextStripper pdfStripper = new PDFTextStripper();
PDDocument pdDoc = new PDDocument(cosDoc);
String parsedText = pdfStripper.getText(pdDoc);
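
One fallback I can think of is to post-process `parsedText` with regular expressions instead of asking PDFBox for the fields directly. The sketch below is only an illustration under strong assumptions: it assumes each field appears on its own line with a label such as `Title:`, `Author:`, or `Embargo date:`, which may not match the real PDF. The class name `FieldExtractor`, the method `extractFields`, and the sample text are all hypothetical, and the code uses only the Java standard library, not PDFBox.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FieldExtractor {

    // Assumed labels; adjust them to whatever the extracted text really contains.
    private static final Pattern FIELD = Pattern.compile(
            "^(Title|Author|Embargo date)\\s*:\\s*(.+)$",
            Pattern.MULTILINE | Pattern.CASE_INSENSITIVE);

    /**
     * Scans the text line by line and returns a map from lower-cased label
     * to value. If a label occurs more than once, the first occurrence wins.
     */
    public static Map<String, String> extractFields(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = FIELD.matcher(text);
        while (m.find()) {
            String key = m.group(1).toLowerCase(Locale.ROOT);
            fields.putIfAbsent(key, m.group(2).trim());
        }
        return fields;
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by pdfStripper.getText(pdDoc).
        String parsedText = "Title: Sample Article\n"
                + "Author: Jane Doe\n"
                + "Embargo date: 2015-06-01\n"
                + "Body text follows here.";
        System.out.println(extractFields(parsedText));
    }
}
```

This only works if the labels survive text extraction in a predictable form, so I would still prefer a way to select the data through PDFBox itself.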