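I am trying to associate text with paragraphs by walking the content tree of a PDF file. I am using PDFBox, but I cannot find the link between a paragraph and the text it contains (see the code below):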
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Map.Entry;

import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.io.RandomAccessFile;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.documentinterchange.logicalstructure.PDStructureElement;
import org.apache.pdfbox.pdmodel.documentinterchange.logicalstructure.PDStructureTreeRoot;

public class ReadPdf {
    public static void main(String[] args) throws IOException {
        // MyBufferedWriter is a custom class not shown in this post
        MyBufferedWriter out = new MyBufferedWriter(new FileWriter(new File(
                "C:/Users/wip.txt")));
        RandomAccessFile raf = new RandomAccessFile(new File(
                "C:/Users/mypdf.pdf"), "r");
        PDFParser parser = new PDFParser(raf);
        parser.parse();
        COSDocument cosDoc = parser.getDocument();
        out.write(cosDoc.getXrefTable().toString());
        out.write(cosDoc.getObjects().toString());
        PDDocument document = parser.getPDDocument();
        PDStructureTreeRoot treeRoot = document.getDocumentCatalog().getStructureTreeRoot();
        for (Object kid : treeRoot.getKids()) {
            for (Object kid2 : ((PDStructureElement) kid).getKids()) {
                PDStructureElement kid2c = (PDStructureElement) kid2;
                // Strings must be compared with equals(), not ==
                if ("P".equals(kid2c.getStandardStructureType())) {
                    for (Object kid3 : kid2c.getKids()) {
                        if (!(kid3 instanceof PDStructureElement)) {
                            // Print all the keys in the paragraph COSDictionary
                            for (Entry<COSName, COSBase> entry : kid2c.getCOSObject().entrySet()) {
                                System.out.println(entry.getKey().toString());
                                System.out.println(entry.getValue().toString());
                            }
                        }
                    }
                }
            }
        }
    }
}
When I print the contents, I get the following keys:
Sample output:
COSName {K}
COSInt {2}
COSName {PG}
COSObject {12,0}
COSName {C}
COSName {Normal}
COSName {A}
COSObject {434,0}
COSName {S}
COSName {Normal}
COSName {P}
COSObject {421,0}
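Answer 0 (score: 0)
I found a way to do this by parsing the page's content stream. According to chapter 10.6.3 of the PDF specification, there is a link between the number that each piece of text carries in the content stream under /P ... /MCID and an attribute of the tag (a PDStructureElement in PDFBox), which can be found in its COSObject.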
1) Get the text and the MCID of each marked-content sequence by parsing the page's content stream.
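The idea behind step 1 can be illustrated without PDFBox. What follows is only a minimal sketch, not the code from the original answer: it assumes a simplified, hand-written content stream in which every marked-content sequence has the form /Tag <</MCID n>> BDC ... EMC and text is shown only with plain (string) Tj operators. The names McidTextSketch and textByMcid and the sample stream are hypothetical. Real content streams (escaped parentheses, TJ arrays, nested sequences, compressed streams) need a proper tokenizer such as the one a PDF library provides.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class McidTextSketch {
    // One marked-content sequence: /Tag <</MCID n>> BDC ... EMC (simplified form)
    private static final Pattern SEQUENCE = Pattern.compile(
            "/(\\w+)\\s*<<\\s*/MCID\\s+(\\d+)\\s*>>\\s*BDC(.*?)EMC", Pattern.DOTALL);
    // A plain text-showing operation: (string) Tj, without escaped parentheses
    private static final Pattern SHOW_TEXT = Pattern.compile("\\(([^()]*)\\)\\s*Tj");

    /** Maps each MCID to the text shown inside its marked-content sequence. */
    static Map<Integer, String> textByMcid(String contentStream) {
        Map<Integer, String> result = new LinkedHashMap<>();
        Matcher seq = SEQUENCE.matcher(contentStream);
        while (seq.find()) {
            int mcid = Integer.parseInt(seq.group(2));
            StringBuilder text = new StringBuilder();
            Matcher show = SHOW_TEXT.matcher(seq.group(3));
            while (show.find()) {
                text.append(show.group(1));
            }
            result.put(mcid, text.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical content stream, written by hand for illustration only
        String stream = String.join("\n",
                "/P <</MCID 0>> BDC",
                "BT /F1 12 Tf (Hello ) Tj (world) Tj ET",
                "EMC",
                "/P <</MCID 2>> BDC",
                "BT (Second paragraph) Tj ET",
                "EMC");
        // Prints one "mcid -> text" line per marked-content sequence
        textByMcid(stream).forEach((mcid, text) -> System.out.println(mcid + " -> " + text));
    }
}
```

The resulting map is keyed by MCID, which is the value that has to be compared with the K entry of a structure element.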
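2) Then get the tag that matches the MCID, and its attributes:
// A paragraph element taken from the structure tree
PDStructureElement pDStructureElement;
// When the element has a single marked-content kid, K holds its MCID
pDStructureElement.getCOSObject().getInt(COSName.K);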
That should do it. In documents without tags (document.getDocumentCatalog().getStructureTreeRoot() has no kids), this matching is not possible, but the text can still be read using step 1.
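The matching itself comes down to a lookup by MCID. The sketch below is an illustration under stated assumptions rather than PDFBox code: Tag is a hypothetical stand-in for a structure element that carries only its S type and the integer found in K, and the MCID-to-text map is assumed to have been built from the page content stream beforehand. MCIDs are only unique within one content stream, so in a real document the page referenced by the element's Pg entry has to be taken into account as well; the sketch handles a single page and ignores that.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class McidMatchSketch {
    /** Hypothetical stand-in for a structure element: its /S type and the MCID held in /K. */
    static final class Tag {
        final String type;
        final int mcid;

        Tag(String type, int mcid) {
            this.type = type;
            this.mcid = mcid;
        }
    }

    /** Pairs each tag with the text of its MCID; tags without matching text are skipped. */
    static List<String> match(List<Tag> tags, Map<Integer, String> textByMcid) {
        List<String> lines = new ArrayList<>();
        for (Tag tag : tags) {
            String text = textByMcid.get(tag.mcid);
            if (text != null) {
                lines.add(tag.type + ": " + text);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Hypothetical text for a single page, keyed by MCID
        Map<Integer, String> textByMcid = new LinkedHashMap<>();
        textByMcid.put(2, "Second paragraph");
        // Mirrors the dictionary printed in the question: S is Normal, K is 2
        List<Tag> tags = Collections.singletonList(new Tag("Normal", 2));
        match(tags, textByMcid).forEach(System.out::println);
    }
}
```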