Understanding the TextPosition getTextMatrix() method in PDFBox 2.0.3

Date: 2016-11-24 06:17:05

Tags: pdfbox

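I am trying to find the position of each character inside a PDF page. I have implemented a class extending PDFTextStripper to get at the TextPosition objects, and I call setSortByPosition(true) so that charactersByArticle is arranged according to the page structure. I have a PDF (here is the link, see page 10) that contains both landscape and portrait characters. The TextPosition lists for each page contain all the sorted characters. On page 10, however, the landscape characters appear after the portrait characters in the sorted order. Here is the code: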

public static void main(String[] args) throws IOException {
    TextExtractor textExtractor = new TextExtractor("C:\\Users\\prabhjot.rai\\Desktop\\Demo\\sbet201601.pdf");
    PDFPage page = textExtractor.getPages().get(10);
    // Print each character with its x coordinate and its text matrix.
    for (List<TextPosition> article : page.getCharactersByArticle()) {
        for (TextPosition text : article) {
            System.out.print(text.getUnicode() + " ");
            System.out.print(text.getX() + " ");
            System.out.print(text.getTextMatrix() + " ");
            System.out.println();
        }
    }
}

Here are the last lines of the output for page 10:

1 406.46 [7.787,0.0,0.0,7.7706,406.46,64.9355] 
4 410.64938 [7.787,0.0,0.0,7.7706,410.64938,64.9355] 
Y 250.9882 [7.787,0.0,0.0,7.7706,250.9882,46.5761] 
E 256.3768 [7.787,0.0,0.0,7.7706,256.3768,46.5761] 
A 261.7654 [7.787,0.0,0.0,7.7706,261.7654,46.5761] 
R 267.15402 [7.787,0.0,0.0,7.7706,267.15402,46.5761] 
P 94.1495 [0.0,7.7706,-7.787,0.0,94.1495,118.1203] 
e 94.1495 [0.0,7.7706,-7.787,0.0,94.1495,123.499886] 
r 94.1495 [0.0,7.7706,-7.787,0.0,94.1495,127.680466] 
c 94.1495 [0.0,7.7706,-7.787,0.0,94.1495,130.07147] 
e 94.1495 [0.0,7.7706,-7.787,0.0,94.1495,134.25516] 
n 94.1495 [0.0,7.7706,-7.787,0.0,94.1495,138.43575] 
t 94.1495 [0.0,7.7706,-7.787,0.0,94.1495,142.61633] 
9 486.465 [0.0,7.98,-7.98,0.0,486.465,37.2] 
  486.465 [0.0,7.98,-7.98,0.0,486.465,41.220325] 
  486.465 [0.0,7.98,-7.98,0.0,486.465,43.199364] 
| 486.465 [0.0,7.98,-7.98,0.0,486.465,45.178402] 
  486.465 [0.0,7.98,-7.98,0.0,486.465,46.678642] 
  486.465 [0.0,7.98,-7.98,0.0,486.465,48.838825] 
N 486.465 [0.0,7.98,-7.98,0.0,486.465,50.817863] 
F 486.465 [0.0,7.98,-7.98,0.0,486.465,56.579422] 
I 486.465 [0.0,7.98,-7.98,0.0,486.465,61.016304] 
B 486.465 [0.0,7.98,-7.98,0.0,486.465,63.65609] 
  486.465 [0.0,7.98,-7.98,0.0,486.465,68.99631] 
S 486.465 [0.0,7.98,-7.98,0.0,486.465,71.09585] 
m 486.465 [0.0,7.98,-7.98,0.0,486.465,75.53273] 
a 486.465 [0.0,7.98,-7.98,0.0,486.465,81.59274] 
l 486.465 [0.0,7.98,-7.98,0.0,486.465,85.135864] 
l 486.465 [0.0,7.98,-7.98,0.0,486.465,87.3543] 
  486.465 [0.0,7.98,-7.98,0.0,486.465,89.57274] 
B 486.465 [0.0,7.98,-7.98,0.0,486.465,91.79278] 
u 486.465 [0.0,7.98,-7.98,0.0,486.465,97.132996] 
s 486.465 [0.0,7.98,-7.98,0.0,486.465,101.15252] 
i 486.465 [0.0,7.98,-7.98,0.0,486.465,104.2727] 
n 486.465 [0.0,7.98,-7.98,0.0,486.465,106.491135] 
e 486.465 [0.0,7.98,-7.98,0.0,486.465,110.51146] 
s 486.465 [0.0,7.98,-7.98,0.0,486.465,114.05458] 
s 486.465 [0.0,7.98,-7.98,0.0,486.465,117.17476] 
  486.465 [0.0,7.98,-7.98,0.0,486.465,120.294136] 
E 486.465 [0.0,7.98,-7.98,0.0,486.465,122.334625] 
c 486.465 [0.0,7.98,-7.98,0.0,486.465,127.19444] 
o 486.465 [0.0,7.98,-7.98,0.0,486.465,130.73756] 
n 486.465 [0.0,7.98,-7.98,0.0,486.465,134.75789] 
o 486.465 [0.0,7.98,-7.98,0.0,486.465,138.77742] 
m 486.465 [0.0,7.98,-7.98,0.0,486.465,142.79774] 
i 486.465 [0.0,7.98,-7.98,0.0,486.465,148.85776] 
c 486.465 [0.0,7.98,-7.98,0.0,486.465,151.0762] 
  486.465 [0.0,7.98,-7.98,0.0,486.465,154.61932] 
T 486.465 [0.0,7.98,-7.98,0.0,486.465,156.71886] 
r 486.465 [0.0,7.98,-7.98,0.0,486.465,161.57947] 
e 486.465 [0.0,7.98,-7.98,0.0,486.465,164.21924] 
n 486.465 [0.0,7.98,-7.98,0.0,486.465,167.76236] 
d 486.465 [0.0,7.98,-7.98,0.0,486.465,171.78268] 
s 486.465 [0.0,7.98,-7.98,0.0,486.465,175.80222] 
  486.465 [0.0,7.98,-7.98,0.0,486.465,178.9224] 
  486.465 [0.0,7.98,-7.98,0.0,486.465,180.84238] 
  486.465 [0.0,7.98,-7.98,0.0,486.465,182.82141] 
M 486.465 [0.0,7.98,-7.98,0.0,486.465,184.8] 
o 486.465 [0.0,7.98,-7.98,0.0,486.465,191.46011] 
n 486.465 [0.0,7.98,-7.98,0.0,486.465,195.48204] 
t 486.465 [0.0,7.98,-7.98,0.0,486.465,199.50397] 
h 486.465 [0.0,7.98,-7.98,0.0,486.465,201.72401] 
l 486.465 [0.0,7.98,-7.98,0.0,486.465,205.74594] 
y 486.465 [0.0,7.98,-7.98,0.0,486.465,207.96599] 
  486.465 [0.0,7.98,-7.98,0.0,486.465,211.50592] 
R 486.465 [0.0,7.98,-7.98,0.0,486.465,213.48495] 
e 486.465 [0.0,7.98,-7.98,0.0,486.465,218.34477] 
p 486.465 [0.0,7.98,-7.98,0.0,486.465,221.8847] 
o 486.465 [0.0,7.98,-7.98,0.0,486.465,225.90663] 
r 486.465 [0.0,7.98,-7.98,0.0,486.465,229.92856] 
t 486.465 [0.0,7.98,-7.98,0.0,486.465,233.04794] 

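I can clearly see that getTextMatrix() produces different matrices for landscape and portrait characters. (Portrait: [7.787,0.0,0.0,7.7706,406.46,64.9355]; landscape: [0.0,7.7706,-7.787,0.0,94.1495,118.1203]. The landscape matrix starts with zero in its first position.) I would like to understand the matrix better so that I can reliably distinguish the two types. I have looked at the documentation but could not make sense of it. Can anyone share some articles or ideas on this?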
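To make the difference concrete, here is a minimal sketch of how the two matrices above could be told apart. It assumes the six printed values are the usual PDF affine transform [a, b, c, d, e, f], where a, b, c, d carry scale and rotation and e, f carry the translation. For text scaled by s and rotated by an angle θ, a = s·cos θ and b = s·sin θ, so the angle can be recovered with atan2(b, a). The class and method names are hypothetical and the code uses only the JDK; with PDFBox the same values should be available from the Matrix getters (getScaleX() for a, getShearY() for b), but treat that mapping as an assumption to verify against the Javadoc.

```java
// Hypothetical helper, not part of PDFBox: classifies a text matrix by its rotation.
final class TextMatrixOrientation {

    /** Rotation angle in degrees implied by the first two matrix entries [a, b, ...]. */
    public static double rotationDegrees(double a, double b) {
        return Math.toDegrees(Math.atan2(b, a));
    }

    /** Snaps the angle to the nearest multiple of 90 and normalizes it to 0, 90, 180 or 270. */
    public static int quadrant(double a, double b) {
        long q = Math.round(rotationDegrees(a, b) / 90.0);
        return (int) (((q % 4) + 4) % 4) * 90;
    }

    public static void main(String[] args) {
        // Values copied from the output above: [a, b, c, d, e, f].
        double[] upright = {7.787, 0.0, 0.0, 7.7706, 406.46, 64.9355};
        double[] rotated = {0.0, 7.7706, -7.787, 0.0, 94.1495, 118.1203};

        System.out.println(quadrant(upright[0], upright[1])); // 0
        System.out.println(quadrant(rotated[0], rotated[1])); // 90
    }
}
```

Under these assumptions the first group of characters has no rotation and the second group is rotated by 90 degrees, which matches the observation that the rotated matrices start with zero: cos 90° = 0, so the scale moves from a and d into b and c.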

0 answers:

No answers