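I am trying to work out the positions of characters within a PDF page. I have implemented a class that extends PDFTextStripper, and I call setSortByPosition(true) so that charactersByArticle is ordered according to the page structure. I have a PDF (here is the link; see page 10) that contains both landscape and portrait characters. My class holds all the sorted TextPosition characters for every page. Now, when I look at page 10, the landscape characters appear after the portrait character sequence. Here is the code: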
public static void main(String[] args) throws IOException
{
    TextExtractor textExtractor = new TextExtractor("C:\\Users\\prabhjot.rai\\Desktop\\Demo\\sbet201601.pdf");
    PDFPage page = textExtractor.getPages().get(10);
    // Print each character with its x coordinate and its text matrix.
    for (List<TextPosition> article : page.getCharactersByArticle()) {
        for (TextPosition text : article) {
            System.out.print(text.getUnicode() + " ");
            System.out.print(text.getX() + " ");
            System.out.print(text.getTextMatrix() + " ");
            System.out.println();
        }
    }
}
Here are the last lines of the output for page 10:
1 406.46 [7.787,0.0,0.0,7.7706,406.46,64.9355]
4 410.64938 [7.787,0.0,0.0,7.7706,410.64938,64.9355]
Y 250.9882 [7.787,0.0,0.0,7.7706,250.9882,46.5761]
E 256.3768 [7.787,0.0,0.0,7.7706,256.3768,46.5761]
A 261.7654 [7.787,0.0,0.0,7.7706,261.7654,46.5761]
R 267.15402 [7.787,0.0,0.0,7.7706,267.15402,46.5761]
P 94.1495 [0.0,7.7706,-7.787,0.0,94.1495,118.1203]
e 94.1495 [0.0,7.7706,-7.787,0.0,94.1495,123.499886]
r 94.1495 [0.0,7.7706,-7.787,0.0,94.1495,127.680466]
c 94.1495 [0.0,7.7706,-7.787,0.0,94.1495,130.07147]
e 94.1495 [0.0,7.7706,-7.787,0.0,94.1495,134.25516]
n 94.1495 [0.0,7.7706,-7.787,0.0,94.1495,138.43575]
t 94.1495 [0.0,7.7706,-7.787,0.0,94.1495,142.61633]
9 486.465 [0.0,7.98,-7.98,0.0,486.465,37.2]
486.465 [0.0,7.98,-7.98,0.0,486.465,41.220325]
486.465 [0.0,7.98,-7.98,0.0,486.465,43.199364]
| 486.465 [0.0,7.98,-7.98,0.0,486.465,45.178402]
486.465 [0.0,7.98,-7.98,0.0,486.465,46.678642]
486.465 [0.0,7.98,-7.98,0.0,486.465,48.838825]
N 486.465 [0.0,7.98,-7.98,0.0,486.465,50.817863]
F 486.465 [0.0,7.98,-7.98,0.0,486.465,56.579422]
I 486.465 [0.0,7.98,-7.98,0.0,486.465,61.016304]
B 486.465 [0.0,7.98,-7.98,0.0,486.465,63.65609]
486.465 [0.0,7.98,-7.98,0.0,486.465,68.99631]
S 486.465 [0.0,7.98,-7.98,0.0,486.465,71.09585]
m 486.465 [0.0,7.98,-7.98,0.0,486.465,75.53273]
a 486.465 [0.0,7.98,-7.98,0.0,486.465,81.59274]
l 486.465 [0.0,7.98,-7.98,0.0,486.465,85.135864]
l 486.465 [0.0,7.98,-7.98,0.0,486.465,87.3543]
486.465 [0.0,7.98,-7.98,0.0,486.465,89.57274]
B 486.465 [0.0,7.98,-7.98,0.0,486.465,91.79278]
u 486.465 [0.0,7.98,-7.98,0.0,486.465,97.132996]
s 486.465 [0.0,7.98,-7.98,0.0,486.465,101.15252]
i 486.465 [0.0,7.98,-7.98,0.0,486.465,104.2727]
n 486.465 [0.0,7.98,-7.98,0.0,486.465,106.491135]
e 486.465 [0.0,7.98,-7.98,0.0,486.465,110.51146]
s 486.465 [0.0,7.98,-7.98,0.0,486.465,114.05458]
s 486.465 [0.0,7.98,-7.98,0.0,486.465,117.17476]
486.465 [0.0,7.98,-7.98,0.0,486.465,120.294136]
E 486.465 [0.0,7.98,-7.98,0.0,486.465,122.334625]
c 486.465 [0.0,7.98,-7.98,0.0,486.465,127.19444]
o 486.465 [0.0,7.98,-7.98,0.0,486.465,130.73756]
n 486.465 [0.0,7.98,-7.98,0.0,486.465,134.75789]
o 486.465 [0.0,7.98,-7.98,0.0,486.465,138.77742]
m 486.465 [0.0,7.98,-7.98,0.0,486.465,142.79774]
i 486.465 [0.0,7.98,-7.98,0.0,486.465,148.85776]
c 486.465 [0.0,7.98,-7.98,0.0,486.465,151.0762]
486.465 [0.0,7.98,-7.98,0.0,486.465,154.61932]
T 486.465 [0.0,7.98,-7.98,0.0,486.465,156.71886]
r 486.465 [0.0,7.98,-7.98,0.0,486.465,161.57947]
e 486.465 [0.0,7.98,-7.98,0.0,486.465,164.21924]
n 486.465 [0.0,7.98,-7.98,0.0,486.465,167.76236]
d 486.465 [0.0,7.98,-7.98,0.0,486.465,171.78268]
s 486.465 [0.0,7.98,-7.98,0.0,486.465,175.80222]
486.465 [0.0,7.98,-7.98,0.0,486.465,178.9224]
486.465 [0.0,7.98,-7.98,0.0,486.465,180.84238]
486.465 [0.0,7.98,-7.98,0.0,486.465,182.82141]
M 486.465 [0.0,7.98,-7.98,0.0,486.465,184.8]
o 486.465 [0.0,7.98,-7.98,0.0,486.465,191.46011]
n 486.465 [0.0,7.98,-7.98,0.0,486.465,195.48204]
t 486.465 [0.0,7.98,-7.98,0.0,486.465,199.50397]
h 486.465 [0.0,7.98,-7.98,0.0,486.465,201.72401]
l 486.465 [0.0,7.98,-7.98,0.0,486.465,205.74594]
y 486.465 [0.0,7.98,-7.98,0.0,486.465,207.96599]
486.465 [0.0,7.98,-7.98,0.0,486.465,211.50592]
R 486.465 [0.0,7.98,-7.98,0.0,486.465,213.48495]
e 486.465 [0.0,7.98,-7.98,0.0,486.465,218.34477]
p 486.465 [0.0,7.98,-7.98,0.0,486.465,221.8847]
o 486.465 [0.0,7.98,-7.98,0.0,486.465,225.90663]
r 486.465 [0.0,7.98,-7.98,0.0,486.465,229.92856]
t 486.465 [0.0,7.98,-7.98,0.0,486.465,233.04794]
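I can clearly see that getTextMatrix() produces different matrices for landscape and portrait characters. (Portrait: [7.787,0.0,0.0,7.7706,406.46,64.9355], landscape: [0.0,7.7706,-7.787,0.0,94.1495,118.1203]; the landscape matrix starts with zero in its first position.) I would like to understand the matrix better so that I can reliably tell the two types apart. I have read the documentation but could not understand it. Can anyone share some articles or ideas on this?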
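One way to read these numbers, as far as I can tell: the six values look like a PDF affine matrix printed in the order [a, b, c, d, e, f], where e and f are the translation and a, b, c, d carry scale and rotation. A pure rotation by an angle θ with scale s gives [s·cos θ, s·sin θ, -s·sin θ, s·cos θ, e, f], so θ = 0° yields [s, 0, 0, s, e, f] (the portrait rows) and θ = 90° yields [0, s, -s, 0, e, f] (the landscape rows). Below is a minimal sketch of that idea in plain Java. The class name TextOrientation and its methods are hypothetical helpers, not part of PDFBox, and the [a, b, c, d, e, f] ordering is an assumption based on the output above.

```java
/** Hypothetical helper: derives a glyph's rotation from its text-matrix values. */
final class TextOrientation {

    private TextOrientation() {
    }

    /**
     * Returns the baseline rotation in whole degrees, normalised to [0, 360).
     * Assumes the matrix is printed as [a, b, c, d, e, f]; (a, b) is the
     * direction of the glyph's x axis after the transform.
     */
    static int rotationDegrees(double a, double b) {
        long rounded = Math.round(Math.toDegrees(Math.atan2(b, a)));
        return (int) (((rounded % 360) + 360) % 360);
    }

    /** True when the text does not run left to right along the page's x axis. */
    static boolean isRotated(double a, double b) {
        return rotationDegrees(a, b) != 0;
    }

    public static void main(String[] args) {
        // First two values of the sample matrices from the output above.
        System.out.println("portrait sample:  " + rotationDegrees(7.787, 0.0) + " degrees");
        System.out.println("landscape sample: " + rotationDegrees(0.0, 7.7706) + " degrees");
    }
}
```

With the sample rows, the portrait matrix comes out as 0 degrees and the landscape matrices as 90 degrees, so grouping characters by this angle before sorting might keep the two kinds of text apart. PDFBox's TextPosition also appears to offer getDir(), which reports a text direction in degrees; it may be worth checking whether that already gives the same answer in your version.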