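I am using PDFBox with Java to find the position (coordinates) of text in a PDF. The positions I get for horizontal text are correct, but the positions for vertical text are wrong. Here is the code: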
// Called by PDFTextStripper for each run of text; splits the run into words.
protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
    String wordSeparator = getWordSeparator();
    List<TextPosition> word = new ArrayList<>();
    for (TextPosition text : textPositions) {
        String thisChar = text.getUnicode();
        // pageWid and pageHe are fields of the enclosing class
        pageWid = text.getPageWidth();
        pageHe = text.getPageHeight();
        if (thisChar != null) {
            if (thisChar.length() >= 1) {
                if (!thisChar.equals(wordSeparator)) {
                    word.add(text);
                } else if (!word.isEmpty()) {
                    // a separator ends the current word
                    printWord(word);
                    word.clear();
                }
            }
        }
    }
    // flush the last word of the run
    if (!word.isEmpty()) {
        printWord(word);
        word.clear();
    }
}
// Builds the bounding box of a word as the union of its character boxes.
void printWord(List<TextPosition> word) {
    Rectangle2D boundingBox = null;
    StringBuilder builder = new StringBuilder();
    for (TextPosition text : word) {
        /* int rot = text.getRotation();
        System.out.println("rotation is " + rot); */
        Rectangle2D box = new Rectangle2D.Float(text.getX(), text.getY(), text.getWidthDirAdj(),
                text.getHeightDir());
        if (boundingBox == null)
            boundingBox = box;
        else
            boundingBox.add(box);
        builder.append(text.getUnicode());
    }
    String words = builder.toString().toLowerCase();
    System.out.println("the word is" + words + "length is" + words.length());
    // single-character words are skipped
    if (words.length() != 1) {
        returncordinates(words, boundingBox.getX(), boundingBox.getY(), boundingBox.getHeight(),
                boundingBox.getWidth());
    }
}
With this code I can find the coordinates of horizontal text:
Output: think {'x': 80.97092447794397, 'y': 36.56483345224815, 'height': 1.3265644959675444, 'page_no': 3, 'width': 6.51770994998429}
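One thing that may be relevant, offered only as an unverified sketch: the box in `printWord` combines `getX()`/`getY()` with `getWidthDirAdj()`/`getHeightDir()`, which are measured relative to the text direction. For text running at 90 or 270 degrees, the extent along the text direction lies on the page's vertical axis, so width and height would trade places. The standalone example below does not use PDFBox. `glyphBox` and `union` are hypothetical helpers, the direction value is assumed to come from something like `TextPosition.getDir()`, and treating `(x, y)` as the box's top-left corner is an assumption that may not match what PDFBox reports.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

public class RotatedWordBox {

    /**
     * Hypothetical helper: builds a page-space box for one glyph.
     * widthAlongText and heightAcrossText are measured relative to the text
     * direction; for 90/270 degree text they are swapped to get page-space
     * width and height. Assumes (x, y) is the top-left corner of the glyph.
     */
    static Rectangle2D glyphBox(float x, float y, float widthAlongText,
                                float heightAcrossText, float directionDegrees) {
        boolean vertical = Math.round(directionDegrees) % 180 != 0;
        float pageWidth = vertical ? heightAcrossText : widthAlongText;
        float pageHeight = vertical ? widthAlongText : heightAcrossText;
        return new Rectangle2D.Float(x, y, pageWidth, pageHeight);
    }

    /** Hypothetical helper: union of all glyph boxes, or null for an empty list. */
    static Rectangle2D union(List<Rectangle2D> boxes) {
        Rectangle2D result = null;
        for (Rectangle2D box : boxes) {
            if (result == null) {
                // copy so the first glyph's box is not mutated by add()
                result = (Rectangle2D) box.clone();
            } else {
                result.add(box);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Three made-up glyphs stacked down the page, text direction 90 degrees.
        List<Rectangle2D> boxes = new ArrayList<>();
        boxes.add(glyphBox(100f, 50f, 6f, 10f, 90f));
        boxes.add(glyphBox(100f, 56f, 6f, 10f, 90f));
        boxes.add(glyphBox(100f, 62f, 6f, 10f, 90f));

        Rectangle2D word = union(boxes);
        System.out.println("x=" + word.getX() + " y=" + word.getY()
                + " width=" + word.getWidth() + " height=" + word.getHeight());
    }
}
```

With the swap, the word box for the made-up vertical glyphs comes out taller than it is wide, which is the shape expected for vertical text. Whether the real coordinates also need `getXDirAdj()`/`getYDirAdj()` or an offset by the glyph height is something I have not verified.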