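Can anyone help me get the real pixel coordinates of a PDF's BT (begin text) sections? I'm using PDFBox to retrieve text from PDF files, but now I need to get the text sections/paragraphs themselves.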
$contents = $page->getContents();
$contentsStream = $page->getContents()->getStream();
$resources = $page->getResources();
$fonts = $resources->getFonts();
$xobjects = $resources->getImages();
$tokens = $contentsStream->getStreamTokens();
[PDFOperator {q},COSFloat {690.48},COSInt {0},COSInt {0},COSFloat {633.6},COSInt {0},COSInt {0},PDFOperator {cm},COSName {im1},PDFOperator {Do},PDFOperator {Q},
PDFOperator {BT},COSInt {1},COSInt {0},COSInt {0},COSInt {1},COSFloat {25.92},COSFloat {588.48},PDFOperator {Tm},COSInt {99},PDFOperator {Tz},COSName {F30},COSInt {56},PDFOperator {Tf},COSInt {3},PDFOperator {Tr},COSFloat {0.334},PDFOperator {Tc},COSString {Pospremanj},PDFOperator {Tj},COSInt {0},PDFOperator {Tc},COSString {e},PDFOperator {Tj},COSFloat {9.533},PDFOperator {Tw},COSString {i},PDFOperator {Tj},COSFloat {6.062},PDFOperator {Tw},COSFloat {0.95},PDFOperator {Tc},COSString {ciscenj},PDFOperator {Tj},COSInt {0},PDFOperator {Tc},COSString {e},PDFOperator {Tj},COSInt {1},COSInt {0},COSInt {0},COSInt {1},COSFloat {55.68},COSFloat {539.76},PDFOperator {Tm},COSInt {0},PDFOperator {Tw},COSFloat {0.262},PDFOperator {Tc},COSString {uoè},PDFOperator {Tj},COSInt {0},PDFOperator {Tc},COSString {i},PDFOperator {Tj},COSFloat {5.443},PDFOperator {Tw},COSFloat {-2.145},PDFOperator {Tc},COSString {zimslco},PDFOperator {Tj},COSInt {0},PDFOperator {Tc},COSString {g},PDFOperator {Tj},COSFloat {7.202},PDFOperator {Tw},COSFloat {-0.148},PDFOperator {Tc},COSString {odmor},PDFOperator {Tj},COSInt {0},PDFOperator {Tc},COSString {a},PDFOperator {Tj},PDFOperator {ET},
PDFOperator {BT},COSInt {1},COSInt {0},COSInt {0},COSInt {1},COSFloat {6.72},COSFloat {513.12},PDFOperator {Tm},COSInt {0},PDFOperator {Tw},COSName {F30},COSInt {14},PDFOperator {Tf},COSString {},PDFOperator {Tj},COSFloat {2.751},PDFOperator {Tw}, ...
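I'd like to get output similar to that of the PrintTextLocations function, for every word/character. I can get the bottom and left coordinates, but how do I get the width and the top coordinate?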
PrintTextLocations:
Answer 0 (score: 1)
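...Since the BT section gives you the bottom-left coordinate, you need to parse all the words/letters contained in the current BT block to get all the other coordinates. Height of the first word + BT bottom = top; max(left coordinate + width) = right; bottom of the last word = bottom coordinate.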
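The rule above can be sketched in PHP roughly as follows. This is only an illustration and not part of PDFBox: the function name `blockBounds` and the glyph array keys are hypothetical, and it assumes PDF user space (origin at the bottom-left, y growing upward). Instead of relying on the first and last word, it takes the min/max over all glyphs, which should give the same result for ordinary top-to-bottom text.

```php
<?php
/**
 * Hypothetical helper: bounding box of one BT block from its glyph boxes.
 * Each glyph is ['x' => left, 'y' => bottom, 'width' => w, 'height' => h]
 * in PDF user-space units (assumption: y grows upward).
 */
function blockBounds(array $glyphs): array
{
    if ($glyphs === []) {
        throw new InvalidArgumentException('A BT block needs at least one glyph.');
    }

    $left = INF;
    $bottom = INF;
    $right = -INF;
    $top = -INF;

    foreach ($glyphs as $g) {
        $left   = min($left, $g['x']);
        $bottom = min($bottom, $g['y']);
        $right  = max($right, $g['x'] + $g['width']);   // max(left + width)
        $top    = max($top, $g['y'] + $g['height']);    // bottom + height
    }

    return ['left' => $left, 'bottom' => $bottom, 'right' => $right, 'top' => $top];
}

// The x/y values are the Tm operands from the question's first BT block;
// the widths and heights are made-up sample numbers.
$bounds = blockBounds([
    ['x' => 25.92, 'y' => 588.48, 'width' => 30.0,  'height' => 40.0],
    ['x' => 55.68, 'y' => 539.76, 'width' => 100.0, 'height' => 40.0],
]);
```

If your coordinates come from a top-left origin instead (y growing downward), swap the roles of min and max for the vertical edges.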
I hope this helps someone...
Example string for a single letter:
string[32.94,35.099976 fs=8.0 xscale=1.0 height=4.4240003 space=2.2240002 width=3.959999]p
Extracted, parsed and prepared line:
32.94,35.099976 fs=8.0 xscale=1.0 height=4.4240003 space=2.2240002 width=3.959999
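As an alternative to stripping the labels by hand, a line in this format could be split into named fields with a regular expression. The sketch below is an assumption-laden illustration: `parseTextPositionLine` is a hypothetical name, and the pattern only covers the plain decimal format shown in the sample, with or without the surrounding `string[...]` wrapper.

```php
<?php
/**
 * Hypothetical helper: split one PrintTextLocations-style line into fields.
 * Returns null when the line does not match the expected format.
 */
function parseTextPositionLine(string $line): ?array
{
    $num = '(-?\d+(?:\.\d+)?)'; // plain decimal number, captured
    $pattern = '/^(?:string\[)?' . $num . ',' . $num
        . '\s+fs=' . $num
        . '\s+xscale=' . $num
        . '\s+height=' . $num
        . '\s+space=' . $num
        . '\s+width=' . $num
        . '(?:\](.*))?$/';

    if (preg_match($pattern, trim($line), $m) !== 1) {
        return null;
    }

    return [
        'x'      => (float) $m[1],
        'y'      => (float) $m[2],
        'fs'     => (float) $m[3],
        'xscale' => (float) $m[4],
        'height' => (float) $m[5],
        'space'  => (float) $m[6],
        'width'  => (float) $m[7],
        'text'   => $m[8] ?? '', // character(s) after the closing bracket, if any
    ];
}

$parsed = parseTextPositionLine(
    'string[32.94,35.099976 fs=8.0 xscale=1.0 height=4.4240003 space=2.2240002 width=3.959999]p'
);
```

Returning named keys avoids depending on the positional order of the exploded fields.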
Function:
/**
 * Parse a single word / letter element.
 *
 * @param string $str_raw  Extracted word string line.
 * @param string $str_elem Element of interest (word or character).
 * @param int    $pdf_w    PDF page width.
 * @param int    $pdf_h    PDF page height.
 * @param int    $pdf_d    PDF page DPI (target resolution).
 * @param int    $pdf_r    PDF page relative DPI (72 by default).
 *
 * @return array
 */
function createRealCoordinates($str_raw, $str_elem, $pdf_w, $pdf_h, $pdf_d = 400, $pdf_r = 72)
{
    // Strip the field labels so only the numeric values remain.
    $stringstrip = array('fs=', 'xscale=', 'height=', 'space=', 'width=');
    $string_info = str_replace($stringstrip, '', $str_raw);
    $coord_info = explode(' ', $string_info);
    $coord_xy = explode(',', $coord_info[0]);

    $coord = array(
        'pdfWidth' => $pdf_w,
        'pdfHeight' => $pdf_h,
        'pdfDpi' => $pdf_d,
        'pdfRel' => $pdf_r,
        'word' => $str_elem,
        'x1' => null,
        'y1' => null,
        'x2' => null,
        'y2' => null,
        'fontSize' => null,
        'xScale' => null,
        'HeightDir' => null,
        'WidthDir' => null,
        'WidthOfSpace' => null,
    );

    // Left, Bottom coordinate.
    $coord['x1'] = ($coord_xy[0] / $pdf_r) * $pdf_d;
    $coord['y2'] = ($coord_xy[1] / $pdf_r) * $pdf_d;

    $coord['fontSize'] = $coord_info[1]; // font size.
    $coord['xScale'] = $coord_info[2]; // x size scale.
    $coord['HeightDir'] = $coord_info[3]; // height.
    $coord['WidthDir'] = $coord_info[5]; // word width.
    $coord['WidthOfSpace'] = ($coord_info[4] / $pdf_r) * $pdf_d; // width of space.

    // Right, Top coordinate (top = bottom - height, i.e. y grows downward here).
    $coord['x2'] = $coord['x1'] + (($coord['WidthDir'] / $pdf_r) * $pdf_d);
    $coord['y1'] = $coord['y2'] - (($coord['HeightDir'] / $pdf_r) * $pdf_d);

    return $coord;
}
-matija kancijan
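As a worked check of the arithmetic in `createRealCoordinates`, the conversion it applies is simply value / 72 * DPI. The snippet below is a minimal sketch using the sample letter from the answer; `pointsToPixels` is a hypothetical helper name, and the 400 DPI target is the function's default rather than anything PDFBox prescribes.

```php
<?php
/**
 * Hypothetical helper: convert a length from PDF units (72 per inch by
 * default) to pixels at the given resolution.
 */
function pointsToPixels(float $points, float $dpi = 400.0, float $unitsPerInch = 72.0): float
{
    return $points / $unitsPerInch * $dpi;
}

// Values taken from the sample line in the answer.
$x = 32.94;
$y = 35.099976;
$height = 4.4240003;
$width = 3.959999;

// Same convention as createRealCoordinates: y grows downward,
// so the top edge is the bottom edge minus the glyph height.
$box = [
    'x1' => pointsToPixels($x),           // left, about 183 px
    'y2' => pointsToPixels($y),           // bottom, about 195 px
    'x2' => pointsToPixels($x + $width),  // right, about 205 px
    'y1' => pointsToPixels($y - $height), // top, about 170.4 px
];
```

Converting the sum or difference in PDF units first, as done here, is equivalent to converting each term separately and then adding, since the scaling is linear.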