I have an XML file that contains bounding-box information from a PDF file:
<page bbox="0.000,0.000,612.000,792.000" rotate="0">
<textbox bbox="21.600,733.350,65.644,751.641">
<textline bbox="21.600,733.350,65.644,751.641">
<text bbox="21.600,733.350,30.258,751.641">L</text>
<text bbox="30.258,733.350,37.486,751.641">i</text>
<text bbox="37.486,733.350,44.714,751.641">n</text>
<text bbox="44.714,733.350,48.315,751.641">e</text>
<text bbox="48.315,733.350,55.543,751.641">#</text>
<text bbox="55.543,733.350,62.043,751.641">1</text>
</textline>
</textbox>
<textbox bbox="21.600,714.720,140.775,729.494">
<textline bbox="21.600,714.720,140.775,729.494">
<text bbox="27.438,714.720,29.769,729.494" ncolour="1" size="14.774">L</text>
<text bbox="21.600,714.720,27.438,729.494" ncolour="1" size="14.774">i</text>
<text bbox="29.769,714.720,35.019,729.494" ncolour="1" size="14.774">n</text>
<text bbox="35.019,714.720,40.857,729.494" ncolour="1" size="14.774">e</text>
<text bbox="40.857,714.720,43.188,729.494" ncolour="1" size="14.774">#</text>
<text bbox="43.188,714.720,49.026,729.494" ncolour="1" size="14.774">2</text>
</textline>
</textbox>
<textbox bbox="223.560,717.899,457.560,754.481">
<textline bbox="223.560,717.899,457.560,754.481">
<text font="EAAAAA+ArialUnicodeMS" bbox="223.560,717.899,242.332,754.481" colourspace="DeviceGray" ncolour="0.098" size="36.582">L</text>
<text font="EAAAAA+ArialUnicodeMS" bbox="242.332,717.899,248.104,754.481" colourspace="DeviceGray" ncolour="0.098" size="36.582">i</text>
<text font="EAAAAA+ArialUnicodeMS" bbox="248.104,717.899,261.104,754.481" colourspace="DeviceGray" ncolour="0.098" size="36.582">n</text>
<text font="EAAAAA+ArialUnicodeMS" bbox="261.104,717.899,275.560,754.481" colourspace="DeviceGray" ncolour="0.098" size="36.582">e</text>
<text font="EAAAAA+ArialUnicodeMS" bbox="275.560,717.899,281.332,754.481" colourspace="DeviceGray" ncolour="0.098" size="36.582">#</text>
<text font="EAAAAA+ArialUnicodeMS" bbox="281.332,717.899,295.788,754.481" colourspace="DeviceGray" ncolour="0.098" size="36.582">1</text>
</textline>
</textbox>
</page>
The above simply spells out:
Line #1
Line #2
Line #1
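I am trying to parse the PDF file into text while preserving its layout. To do that, I think I need to use the bbox information to work out each item's position on the page (that is, the line number each textline tag falls on). I have the page's width and height; in this case they are 612.000 and 792.000.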
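To make the idea concrete, here is a minimal sketch of how a bbox value could be turned into a position measured from the top of the page. It assumes each bbox is `x0,y0,x1,y1` in points with the origin at the bottom-left corner of the page, which is what the sample suggests but is not confirmed. `parse_bbox` and `top_offset` are hypothetical helper names, not part of any library.

```python
def parse_bbox(bbox):
    """Split an 'x0,y0,x1,y1' attribute string into four floats."""
    x0, y0, x1, y1 = (float(v) for v in bbox.split(","))
    return x0, y0, x1, y1


def top_offset(bbox, page_height):
    """Distance from the top of the page to the top edge of the box.

    Assumption: y grows upward from the bottom of the page, so the
    top edge of a box is its y1 value.
    """
    _, _, _, y1 = parse_bbox(bbox)
    return page_height - y1


# The page bbox from the sample gives the page height as its fourth value.
page_height = parse_bbox("0.000,0.000,612.000,792.000")[3]
print(top_offset("21.600,733.350,65.644,751.641", page_height))
```

A smaller offset means the box sits higher on the page, so sorting by this value would give a top-to-bottom reading order.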
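Can anyone point me in the right direction? So far I can only walk each line and append its characters to spell out each word, but then every <textbox> is treated as a separate line: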
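import xml.etree.ElementTree as ET

tree = ET.parse(path_to_xml_file)
root = tree.getroot()
lines = {}
textboxes = root.findall('./page/textbox')
for textbox in textboxes:
    # Each textbox holds one or more textline elements.
    for line in textbox.findall('./textline'):
        # Each textline holds one text element per character.
        for char in line.findall('./text'):
            [...]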
How can I parse the bbox information and work out which line number each <textline> tag is on?
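For illustration, one possible direction is sketched below: collect every textline, sort by the top edge of its bbox, and start a new row whenever the top edge drops more than some tolerance below the first textline of the current row. This is a heuristic, not a definitive method. The `tolerance` of 5 points, the choice of the top edge (rather than the bottom edge or the midpoint), the `number_textlines` and `parse_bbox` names, and the `<pages>` root wrapper are all assumptions. `SAMPLE` is a trimmed-down stand-in for the XML above that keeps only the last two characters of each textline.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the XML in the question (assumed <pages> root).
SAMPLE = """<pages>
<page bbox="0.000,0.000,612.000,792.000" rotate="0">
<textbox bbox="21.600,733.350,65.644,751.641">
<textline bbox="21.600,733.350,65.644,751.641">
<text bbox="48.315,733.350,55.543,751.641">#</text>
<text bbox="55.543,733.350,62.043,751.641">1</text>
</textline>
</textbox>
<textbox bbox="21.600,714.720,140.775,729.494">
<textline bbox="21.600,714.720,140.775,729.494">
<text bbox="40.857,714.720,43.188,729.494">#</text>
<text bbox="43.188,714.720,49.026,729.494">2</text>
</textline>
</textbox>
<textbox bbox="223.560,717.899,457.560,754.481">
<textline bbox="223.560,717.899,457.560,754.481">
<text bbox="275.560,717.899,281.332,754.481">#</text>
<text bbox="281.332,717.899,295.788,754.481">1</text>
</textline>
</textbox>
</page>
</pages>"""


def parse_bbox(bbox):
    """Split an 'x0,y0,x1,y1' attribute string into a tuple of floats."""
    return tuple(float(v) for v in bbox.split(","))


def number_textlines(page, tolerance=5.0):
    """Return (row_number, text) pairs for every <textline> on a page.

    A textline joins the current row when its top edge (y1) is at most
    `tolerance` points below the top edge of the row's first textline.
    Assumes y grows upward, so a larger y1 is higher on the page.
    """
    items = []
    for textline in page.iter("textline"):
        x0, _, _, y1 = parse_bbox(textline.get("bbox"))
        text = "".join(t.text or "" for t in textline.findall("text"))
        items.append((y1, x0, text))

    rows = []  # each row: {"top": y1 of its first member, "items": [(x0, text)]}
    for y1, x0, text in sorted(items, key=lambda item: -item[0]):
        if rows and rows[-1]["top"] - y1 <= tolerance:
            rows[-1]["items"].append((x0, text))
        else:
            rows.append({"top": y1, "items": [(x0, text)]})

    numbered = []
    for number, row in enumerate(rows, start=1):
        # Within a row, order the textlines left to right.
        for _, text in sorted(row["items"]):
            numbered.append((number, text))
    return numbered


root = ET.fromstring(SAMPLE)
for page in root.iter("page"):
    for number, text in number_textlines(page):
        print(number, text)
```

Note that the third textline uses a much larger font, so its bbox overlaps both of the others vertically. Comparing top edges places it on the first row, whereas comparing bottom edges would place it nearer the second, so which edge to compare depends on how the layout should be interpreted.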