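I am trying to build my own "table extraction" feature for PDF files: "table-like" columns can be defined at the top of the PDF document, and the text is then extracted in a tabular format.
I have defined the columns below at 30% and 60% of the page:
The columns are provided as follows: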
{"1":{"position":"33"},"2":{"position":"60"}}
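A minimal sketch of how such percentage positions could be turned into crop boxes, assuming each `position` is the right-hand boundary of a column as a percentage of the page width and that the last column runs to the right edge of the page. The helper name `column_boxes` is hypothetical and not part of my code:

```python
def column_boxes(columns, page_width):
    """Return one (x, width) pair per column, left to right.

    `columns` has the shape {"1": {"position": "33"}, "2": {"position": "60"}}.
    N boundaries yield N + 1 columns; the last one is the rest of the page.
    """
    # Sort by numeric key so that "10" comes after "2".
    positions = [float(columns[key]["position"]) for key in sorted(columns, key=int)]
    # Edges of all columns: left page edge, each boundary, right page edge.
    edges = [0] + [round(page_width * p / 100) for p in positions] + [round(page_width)]
    return [(left, right - left) for left, right in zip(edges, edges[1:])]
```

For a page that is 600 points wide, the definition above would give three boxes starting at x = 0, 198 and 360.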
Below is my Python method, which reads the PDF file, splits it into the columns, and extracts the text of each one.
import json
import os
import subprocess
from collections import defaultdict

from PyPDF2 import PdfFileReader

# GHOSTSCRIPT is assumed to be defined elsewhere as the path to the Ghostscript executable.


def convertPDFTextToTableData(pdf_file, save_dir, COLUMNS):
    # Get the width/height of the first page of the PDF file.
    with open(pdf_file, 'rb') as pdf:
        dimensions = PdfFileReader(pdf).getPage(0).mediaBox
        width = float(dimensions[2])
        height = float(dimensions[3])

    # Use ghostscript to get the number of pages
    os.system(GHOSTSCRIPT + ' -dBATCH -q -dNODISPLAY -c "("' +
              pdf_file + '") (r) file runpdfbegin pdfpagecount = quit" >tmp.tmp')
    with open("tmp.tmp", "r") as f:
        npages = int(f.read())
    os.remove("tmp.tmp")

    column = defaultdict(list)
    firstWidth = 0  # x offset (left edge) of the current column
    for i, col in enumerate(COLUMNS):
        # Despite its name, this is the width of the crop box for column i + 1.
        pixelsrightcorner = round(
            width * (float(COLUMNS[str(i + 1)]['position']) / 100) - firstWidth)
        area = (firstWidth, 0, pixelsrightcorner, int(height))
        cmd = ['pdftotext', '-f', str(1), '-l', str(npages), '-x', str(area[0]), '-y', str(area[1]),
               '-W', str(area[2]), '-H', str(area[3]), str(pdf_file), '-layout', '-']
        proc = subprocess.Popen(
            cmd, stdout=subprocess.PIPE, bufsize=0, text=True)
        out, err = proc.communicate()
        for line in out.splitlines():
            column[i + 1].append({"row": line})
        # The next column starts where this one ends.
        firstWidth = round(pixelsrightcorner + firstWidth)

    # Last column (rest of the page)
    lastColumn = len(COLUMNS)
    pixelsrightcorner = (
        (100 - float(COLUMNS[str(lastColumn)]['position'])) / 100) * width
    area = (firstWidth, 0, int(pixelsrightcorner), int(height))
    cmd = ['pdftotext', '-f', str(1), '-l', str(npages), '-x', str(area[0]), '-y', str(area[1]),
           '-W', str(area[2]), '-H', str(area[3]), str(pdf_file), '-layout', '-']
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, bufsize=0, text=True)
    out, err = proc.communicate()
    for line in out.splitlines():
        column[lastColumn + 1].append({"row": line})

    # Ensure that all arrays are the same length.
    # Pad the shorter arrays with {"row": ""} entries.
    longest_array = max(map(len, column.values()))
    final = {k: v + [{'row': ''}] * (longest_array - len(v))
             for k, v in column.items()}

    # Create JSON file.
    with open(os.path.join(save_dir, 'table_data.json'), "w") as f:
        json.dump(final, f)
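The `pdftotext` command is built twice with the same flags, once inside the loop and once for the last column. A small sketch of how that could be factored out, keeping the argument order used above. The helper name `build_pdftotext_cmd` is hypothetical:

```python
def build_pdftotext_cmd(pdf_file, first_page, last_page, x, y, w, h):
    """Build the pdftotext argument list for one crop box.

    -f/-l select the page range, -x/-y give the top-left corner of the
    crop area and -W/-H its size; the trailing '-' sends the text to stdout.
    """
    return ['pdftotext', '-f', str(first_page), '-l', str(last_page),
            '-x', str(x), '-y', str(y), '-W', str(w), '-H', str(h),
            str(pdf_file), '-layout', '-']
```

The result can be passed to `subprocess.Popen` exactly as in the method above.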
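Below is the output of the code. As you can see, the text extraction itself works as it should. The formatting of the extracted text does not, however: the rows are not aligned correctly.
Below is the JSON string that is actually written to table_data.json. Note that I have tried to minimize it, because it is very long.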
{
  "1":[
    {
      "row":"Commercial Invoice #1200"
    },
    [...]
    {
      "row":"www.domain.com"
    },
    [...]
    {
      "row":" Faucets PO#900"
    },
    {
      "row":""
    },
    {
      "row":" Faucets PO#901"
    },
    [...]
    {
      "row":" Total"
    },
    {
      "row":""
    }
  ],
  "2":[
    {
      "row":"Your Invoice: TMS"
    },
    {
      "row":""
    },
    [...]
  ],
  "3":[
    {
      "row":" Date: 25\/06 \u2013 2019"
    },
    [...]
    {
      "row":"USD 900"
    },
    {
      "row":""
    },
    {
      "row":"USD 100"
    },
    [...]
    {
      "row":"USD 1000"
    },
    [...]
  ]
}
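I think the crop coordinates passed to pdftotext (-x -y -W -H) are what cause the layout/formatting problem: the command only looks at a specific part of the page and interprets it as if it were the whole PDF file. As a result, the layout (the lines) is shifted.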
Any ideas on how to solve this? I know that pdftotext also offers -bbox-layout, but that option can only extract the whole PDF file, not a specific part of it.
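One direction that might work with -bbox-layout is to extract the whole page once and assign each word to a column afterwards, so that rows are formed across all columns at the same time. Below is a rough sketch, assuming the output is XHTML containing `page` elements with nested `word` elements that carry `xMin` and `yMin` attributes in points. The function name `rows_from_bbox`, the `boundaries` list (column separators in points) and the `y_tol` tolerance are hypothetical, and this is untested against my real invoices:

```python
import xml.etree.ElementTree as ET
from bisect import bisect_right


def _local(tag):
    # Drop an XML namespace prefix such as "{http://www.w3.org/1999/xhtml}".
    return tag.rsplit("}", 1)[-1]


def rows_from_bbox(xhtml, boundaries, y_tol=2.0):
    """Group words into rows by yMin and into columns by xMin.

    `boundaries` holds the x positions of the column separators in
    ascending order; N separators give N + 1 cells per row.
    """
    root = ET.fromstring(xhtml)
    table = []
    for page in root.iter():
        if _local(page.tag) != "page":
            continue
        # Collect every word on the page, ordered top to bottom.
        words = sorted(
            (float(w.get("yMin")), float(w.get("xMin")), w.text or "")
            for w in page.iter() if _local(w.tag) == "word"
        )
        row_y, cells = None, None
        for y, x, text in words:
            # Start a new row when the word is clearly below the current one.
            if row_y is None or y - row_y > y_tol:
                cells = [[] for _ in range(len(boundaries) + 1)]
                table.append(cells)
                row_y = y
            cells[bisect_right(boundaries, x)].append((x, text))
    # Within each cell, order the words left to right and join them.
    return [[" ".join(t for _, t in sorted(cell)) for cell in row] for row in table]
```

The XHTML string would come from running pdftotext with -bbox-layout on the whole file, and `boundaries` from multiplying the column percentages by the page width. Because every row always has one cell per column, no padding step is needed afterwards.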