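I need to extract table data from a large number of pages across many PDF documents. The built-in text export in Adobe Acrobat Reader is useless for this: text extracted that way loses the spatial relationships established by the table. Others have asked many questions about this problem, and I tried many of the suggested solutions, but the results varied between poor and terrible. So I set out to develop my own solution. It is developed enough (I think) that it is ready to share here.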
Answer 0 (score: 1)
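I first tried looking at the distribution of text (in terms of its x and y position on the page) to identify where the row and column breaks fall. Using the Python module 'pdfminer', I extracted the text along with its BoundingBox parameters, sifted through each piece of text, and mapped how much text was on the page at a given x or y value. The idea was to look at the distribution of text (horizontally for row breaks, vertically for column breaks); wherever the density dropped to zero (meaning there was a clear gap running across, or up and down, the table), that would identify a row or column break.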
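A minimal sketch of this density idea, assuming each piece of text has already been reduced to a (start, end) extent along one axis. The function name `find_gaps`, the `min_gap` parameter and the sample extents are hypothetical; they are not part of pdfminer or of the routine posted below.

```python
def find_gaps(intervals, min_gap=1.0):
    """Return the midpoints of the empty gaps between text extents.

    `intervals` is a list of (start, end) pairs, e.g. the x0/x1 values of
    each text bounding box projected onto the x axis (an assumption made
    for this sketch).
    """
    breaks = []
    covered_to = None
    for start, end in sorted(intervals):
        if covered_to is not None and start - covered_to >= min_gap:
            # No text covers (covered_to, start): its midpoint is a candidate break
            breaks.append((covered_to + start) / 2.0)
        covered_to = end if covered_to is None else max(covered_to, end)
    return breaks


# Hypothetical x-extents of the text boxes in a three-column table
boxes_x = [(10, 40), (12, 35), (60, 95), (62, 90), (120, 150)]
column_breaks = find_gaps(boxes_x)
print(column_breaks)
```

Running the same function on the y-extents would give candidate row breaks. Text that spans several columns closes the gaps, which is exactly where this approach starts to fail.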
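The idea does work, but only sometimes. It assumes that the table has the same number and alignment of cells vertically and horizontally (a simple grid), and that there is a clear gap between the text of adjacent cells. Also, if there is text that spans multiple columns (such as a title above the table, a footer below it, merged cells, etc.), identifying the column breaks is more difficult. You may be able to identify which text elements above or below the table should be ignored, but I couldn't find a good way of dealing with merged cells.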
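When it came to looking horizontally to identify the row breaks, there were several other challenges. First, pdfminer automatically tries to group pieces of text that are near each other, even when they span more than one cell in the table. In those cases the BoundingBox for that text object includes multiple lines, obscuring any row breaks that might have been crossed. Even if every line of text were extracted separately, the challenge would be to distinguish the normal spacing between consecutive lines of text from a row break.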
After exploring various workarounds and running a number of tests, I decided to step back and try another approach.
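The tables with the data I needed to extract all have borders around them, so I reasoned that I should be able to find the elements in the PDF file that draw those lines. However, when I looked at the elements I could extract from the source file, I got some surprising results.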
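You would think that the lines would be represented as "line objects", but you'd be wrong (at least for the files I was looking at). If they aren't "lines", then maybe the file simply draws a rectangle for each cell and adjusts the linewidth attribute to get the desired line thickness, right? No. It turned out that the lines were actually drawn as "rectangle objects" with one very small dimension (a narrow width to create vertical lines, or a short height to create horizontal lines). And where it looks like the lines meet at the corners, the rectangles don't: a very small rectangle fills the gap.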
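As an illustration of that observation, a thin rectangle can be classified from its bounding box alone. This is only a sketch: `classify_rect` and `max_thickness` are hypothetical names, and the sample coordinates are made up.

```python
def classify_rect(bbox, max_thickness=3):
    """Classify a rectangle's (x0, y0, x1, y1) bounding box as a table rule.

    Returns a list that may contain "vertical", "horizontal", both (a tiny
    corner filler) or nothing (an ordinary rectangle).
    """
    x0, y0, x1, y1 = bbox
    kinds = []
    if (x1 - x0) < max_thickness:   # narrow width: acts as a vertical line
        kinds.append("vertical")
    if (y1 - y0) < max_thickness:   # short height: acts as a horizontal line
        kinds.append("horizontal")
    return kinds


print(classify_rect((100, 50, 100.5, 300)))    # a tall, narrow rectangle
print(classify_rect((100, 50, 400, 50.5)))     # a wide, flat rectangle
print(classify_rect((100, 50, 100.5, 50.5)))   # a corner filler
```

The threshold plays the same role as a line-break tolerance and would probably need tuning per document.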
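Once I was able to recognize what to look for, I then had to contend with multiple rectangles placed next to each other to create thick lines. Ultimately, I wrote a routine to group values that are close together and compute an average value for each group, to be used later as the row and column breaks.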
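One simple way to do this kind of grouping is sketched here, under the assumption that values closer than a tolerance to the previous value in sorted order belong to the same line. It is not the exact routine used in the script; `group_and_average` and its `tolerance` parameter are hypothetical names.

```python
def group_and_average(values, tolerance=3):
    """Cluster nearby coordinates and return the mean of each cluster."""
    groups = []
    for value in sorted(values):
        if groups and value - groups[-1][-1] < tolerance:
            groups[-1].append(value)    # close to the previous value: same line
        else:
            groups.append([value])      # far from the previous value: new line
    return [sum(group) / float(len(group)) for group in groups]


# Hypothetical midpoints of thin rectangles; stacked ones form a thick line
midpoints = [100.0, 100.5, 101.0, 250.0, 251.0]
print(group_and_average(midpoints))
```

Sorting first keeps the grouping independent of the order in which the rectangles appear in the file.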
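Now it was a matter of processing the text from the table. I chose to use a SQLite database to store, analyze, and regroup the text from the PDF file. I know there are other "pythonic" options, and some may find those approaches more familiar and easier to use, but I felt that the amount of data I would be dealing with would be best handled using an actual database file.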
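The regrouping step can be sketched with an in-memory database: store each text fragment with its bounding box, then select the fragments that fall inside a cell's boundaries. The table name, column names, helper `cell_text` and sample data are all hypothetical. The fragments are joined in Python rather than with `group_concat`, because an outer `ORDER BY` is not guaranteed to control the order in which an aggregate concatenates its inputs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
curs = conn.cursor()
curs.execute("CREATE TABLE pdf_text (txt TEXT, x0 INTEGER, y0 INTEGER, "
             "x1 INTEGER, y1 INTEGER)")

# Hypothetical text fragments: (text, x0, y0, x1, y1)
fragments = [
    ("Widget", 12, 80, 50, 90),
    ("Large", 12, 68, 45, 78),
    ("42", 110, 80, 125, 90),
]
curs.executemany("INSERT INTO pdf_text VALUES (?, ?, ?, ?, ?)", fragments)


def cell_text(curs, left, bottom, right, top):
    """Return the text lying fully inside one cell, top line first."""
    curs.execute(
        "SELECT txt FROM pdf_text "
        "WHERE x0 >= ? AND x1 <= ? AND y0 >= ? AND y1 <= ? "
        "ORDER BY y0 DESC, x0 ASC",
        (left, right, bottom, top))
    return " ".join(row[0] for row in curs.fetchall())


first_cell = cell_text(curs, 10, 60, 100, 95)
second_cell = cell_text(curs, 100, 60, 200, 95)
conn.close()
print(first_cell)
print(second_cell)
```

Using `?` placeholders instead of string formatting also avoids quoting problems if the values ever come from outside the script.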
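As I mentioned earlier, pdfminer groups text that is located near each other, and a group may cross cell boundaries. My initial attempt to split the text of one of these groups into separate lines was only partially successful; this is one of the areas I intend to develop further (namely, how to bypass pdfminer's LTTextBox routine so I can get the pieces individually).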
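A rough sketch of splitting a grouped text block into per-line boxes, assuming all lines in the block have the same height and that y increases upward (as in PDF coordinates). `split_lines` is a hypothetical helper, and this even division is an approximation rather than the exact arithmetic used in the script.

```python
def split_lines(text, bbox):
    """Split multi-line text into (line, bbox) pairs, top line first.

    Assumes equal line heights inside the (x0, y0, x1, y1) box.
    """
    x0, y0, x1, y1 = bbox
    lines = text.split("\n")
    line_height = (y1 - y0) / float(len(lines))
    result = []
    top = y1
    for line in lines:
        bottom = top - line_height
        if line.strip():                 # skip blank lines but keep their space
            result.append((line.strip(), (x0, bottom, x1, top)))
        top = bottom
    return result


pieces = split_lines("alpha\nbeta", (10, 100, 60, 120))
print(pieces)
```

Each piece can then be tested against the row breaks on its own, so a block that straddles a cell border no longer hides the break.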
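There is another shortcoming of the pdfminer module when it comes to vertical text. I was unable to identify any attribute that indicates when text is vertical, or at what angle it is displayed (e.g., +90 or -90 degrees). The text grouping routine doesn't seem to know either: for text rotated +90 degrees (i.e., rotated CCW, where the letters read from the bottom up), it concatenates the letters in reverse order, separated by newline characters.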
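One possible workaround, offered as an assumption rather than a tested fix: if a 6-value text matrix (a, b, c, d, e, f) can be obtained for a character (pdfminer's `LTChar` objects appear to carry one as `matrix`, which should be checked against the installed version), the rotation follows from its first two entries. Both helper names below are hypothetical.

```python
import math


def rotation_degrees(matrix):
    """Return the counter-clockwise rotation (0-359) implied by a text matrix."""
    a, b = matrix[0], matrix[1]
    return int(round(math.degrees(math.atan2(b, a)))) % 360


def unscramble_vertical(text):
    """Undo the reversed, newline-separated order described above."""
    return "".join(reversed(text.split("\n")))


print(rotation_degrees((1, 0, 0, 1, 0, 0)))     # upright text
print(rotation_degrees((0, 1, -1, 0, 0, 0)))    # rotated CCW by a quarter turn
print(unscramble_vertical("C\nB\nA"))
```

The unscrambling helper only reverses the pieces; it would still need to be applied selectively, to text already identified as rotated.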
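That said, the routine below works fairly well under the circumstances. I know it is still rough, has several enhancements yet to be made, and isn't packaged in a way that is ready for widespread distribution, but it seems to have "cracked the code" on how to extract table data from PDF files (for the most part). Hopefully, others will be able to use it for their own purposes and perhaps even improve on it.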
I welcome any ideas, suggestions, or recommendations.
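EDIT: I have posted a revised version with additional parameters (cell_htol_up, etc.) to help "tune" the algorithm that decides which pieces of text belong to a particular cell in the table.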
# This was written for use w/Python 2. Use w/Python 3 hasn't been tested & proper execution is not guaranteed.
import os # Library of Operating System routines
import sys # Library of System routines
import sqlite3 # Library of SQLite dB routines
import re # Library for Regular Expressions
import csv # Library to output as Comma Separated Values
import codecs # Library of text Codec types
import cStringIO # Library of String manipulation routines
from pdfminer.pdfparser import PDFParser # Library of PDF text extraction routines
from pdfminer.pdfdocument import PDFDocument, PDFNoOutlines
from pdfminer.pdfpage import PDFPage, PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.layout import LAParams, LTTextBox, LTTextLine, LTFigure, LTImage, LTLine, LTRect, LTTextBoxVertical
from pdfminer.converter import PDFPageAggregator
#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
def add_new_value (new_value, list_values=None):
    # Used to exclude duplicate values in a list
    # (a None default avoids sharing one mutable list between calls)
    if list_values is None:
        list_values = []
    not_in_list = True
    for list_value in list_values:
        # if list_value == new_value:
        if abs(list_value - new_value) < 1:
            not_in_list = False
    if not_in_list:
        list_values.append(new_value)
    return list_values
#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
def condense_list (list_values, grp_tolerance = 1):
    # Group values & eliminate duplicate/close values
    tmp_list = []
    for n, list_value in enumerate(list_values):
        if sum(1 for val in tmp_list if abs(val - list_values[n]) < grp_tolerance) == 0:
            # Average every value that lies within the tolerance of this one
            tmp_val = sum(val for val in list_values if abs(val - list_values[n]) < grp_tolerance) / \
                      sum(1 for val in list_values if abs(val - list_values[n]) < grp_tolerance)
            tmp_list.append(int(round(tmp_val)))
    return tmp_list
#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, quotechar = '"', quoting=csv.QUOTE_ALL, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()
    def writerow(self, row):
        self.writer.writerow([unicode(s).encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)
    def writerows(self, rows):
        for row in rows:
            self.writerow(row)
#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
# In case a connection to the database can't be created, set 'conn' to 'None'
conn = None
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# Define variables for use later on
#_______________________________________________________________________________________________________________________
sqlite_file = "pdf_table_text.sqlite" # Name of the sqlite database file
brk_tol = 3 # Tolerance for grouping LTRect values as line break points
# *** This may require tuning to get optimal results ***
cell_htol_lf = -2 # Horizontal & Vertical tolerances (up/down/left/right)
cell_htol_rt = 2 # for over-scanning table cell bounding boxes
cell_vtol_up = 8 # i.e., how far outside cell bounds to look for text to include
cell_vtol_dn = 0 # *** This may require tuning to get optimal results ***
replace_newlines = True # Switch for replacing newline codes (\n) with spaces
replace_multspaces = True # Switch for replacing multiple spaces with a single space
# txt_concat_str = "' '" # Concatenate cell data with a single space
txt_concat_str = "char(10)" # Concatenate cell data with a line feed
#=======================================================================================================================
# Default values for sample input & output files (path, filename, pagelist, etc.)
filepath = "" # Path of the source PDF file (default = current folder)
srcfile = "" # Name of the source PDF file (quit if left blank)
pagelist = [1, ] # Pages to extract table data (Make an interactive input?)
# --> THIS MUST BE IN THE FORM OF A LIST OR TUPLE!
#=======================================================================================================================
# Impose required conditions & abort execution if they're not met
# Should check if files are locked: sqlite database, input & output files, etc.
if srcfile == "" or not pagelist:
    print "Source file not specified and/or page list is blank! Execution aborted!"
    sys.exit()
dmp_pdf_data = "pdf_data.csv"
dmp_tbl_data = "tbl_data.csv"
destfile = srcfile[:-3]+"csv"
#=======================================================================================================================
# First test to see if this file already exists & delete it if it does
if os.path.isfile(sqlite_file):
    os.remove(sqlite_file)
#=======================================================================================================================
try:
    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    # Open or Create the SQLite database file
    #___________________________________________________________________________________________________________________
    print "-" * 120
    print "Creating SQLite Database & working tables ..."
    # Connecting to the database file
    conn = sqlite3.connect(sqlite_file)
    curs = conn.cursor()
    qry_create_table = "CREATE TABLE {tn} ({nf} {ft} PRIMARY KEY)"
    qry_alter_add_column = "ALTER TABLE {0} ADD COLUMN {1}"
    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    # Create 1st Table
    #___________________________________________________________________________________________________________________
    tbl_pdf_elements = "tbl_pdf_elements" # Name of the 1st table to be created
    new_field = "idx" # Name of the index column
    field_type = "INTEGER" # Column data type
    # Delete the table if it exists so old data is cleared out
    curs.execute("DROP TABLE IF EXISTS " + tbl_pdf_elements)
    # Create output table for PDF text w/1 column (index) & set it as PRIMARY KEY
    curs.execute(qry_create_table.format(tn=tbl_pdf_elements, nf=new_field, ft=field_type))
    # Table fields: index, text_string, pg, x0, y0, x1, y1, orient
    cols = ("'pdf_text' TEXT",
            "'pg' INTEGER",
            "'x0' INTEGER",
            "'y0' INTEGER",
            "'x1' INTEGER",
            "'y1' INTEGER",
            "'orient' INTEGER")
    # Add other columns
    for col in cols:
        curs.execute(qry_alter_add_column.format(tbl_pdf_elements, col))
    # Committing changes to the database file
    conn.commit()
    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    # Create 2nd Table
    #___________________________________________________________________________________________________________________
    tbl_table_data = "tbl_table_data" # Name of the 2nd table to be created
    new_field = "idx" # Name of the index column
    field_type = "INTEGER" # Column data type
    # Delete the table if it exists so old data is cleared out
    curs.execute("DROP TABLE IF EXISTS " + tbl_table_data)
    # Create output table for Table Data w/1 column (index) & set it as PRIMARY KEY
    curs.execute(qry_create_table.format(tn=tbl_table_data, nf=new_field, ft=field_type))
    # Table fields: index, text_string, pg, row, column
    cols = ("'tbl_text' TEXT",
            "'pg' INTEGER",
            "'row' INTEGER",
            "'col' INTEGER")
    # Add other columns
    for col in cols:
        curs.execute(qry_alter_add_column.format(tbl_table_data, col))
    # Committing changes to the database file
    conn.commit()
    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    # Start PDF text extraction code here
    #___________________________________________________________________________________________________________________
    print "Opening PDF file & preparing for text extraction:"
    print " -- " + filepath + srcfile
    # Open a PDF file.
    fp = open(filepath + srcfile, "rb")
    # Create a PDF parser object associated with the file object.
    parser = PDFParser(fp)
    # Create a PDF document object that stores the document structure.
    # Supply the password for initialization (if needed)
    # document = PDFDocument(parser, password)
    document = PDFDocument(parser)
    # Check if the document allows text extraction. If not, abort.
    if not document.is_extractable:
        raise PDFTextExtractionNotAllowed
    # Create a PDF resource manager object that stores shared resources.
    rsrcmgr = PDFResourceManager()
    # Create a PDF device object.
    device = PDFDevice(rsrcmgr)
    # Create a PDF interpreter object.
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Set parameters for analysis.
    laparams = LAParams()
    # Create a PDF page aggregator object.
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    # Extract text & location data from PDF file (examine & process only pages in the page list)
    #___________________________________________________________________________________________________________________
    # Initialize variables
    idx1 = 0
    idx2 = 0
    lastpg = max(pagelist)
    print "Starting text extraction ..."
    qry_insert_pdf_txt = "INSERT INTO " + tbl_pdf_elements + " VALUES(?, ?, ?, ?, ?, ?, ?, ?)"
    qry_get_pdf_txt = "SELECT group_concat(pdf_text, " + txt_concat_str + \
                      ") FROM {0} WHERE pg=={1} AND x0>={2} AND x1<={3} AND y0>={4} AND y1<={5} ORDER BY y0 DESC, x0 ASC;"
    qry_insert_tbl_data = "INSERT INTO " + tbl_table_data + " VALUES(?, ?, ?, ?, ?)"
    # Process each page contained in the document.
    for i, page in enumerate(PDFPage.create_pages(document)):
        interpreter.process_page(page)
        # Get the LTPage object for the page.
        lt_objs = device.get_result()
        pg = device.pageno - 1 # Must subtract 1 to correct 'pageno'
        # Exit the loop if past last page to parse
        if pg > lastpg:
            break
        #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        # If it finds a page in the pagelist, process the contents
        if pg in pagelist:
            print "- Processing page {0} ...".format(pg)
            xbreaks = []
            ybreaks = []
            #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
            # Iterate thru list of pdf layout elements (LT* objects) then capture the text & attributes of each
            for lt_obj in lt_objs:
                # Examine LT objects & get parameters for text strings
                if isinstance(lt_obj, LTTextBox) or isinstance(lt_obj, LTTextLine):
                    # Increment index
                    idx1 += 1
                    # Assign PDF LTText object parameters to variables
                    pdftext = lt_obj.get_text() # Need to convert escape codes & unicode characters!
                    pdftext = pdftext.strip() # Remove leading & trailing whitespaces
                    # Save integer bounding box coordinates: round down @ start, round up @ end
                    # (x0, y0, x1, y1) = lt_obj.bbox
                    x0 = int(lt_obj.bbox[0])
                    y0 = int(lt_obj.bbox[1])
                    x1 = int(lt_obj.bbox[2] + 1)
                    y1 = int(lt_obj.bbox[3] + 1)
                    orient = 0 # What attribute gets this value?
                    #---- These approaches don't work for identifying vertical text ... --------------------------------
                    # orient = lt_obj.rotate
                    # orient = lt_obj.char_disp
                    # if lt_obj.get_writing_mode == "tb-rl":
                    #     orient = 90
                    # if isinstance(lt_obj, LTTextBoxVertical): # vs LTTextBoxHorizontal
                    #     orient = 90
                    # if LAParams(lt_obj).detect_vertical:
                    #     orient = 90
                    #---------------------------------------------------------------------------------------------------
                    # Split text strings at line feeds
                    if "\n" in pdftext:
                        substrs = pdftext.split("\n")
                        lineheight = (y1-y0) / (len(substrs) + 1)
                        # y1 = y0 + lineheight
                        y0 = y1 - lineheight
                        for substr in substrs:
                            substr = substr.strip() # Remove leading & trailing whitespaces
                            if substr != "":
                                # Insert values into tuple for uploading into dB
                                pdf_txt_export = [(idx1, substr, pg, x0, y0, x1, y1, orient)]
                                # Insert values into dB
                                curs.executemany(qry_insert_pdf_txt, pdf_txt_export)
                                conn.commit()
                                idx1 += 1
                            # Step down one line for the next substring
                            # y0 = y1
                            # y1 = y0 + lineheight
                            y1 = y0
                            y0 = y1 - lineheight
                    else:
                        # Insert values into tuple for uploading into dB
                        pdf_txt_export = [(idx1, pdftext, pg, x0, y0, x1, y1, orient)]
                        # Insert values into dB
                        curs.executemany(qry_insert_pdf_txt, pdf_txt_export)
                        conn.commit()
                elif isinstance(lt_obj, LTLine):
                    # LTLine - Lines drawn to define tables
                    pass
                elif isinstance(lt_obj, LTRect):
                    # LTRect - Borders drawn to define tables
                    # Grab the lt_obj.bbox values
                    x0 = round(lt_obj.bbox[0], 2)
                    y0 = round(lt_obj.bbox[1], 2)
                    x1 = round(lt_obj.bbox[2], 2)
                    y1 = round(lt_obj.bbox[3], 2)
                    xmid = round((x0 + x1) / 2, 2)
                    ymid = round((y0 + y1) / 2, 2)
                    # rectline = lt_obj.linewidth
                    # If width less than tolerance, assume it's used as a vertical line
                    if (x1 - x0) < brk_tol: # Vertical Line or Corner
                        xbreaks = add_new_value(xmid, xbreaks)
                    # If height less than tolerance, assume it's used as a horizontal line
                    if (y1 - y0) < brk_tol: # Horizontal Line or Corner
                        ybreaks = add_new_value(ymid, ybreaks)
                elif isinstance(lt_obj, LTImage):
                    # An image, so do nothing
                    pass
                elif isinstance(lt_obj, LTFigure):
                    # LTFigure objects are containers for other LT* objects which shouldn't matter, so do nothing
                    pass
            col_breaks = condense_list(xbreaks, brk_tol) # Group similar values & eliminate duplicates
            row_breaks = condense_list(ybreaks, brk_tol)
            col_breaks.sort()
            row_breaks.sort()
            #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
            # Regroup the text into table 'cells'
            #___________________________________________________________________________________________________________
            print " -- Text extraction complete. Grouping data for table ..."
            row_break_prev = 0
            col_break_prev = 0
            table_data = []
            table_rows = len(row_breaks)
            for i, row_break in enumerate(row_breaks):
                if row_break_prev == 0: # Skip the rest the first time thru
                    row_break_prev = row_break
                else:
                    for j, col_break in enumerate(col_breaks):
                        if col_break_prev == 0: # Skip query the first time thru
                            col_break_prev = col_break
                        else:
                            # Run query to get all text within cell lines (+/- htol & vtol values)
                            curs.execute(qry_get_pdf_txt.format(tbl_pdf_elements, pg, col_break_prev + cell_htol_lf, \
                                col_break + cell_htol_rt, row_break_prev + cell_vtol_dn, row_break + cell_vtol_up))
                            rows = curs.fetchall() # Retrieve all rows
                            for row in rows:
                                if row[0] != None: # Skip null results
                                    idx2 += 1
                                    table_text = row[0]
                                    if replace_newlines: # Option - Replace newline codes (\n) with spaces
                                        table_text = table_text.replace("\n", " ")
                                    if replace_multspaces: # Option - Replace multiple spaces w/single space
                                        table_text = re.sub(" +", " ", table_text)
                                    table_data.append([idx2, table_text, pg, table_rows - i, j])
                            col_break_prev = col_break
                    row_break_prev = row_break
            curs.executemany(qry_insert_tbl_data, table_data)
            conn.commit()
    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    # Export the regrouped table data:
    # Determine the number of columns needed for the output file
    # -- Should the data be extracted all at once or one page at a time?
    print "Saving exported table data ..."
    qry_col_count = "SELECT MIN([col]) AS colmin, MAX([col]) AS colmax, MIN([row]) AS rowmin, MAX([row]) AS rowmax, " + \
                    "COUNT([row]) AS rowttl FROM [{0}] WHERE [pg] = {1} AND [tbl_text]!=' ';"
    qry_sql_export = "SELECT * FROM [{0}] WHERE [pg] = {1} AND [row] = {2} AND [tbl_text]!=' ' ORDER BY [col];"
    f = open(filepath + destfile, "wb")
    writer = UnicodeWriter(f)
    for pg in pagelist:
        curs.execute(qry_col_count.format(tbl_table_data, pg))
        rows = curs.fetchall()
        if len(rows) > 1:
            print "Error retrieving row & column counts! More than one record returned!"
            print " -- ", qry_col_count.format(tbl_table_data, pg)
            print rows
            sys.exit()
        for row in rows:
            (col_min, col_max, row_min, row_max, row_ttl) = row
            # Insert a page separator
            writer.writerow(["Data for Page {0}:".format(pg), ])
            if row_ttl == 0:
                writer.writerow(["Unable to export text from PDF file. No table structure found.", ])
            else:
                k = 0
                for j in range(row_min, row_max + 1):
                    curs.execute(qry_sql_export.format(tbl_table_data, pg, j))
                    rows = curs.fetchall()
                    if rows == None: # No records match the given criteria
                        pass
                    else:
                        i = 1
                        k += 1
                        column_data = [k, ] # 1st column as an Index
                        for row in rows:
                            (idx, tbl_text, pg_num, row_num, col_num) = row
                            if pg_num != pg: # Exit the loop if Page # doesn't match
                                break
                            while i < col_num:
                                column_data.append("")
                                i += 1
                                if i >= col_num or i == col_max: break
                            column_data.append(unicode(tbl_text))
                            i += 1
                        writer.writerow(column_data)
    f.close()
    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    # Dump the SQLite regrouped data (for error checking):
    print "Dumping SQLite table of regrouped (table) text ..."
    qry_sql_export = "SELECT * FROM [{0}] WHERE [tbl_text]!=' ' ORDER BY [pg], [row], [col];"
    curs.execute(qry_sql_export.format(tbl_table_data))
    rows = curs.fetchall()
    # Output data with Unicode intact as CSV (the 'with' block closes the file)
    with open(dmp_tbl_data, "wb") as f:
        writer = UnicodeWriter(f)
        writer.writerow(["idx", "tbl_text", "pg", "row", "col"])
        writer.writerows(rows)
    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    # Dump the SQLite temporary PDF text data (for error checking):
    print "Dumping SQLite table of extracted PDF text ..."
    qry_sql_export = "SELECT * FROM [{0}] WHERE [pdf_text]!=' ' ORDER BY pg, y0 DESC, x0 ASC;"
    curs.execute(qry_sql_export.format(tbl_pdf_elements))
    rows = curs.fetchall()
    # Output data with Unicode intact as CSV (the 'with' block closes the file)
    with open(dmp_pdf_data, "wb") as f:
        writer = UnicodeWriter(f)
        writer.writerow(["idx", "pdf_text", "pg", "x0", "y0", "x1", "y1", "orient"])
        writer.writerows(rows)
    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    print "Conversion complete."
    print "-" * 120
except sqlite3.Error as e:
    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    # Rollback the last database transaction if the connection fails
    #___________________________________________________________________________________________________________________
    if conn:
        conn.rollback()
    print "Error '{0}':".format(e.args[0])
    sys.exit(1)
finally:
    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    # Close the connection to the database file
    #___________________________________________________________________________________________________________________
    if conn:
        conn.close()