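Google Vision document text detection does a good job of detecting symbols and words, but it groups text strictly by lines and paragraphs, and even then the grouping sometimes does not make logical sense for documents with structured text.
I have looked through the API documentation and cannot find any example or reference for supplying hints (other than language) to change how it parses a document. One possible solution is to pre-process the document and run Google's API on it one section at a time, but I would prefer to use Google's API directly, without intermediate steps.
The code I am using is taken directly from Google's Vision PDF Python sample, and the problem can be reproduced with it unchanged: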
https://cloud.google.com/vision/docs/pdf
from google.cloud import storage
import re


def async_detect_document(gcs_source_uri, gcs_destination_uri):
    """OCR with PDF/TIFF as source files on GCS"""
    from google.cloud import vision

    mime_type = 'application/pdf'
    batch_size = 2

    client = vision.ImageAnnotatorClient()

    feature = vision.types.Feature(
        type=vision.enums.Feature.Type.DOCUMENT_TEXT_DETECTION)

    gcs_source = vision.types.GcsSource(uri=gcs_source_uri)
    input_config = vision.types.InputConfig(
        gcs_source=gcs_source, mime_type=mime_type)

    gcs_destination = vision.types.GcsDestination(uri=gcs_destination_uri)
    output_config = vision.types.OutputConfig(
        gcs_destination=gcs_destination, batch_size=batch_size)

    async_request = vision.types.AsyncAnnotateFileRequest(
        features=[feature], input_config=input_config,
        output_config=output_config)

    operation = client.async_batch_annotate_files(
        requests=[async_request])

    print('Waiting for the operation to finish.')
    operation.result(timeout=180)

    storage_client = storage.Client()

    match = re.match(r'gs://([^/]+)/(.+)', gcs_destination_uri)
    bucket_name = match.group(1)
    prefix = match.group(2)

    bucket = storage_client.get_bucket(bucket_name)

    # List objects with the given prefix.
    blob_list = list(bucket.list_blobs(prefix=prefix))
    print('Output files:')
    for blob in blob_list:
        print(blob.name)
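To get from the output files listed by that function to the JSON that is inspected later in this question, something like the following could be used. It is only a sketch: `split_gcs_uri` and `load_output_json` are hypothetical helper names, and it assumes the installed google-cloud-storage version provides `Blob.download_as_bytes()`.

```python
import json
import re


def split_gcs_uri(gcs_uri):
    """Split 'gs://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    match = re.match(r'gs://([^/]+)/(.+)', gcs_uri)
    if match is None:
        raise ValueError('Not a gs://bucket/prefix URI: ' + gcs_uri)
    return match.group(1), match.group(2)


def load_output_json(gcs_destination_uri):
    """Yield each output file under the destination prefix as a dict."""
    # Imported lazily so that split_gcs_uri can be used (and tested)
    # without google-cloud-storage being installed.
    from google.cloud import storage

    bucket_name, prefix = split_gcs_uri(gcs_destination_uri)
    bucket = storage.Client().get_bucket(bucket_name)
    for blob in bucket.list_blobs(prefix=prefix):
        # Assumption: download_as_bytes() exists in the installed
        # google-cloud-storage version.
        yield json.loads(blob.download_as_bytes())
```

Each dict yielded by `load_output_json` has the same shape as the `_json` argument used further down, so the results can be passed on one file at a time.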
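The results need to be grouped differently: logically, the document's address should be grouped together. Here is the structure of one document (first 4 lines):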
-----------------------------------------------------------------
| Active | 2415 ST PETER STREET | $500,000 (LP)
| R2222222 | Port Moody Centre | (SP)
| Board: V, Attached | Port Moody |
| House/Single Family | V3G 2T5 |
-----------------------------------------------------------------
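Using the sample provided by Google and printing the results together with the page, block, and paragraph numbers hopefully shows what is happening. We can see that the text at the top of the file is in the wrong place, and that the information shown above is spread across several paragraphs and starts at block 5 when it should start at block 0: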
********* Page Number: 0*************
********* Block Number: 0*************
********* Paragraph Number: 0*************
MAIN FLOOR
BASEMENT
TOTAL FINISHED AREA
UNFINISHED"
TOTAL AREA
[ snip ]
********* Block Number: 5*************
********* Paragraph Number: 10*************
Active
2415 ST PETER STREET
********* Paragraph Number: 11*************
$500,000 (LP)
R2222222
Port Moody
********* Paragraph Number: 12*************
(SP)
Board: V, Attached
********* Paragraph Number: 13*************
Port Moody Centre
********* Paragraph Number: 15*************
V3G 2T5
********* Paragraph Number: 16*************
[ SNIP ]
The output above was generated with this:
def annotate(_json):
    annotation = _json['responses'][0]['fullTextAnnotation']
    line = 0
    paranum = 0
    blcknum = 0
    pgenum = 0
    for page in annotation['pages']:
        print('********* Page Number: ' + str(pgenum) + '*************')
        pgenum += 1
        for block in page['blocks']:
            print('********* Block Number: ' + str(blcknum) + '*************')
            blcknum += 1
            for paragraph in block['paragraphs']:
                print('********* Paragraph Number: ' + str(paranum) + '*************')
                paranum += 1
                for word in paragraph['words']:
                    for symbol in word['symbols']:
                        print(symbol['text'], end='')
                        try:
                            bType = symbol['property']['detectedBreak']['type']
                            if bType == 'SPACE':
                                print(' ', end='')
                            if bType == 'EOL_SURE_SPACE':
                                print(' ')
                            if bType == 'LINE_BREAK':
                                print('')
                        except KeyError:
                            print('', end='')
                        line += 1
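For comparison, the same JSON also carries a `boundingBox` for every word, so the words can be regrouped by position after the fact. The following is only a sketch of that kind of post-processing, not a feature of the API and not the direct solution asked for above. `group_words_into_rows` and its `tolerance` parameter are hypothetical, and it assumes each box lists its corners under `normalizedVertices` (or `vertices`), with zero coordinates possibly omitted from the JSON.

```python
def word_text(word):
    """Join the symbols of one word into a string."""
    return ''.join(symbol['text'] for symbol in word['symbols'])


def word_box(word):
    """Return (left, top, right, bottom) for a word's bounding box."""
    box = word['boundingBox']
    # Assumption: corners are under 'normalizedVertices' or 'vertices'.
    verts = box.get('normalizedVertices') or box.get('vertices') or []
    if not verts:
        return 0, 0, 0, 0
    # A coordinate equal to zero may be missing from the JSON.
    xs = [v.get('x', 0) for v in verts]
    ys = [v.get('y', 0) for v in verts]
    return min(xs), min(ys), max(xs), max(ys)


def group_words_into_rows(page, tolerance=0.5):
    """Regroup a page's words into visual rows, ignoring blocks/paragraphs.

    Two words share a row when their vertical centres differ by at most
    `tolerance` times the taller of the two heights.
    """
    words = []
    for block in page['blocks']:
        for paragraph in block['paragraphs']:
            for word in paragraph['words']:
                left, top, right, bottom = word_box(word)
                words.append({
                    'text': word_text(word),
                    'left': left,
                    'cy': (top + bottom) / 2,
                    'h': bottom - top,
                })

    # Top to bottom first, then left to right.
    words.sort(key=lambda w: (w['cy'], w['left']))

    rows = []
    for w in words:
        last = rows[-1] if rows else None
        if last and abs(w['cy'] - last['cy']) <= tolerance * max(w['h'], last['h']):
            last['words'].append(w)
        else:
            # The first word of a row fixes the row's reference centre.
            rows.append({'cy': w['cy'], 'h': w['h'], 'words': [w]})

    return [
        ' '.join(w['text'] for w in sorted(row['words'], key=lambda w: w['left']))
        for row in rows
    ]
```

It would be called once per page, for example on each entry of `annotation['pages']`. Splitting a row into columns would need a further pass over the horizontal gaps between words, which this sketch does not attempt.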