Question

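Google Vision document text detection does a good job of detecting symbols and words, but it groups text strictly into lines and paragraphs, and even then, on documents with structured text, the grouping sometimes does not make logical sense.

I have gone through the API documentation and cannot find any example or reference for supplying hints (other than language) to change how it parses the document. One possible solution is to preprocess the document and send it to Google's API one piece at a time, but I would prefer to use Google's API directly, with no intermediate step.

The code I am using is taken directly from Google's Vision PDF Python sample, and the issue can be reproduced with it without any changes: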

https://cloud.google.com/vision/docs/pdf

from google.cloud import storage
import re

def async_detect_document(gcs_source_uri, gcs_destination_uri):
    """OCR with PDF/TIFF as source files on GCS"""
    from google.cloud import vision
    mime_type = 'application/pdf'
    batch_size = 2
    client = vision.ImageAnnotatorClient()

    feature = vision.types.Feature(
        type=vision.enums.Feature.Type.DOCUMENT_TEXT_DETECTION)

    gcs_source = vision.types.GcsSource(uri=gcs_source_uri)
    input_config = vision.types.InputConfig(
        gcs_source=gcs_source, mime_type=mime_type)

    gcs_destination = vision.types.GcsDestination(uri=gcs_destination_uri)
    output_config = vision.types.OutputConfig(
        gcs_destination=gcs_destination, batch_size=batch_size)

    async_request = vision.types.AsyncAnnotateFileRequest(
        features=[feature], input_config=input_config,
        output_config=output_config)

    operation = client.async_batch_annotate_files(
        requests=[async_request])

    print('Waiting for the operation to finish.')
    operation.result(timeout=180)
    storage_client = storage.Client()

    match = re.match(r'gs://([^/]+)/(.+)', gcs_destination_uri)
    bucket_name = match.group(1)
    prefix = match.group(2)

    bucket = storage_client.get_bucket(bucket_name)

    # List objects with the given prefix.
    blob_list = list(bucket.list_blobs(prefix=prefix))
    print('Output files:')
    for blob in blob_list:
        print(blob.name)
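
For completeness, the JSON files listed above still have to be downloaded and parsed before they can be inspected. A minimal sketch, assuming each output object is a JSON file and that the blobs expose `name` and `download_as_string()` as `google.cloud.storage` blobs do; `load_responses` is a hypothetical helper name, not part of Google's sample:

```python
import json


def load_responses(blobs):
    """Parse every JSON output blob into a plain dict."""
    parsed = []
    for blob in blobs:
        # Assumption: result files end in '.json'; anything else is skipped.
        if not blob.name.endswith('.json'):
            continue
        # json.loads accepts the raw bytes returned by the download call.
        parsed.append(json.loads(blob.download_as_string()))
    return parsed
```

Each returned dict has a top-level `responses` list, so something like `load_responses(blob_list)[0]` can be passed to a function that reads `_json['responses'][0]['fullTextAnnotation']`.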

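The results need to be grouped differently; logically, the document's address should be grouped together. Here is the structure of one document (first 4 lines):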

-----------------------------------------------------------------
| Active              |  2415 ST PETER STREET |  $500,000 (LP) 
| R2222222            |    Port Moody Centre  |           (SP) 
| Board: V, Attached  |       Port Moody      |
| House/Single Family |         V3G 2T5       |
-----------------------------------------------------------------
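
Read column by column, the mock-up above gives the grouping being asked for. A minimal sketch, assuming the pipe-separated mock-up itself is used as input; `MOCKUP` and `columns_from_mockup` are hypothetical names, and this only illustrates the target output, not anything the Vision API does:

```python
MOCKUP = """\
| Active              |  2415 ST PETER STREET |  $500,000 (LP)
| R2222222            |    Port Moody Centre  |           (SP)
| Board: V, Attached  |       Port Moody      |
| House/Single Family |         V3G 2T5       |
"""


def columns_from_mockup(text):
    """Collect the non-empty cells of a pipe-separated mock-up, column by column."""
    columns = []
    for row in text.splitlines():
        # Drop the leading (and optional trailing) pipe, then split into cells.
        cells = row.strip().strip('|').split('|')
        for index, cell in enumerate(cells):
            if index == len(columns):
                columns.append([])
            if cell.strip():
                columns[index].append(cell.strip())
    return columns
```

With this input, `columns_from_mockup(MOCKUP)[1]` is `['2415 ST PETER STREET', 'Port Moody Centre', 'Port Moody', 'V3G 2T5']`, which is the address grouped as one unit.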

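Running the sample provided by Google and printing the results together with the page, block, and paragraph numbers should show what is happening. The text from the top of the file ends up in the wrong place, and the information shown above is spread across several paragraphs and starts in block 5 when it should start in block 0: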

********* Page Number: 0*************
********* Block Number: 0*************
********* Paragraph Number: 0*************
MAIN FLOOR 
BASEMENT 
TOTAL FINISHED AREA 
UNFINISHED" 
TOTAL AREA

[ snip ]

********* Block Number: 5*************
********* Paragraph Number: 10*************
Active 
2415 ST PETER STREET
********* Paragraph Number: 11*************
$500,000 (LP) 
R2222222 
Port Moody
********* Paragraph Number: 12*************
(SP) 
Board: V, Attached
********* Paragraph Number: 13*************
Port Moody Centre 
********* Paragraph Number: 15*************
V3G 2T5 
********* Paragraph Number: 16*************

[ SNIP ]

The output above was generated with this:

def annotate(_json):
    """Print the OCR text with running page, block, and paragraph numbers."""
    annotation = _json['responses'][0]['fullTextAnnotation']
    # The counters run across the whole response; they are not reset per page or block.
    paranum = 0
    blcknum = 0
    pgenum = 0
    for page in annotation['pages']:
        print('********* Page Number: ' + str(pgenum) + '*************')
        pgenum += 1
        for block in page['blocks']:
            print('********* Block Number: ' + str(blcknum) + '*************')
            blcknum += 1
            for paragraph in block['paragraphs']:
                print('********* Paragraph Number: ' + str(paranum) + '*************')
                paranum += 1
                for word in paragraph['words']:
                    for symbol in word['symbols']:
                        print(symbol['text'], end='')
                        # Only symbols followed by a break carry 'detectedBreak'.
                        try:
                            bType = symbol['property']['detectedBreak']['type']
                        except KeyError:
                            continue
                        if bType == 'SPACE':
                            print(' ', end='')
                        elif bType == 'EOL_SURE_SPACE':
                            print(' ')
                        elif bType == 'LINE_BREAK':
                            print('')
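
For reference, a post-processing fallback could ignore the block and paragraph grouping entirely and regroup the words by position. A minimal sketch, assuming each word in the JSON carries a `boundingBox` with `normalizedVertices` (PDF/TIFF results) or `vertices` (image results); the helper names and the `y_tolerance` value are hypothetical and would need tuning per document:

```python
def word_text(word):
    """Join a word's symbols into a string."""
    return ''.join(symbol['text'] for symbol in word['symbols'])


def word_center(word):
    """Return the (x, y) center of a word's bounding box."""
    box = word['boundingBox']
    # Assumption: one of these two keys is present and non-empty.
    vertices = box.get('normalizedVertices') or box.get('vertices')
    # Zero-valued coordinates can be omitted from the JSON, hence the defaults.
    xs = [vertex.get('x', 0) for vertex in vertices]
    ys = [vertex.get('y', 0) for vertex in vertices]
    return sum(xs) / len(xs), sum(ys) / len(ys)


def iter_words(page):
    """Yield every word on a page, regardless of block or paragraph."""
    for block in page['blocks']:
        for paragraph in block['paragraphs']:
            for word in paragraph['words']:
                yield word


def group_into_rows(page, y_tolerance=0.01):
    """Regroup a page's words into visual rows, each ordered left to right."""
    placed = sorted(((word_center(w), word_text(w)) for w in iter_words(page)),
                    key=lambda item: item[0][1])
    rows = []
    for (x, y), text in placed:
        # A word joins the current row if its center is close to the row's first word.
        if rows and abs(y - rows[-1]['y']) <= y_tolerance:
            rows[-1]['words'].append((x, text))
        else:
            rows.append({'y': y, 'words': [(x, text)]})
    return [' '.join(text for _, text in sorted(row['words'])) for row in rows]
```

Calling `group_into_rows(annotation['pages'][0])` returns one string per visual row. This only restores reading order; splitting each row into columns would need a further pass over the x positions, and it remains the kind of intermediate step I was hoping to avoid.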

Is there a way to specify the document's structure/layout so that the OCR processes the document in a particular manner?

0 answers: