Is there a way to specify the structure/layout of a document so that OCR processes it in a particular manner?

Asked: 2019-06-01 04:56:37

Tags: google-cloud-vision

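Google Vision's document text detection does a good job of detecting symbols and words, but it strictly groups text into lines and paragraphs. Even so, for documents that contain structured text, the resulting grouping sometimes does not make logical sense.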

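I have looked through the API documentation and could not find an example of, or a reference to, a way of providing hints (other than language) that change how the document is parsed. One possible solution is to pre-process the document and run it through Google's API one piece at a time, but I would prefer to use Google's API directly, without intermediate steps.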

The code I am using is taken directly from Google's Vision PDF Python sample, and the problem can be reproduced with it without any changes:

https://cloud.google.com/vision/docs/pdf


from google.cloud import storage
import re

def async_detect_document(gcs_source_uri, gcs_destination_uri):
    """OCR with PDF/TIFF as source files on GCS"""
    from google.cloud import vision
    mime_type = 'application/pdf'
    batch_size = 2
    client = vision.ImageAnnotatorClient()

    feature = vision.types.Feature(
        type=vision.enums.Feature.Type.DOCUMENT_TEXT_DETECTION)

    gcs_source = vision.types.GcsSource(uri=gcs_source_uri)
    input_config = vision.types.InputConfig(
        gcs_source=gcs_source, mime_type=mime_type)

    gcs_destination = vision.types.GcsDestination(uri=gcs_destination_uri)
    output_config = vision.types.OutputConfig(
        gcs_destination=gcs_destination, batch_size=batch_size)

    async_request = vision.types.AsyncAnnotateFileRequest(
        features=[feature], input_config=input_config,
        output_config=output_config)

    operation = client.async_batch_annotate_files(
        requests=[async_request])

    print('Waiting for the operation to finish.')
    operation.result(timeout=180)
    storage_client = storage.Client()

    match = re.match(r'gs://([^/]+)/(.+)', gcs_destination_uri)
    bucket_name = match.group(1)
    prefix = match.group(2)

    bucket = storage_client.get_bucket(bucket_name)

    # List objects with the given prefix.
    blob_list = list(bucket.list_blobs(prefix=prefix))
    print('Output files:')
    for blob in blob_list:
        print(blob.name)
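
The sample above stops at listing the output files, while the printing helper later in this question expects a parsed JSON dict. A minimal sketch of the missing step, assuming each output blob holds one JSON document with a top-level `responses` list; `load_annotations` is a hypothetical helper name, and `download_as_string()` is the google-cloud-storage `Blob` method available at the time of the question (newer releases also offer `download_as_bytes()`):

```python
import json


def load_annotations(blob):
    """Parse one OCR output blob into a plain dict.

    `blob` is any object with a download_as_string() method that returns
    the JSON payload as bytes, for example an entry of blob_list above.
    """
    # json.loads accepts bytes directly on Python 3.6 and later.
    return json.loads(blob.download_as_string())
```

With the listing above, `load_annotations(blob_list[0])` would give the dict for the first output file, which can then be passed to a function that walks `responses[0]['fullTextAnnotation']`.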

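The results need to be grouped differently: logically, the parts of the document's address should be grouped together. Here is the structure of one document (first 4 lines):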

-----------------------------------------------------------------
| Active              |  2415 ST PETER STREET |  $500,000 (LP) 
| R2222222            |    Port Moody Centre  |           (SP) 
| Board: V, Attached  |       Port Moody      |
| House/Single Family |         V3G 2T5       |
-----------------------------------------------------------------

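Running the sample provided by Google and printing the results along with the page, block, and paragraph numbers should show what is happening. The text at the top of the file ends up in the wrong place: the information shown above is spread across multiple paragraphs and starts at block 5 when it should start at block 0: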

********* Page Number: 0*************
********* Block Number: 0*************
********* Paragraph Number: 0*************
MAIN FLOOR 
BASEMENT 
TOTAL FINISHED AREA 
UNFINISHED" 
TOTAL AREA

[ snip ]

********* Block Number: 5*************
********* Paragraph Number: 10*************
Active 
2415 ST PETER STREET
********* Paragraph Number: 11*************
$500,000 (LP) 
R2222222 
Port Moody
********* Paragraph Number: 12*************
(SP) 
Board: V, Attached
********* Paragraph Number: 13*************
Port Moody Centre 
********* Paragraph Number: 15*************
V3G 2T5 
********* Paragraph Number: 16*************

[ SNIP ]

The output above was generated with this:

def annotate(_json):
    """Print the OCR text with running page, block, and paragraph numbers."""
    annotation = _json['responses'][0]['fullTextAnnotation']
    # The counters run across the whole response; they are not reset
    # when a new page or block starts.
    paranum = 0
    blcknum = 0
    pgenum = 0
    for page in annotation['pages']:
        print('********* Page Number: ' + str(pgenum) + '*************')
        pgenum += 1
        for block in page['blocks']:
            print('********* Block Number: ' + str(blcknum) + '*************')
            blcknum += 1
            for paragraph in block['paragraphs']:
                print('********* Paragraph Number: ' + str(paranum) + '*************')
                paranum += 1
                for word in paragraph['words']:
                    for symbol in word['symbols']:
                        print(symbol['text'], end='')
                        # Only symbols followed by a break carry detectedBreak.
                        bType = (symbol.get('property', {})
                                 .get('detectedBreak', {}).get('type'))
                        if bType == 'SPACE':
                            print(' ', end='')
                        elif bType == 'EOL_SURE_SPACE':
                            print(' ')
                        elif bType == 'LINE_BREAK':
                            print('')
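
For comparison, the grouping described in this question can be approximated after the fact by ignoring the block and paragraph grouping and re-bucketing words by their bounding boxes. This is a post-processing sketch, not a Vision API feature, and it is exactly the kind of intermediate step the question hopes to avoid. It assumes every word carries a `boundingBox` with `normalizedVertices` (or pixel `vertices`), and `column_edges` and `line_tolerance` are hypothetical, hand-chosen values for one known layout:

```python
def word_text(word):
    """Join the symbols of one word into a string."""
    return ''.join(symbol['text'] for symbol in word['symbols'])


def word_box(word):
    """Return (left, top, right, bottom) of a word's bounding box."""
    box = word['boundingBox']
    # Assumption: PDF/TIFF output carries normalizedVertices (0..1) and
    # image output carries pixel vertices; a missing coordinate means 0.
    vertices = box.get('normalizedVertices') or box['vertices']
    xs = [v.get('x', 0) for v in vertices]
    ys = [v.get('y', 0) for v in vertices]
    return min(xs), min(ys), max(xs), max(ys)


def lines_in_column(words, line_tolerance):
    """Turn (top, left, text) tuples into line strings, top to bottom."""
    lines = []
    for top, left, text in sorted(words):
        # A word joins the current line if its top edge is close to the
        # top edge of the first word on that line.
        if lines and abs(top - lines[-1]['top']) <= line_tolerance:
            lines[-1]['words'].append((left, text))
        else:
            lines.append({'top': top, 'words': [(left, text)]})
    return [' '.join(t for _, t in sorted(line['words'])) for line in lines]


def group_by_column(page, column_edges, line_tolerance):
    """Bucket every word on a page into columns split at column_edges."""
    columns = [[] for _ in range(len(column_edges) + 1)]
    for block in page['blocks']:
        for paragraph in block['paragraphs']:
            for word in paragraph['words']:
                left, top, right, _bottom = word_box(word)
                center_x = (left + right) / 2
                # Column index = number of edges left of the word's center.
                index = sum(center_x >= edge for edge in column_edges)
                columns[index].append((top, left, word_text(word)))
    return [lines_in_column(words, line_tolerance) for words in columns]
```

Called as `group_by_column(page, column_edges=[0.33, 0.66], line_tolerance=0.005)` on one entry of `annotation['pages']`, this returns one list of line strings per column. Words are joined with single spaces, so punctuation that the OCR tokenizes separately would come out spaced apart; honoring `detectedBreak` would be needed to reproduce the original spacing.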

0 answers:

No answers