Question

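Summary of the problem:

I am trying to process multiple PDFs with an OCR program written in Python. During local development the PDF files sit in a local directory that the program can read, but I can't figure out how to get a file-system-like path for them in Blob storage. Technically I know that Blob storage has no such file system, but my OCR program needs to be given a path like that. Is there any way to do this?

What I have tried:

Currently I have the following code in azure.py to connect to the container and its blobs: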

from azure.storage.blob import BlobServiceClient

# List the PDF blobs under a virtual "folder" (a name prefix) in the container
def ls_files(client, path, recursive=False):
    if path and not path.endswith('/'):
        path += '/'

    files = []
    for blob in client.list_blobs(name_starts_with=path):
        # blob.name is the full blob name; strip the prefix to get a relative name
        relative_path = blob.name[len(path):]
        # without recursive, skip blobs that sit in nested "folders"
        if (recursive or '/' not in relative_path) and relative_path.endswith('.pdf'):
            files.append(relative_path)
    return files

# connection string to the storage account
connect_str = '<connection string>'
# same container but different folders for inputs and outputs 
container_name = 'ocr'

blob_service_client = BlobServiceClient.from_connection_string(connect_str)
client = blob_service_client.get_container_client(container_name)

input_files = ls_files(client, '', recursive=True) # names of the input PDF blobs

for file in input_files:
    ############################
    # kick off OCR program here#
    ############################
    print('Processing ...', file, '\n')

In main.py:

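# Note: a local module named azure.py shadows the installed azure package,
# so renaming it (e.g. to azure_utils.py) avoids import errors.
import azure as az

# previously a local path like '/Users/xyz/path/to/local/dir'; now a list of blob names
input_directory = az.input_files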

# do regular OCR processing next

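When I run the script, Python does not recognize the files or paths in Blob storage. Is there a way to achieve this? Thanks in advance.

Edit 1:

I came across this sample code, but I'm afraid it targets an older version of the Python SDK rather than V12. I have also been looking through the official repo, to no avail.

Edit 2:

OK. I opened a ticket here to ask the MSFT team for help, and I will update this once I know more. The workarounds are 1) download the file as an in-memory stream, or 2) create a temporary file in Python to use as a placeholder. Any suggestions are welcome.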
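
The two workarounds above could look roughly like the sketch below. It assumes `container_client` is a V12 `ContainerClient` (such as `client` in azure.py), whose `download_blob(...)` returns a downloader with a `readall()` method. The helper names and `ocr_func` are hypothetical: `ocr_func` stands in for whatever path-based entry point the OCR program exposes.

```python
import io
import os
import tempfile


def blob_to_stream(container_client, blob_name):
    """Workaround 1: download a blob into an in-memory, file-like object."""
    data = container_client.download_blob(blob_name).readall()
    return io.BytesIO(data)


def process_blob_via_temp_file(container_client, blob_name, ocr_func):
    """Workaround 2: hand a path-based OCR function a real temporary file."""
    # Keep the blob's extension so the OCR code still sees a ".pdf" path
    suffix = os.path.splitext(blob_name)[1] or '.pdf'
    # delete=False lets the file be reopened by name (needed on Windows)
    tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        with tmp:
            tmp.write(container_client.download_blob(blob_name).readall())
        # ocr_func is a hypothetical callable that takes a local file path
        return ocr_func(tmp.name)
    finally:
        os.remove(tmp.name)
```

The first helper only helps if the OCR library accepts file-like objects; the second works with any code that insists on a path, at the cost of briefly writing the PDF to local disk.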

Answer 1

Instead of Azure Storage blobs, you could use an Azure Storage file share, or you could use the Azure Cognitive Services Computer Vision API for OCR: https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/concept-recognizing-text
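
As a rough illustration of the file share option: once an Azure file share is mounted on the machine (for example over SMB), its contents appear under an ordinary directory, so the OCR program can be given real paths. The sketch below assumes such a mount already exists; `list_pdfs` and the mount point in the usage comment are hypothetical.

```python
from pathlib import Path


def list_pdfs(mount_point, recursive=True):
    """Return the paths of the PDF files under a mounted directory."""
    root = Path(mount_point)
    pattern = '**/*.pdf' if recursive else '*.pdf'
    return sorted(str(p) for p in root.glob(pattern) if p.is_file())


# Usage (hypothetical mount point):
# for pdf_path in list_pdfs('/mnt/ocr-share/input'):
#     print('Processing ...', pdf_path)
```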

Is there a way to process PDF files in Blob storage with Python without downloading them locally?
