Question

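I want to read a bunch of small files from an Azure blob. This could be on the order of 1k-100k files, adding up to a few TB in total. I have to process these files in Python. The processing itself is not heavy, but reading the files from the blob does take time. One other constraint is that new files are being written while I am still processing the first ones.

I am looking for options to do this. Is it possible to use dask to read multiple files from the blob in parallel? Or is it possible to transfer and load more than 1 TB per hour within the Azure network?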

Answer 1

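Well, you have a couple of options to achieve parallelism here:

Multi-threading:

The code below uses the ThreadPool class in Python to download and process files from Azure Storage in parallel. Note: this uses the v12 storage SDK.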

from multiprocessing.pool import ThreadPool
from azure.storage.blob import BlobServiceClient

STORAGE_CONNECTION_STRING = "REPLACE_THIS"
BLOB_CONTAINER = "myfiles"

class AzureBlobProcessor:
  def __init__(self):
    # Initialize the service client and a client for the target container
    self.blob_service_client = BlobServiceClient.from_connection_string(STORAGE_CONNECTION_STRING)
    self.blob_container = self.blob_service_client.get_container_client(BLOB_CONTAINER)

  def process_all_blobs_in_container(self):
    # Get a list of blobs and return the names that were processed
    blobs = self.blob_container.list_blobs()
    return self.execute(blobs)

  def execute(self, blobs):
    # 10 threads is just a sample value
    with ThreadPool(processes=10) as pool:
      return pool.map(self.download_and_process_blob, blobs)

  def download_and_process_blob(self, blob):
    file_name = blob.name

    # Below is just a sample that reads bytes; update to the variant you need
    data = self.blob_container.get_blob_client(blob).download_blob().readall()

    # processing logic goes here :)

    return file_name

# caller code
azure_blob_processor = AzureBlobProcessor()
processed_names = azure_blob_processor.process_all_blobs_in_container()
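
The same pattern can also be written with concurrent.futures, so that each file is handled as soon as its download finishes rather than after the whole batch. This is only a minimal sketch and not part of the SDK: `process_in_parallel`, `download`, and `process` are hypothetical names, and `download` stands in for whatever actually fetches the bytes (for example the `download_blob().readall()` call above).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_in_parallel(names, download, process, max_workers=10):
    """Download each name on a worker thread and process it once it arrives.

    `download(name) -> bytes` and `process(name, data)` are supplied by the
    caller (hypothetical placeholders for the real blob download and logic).
    Returns a dict mapping each name to the result of `process`.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit all downloads up front; the pool runs max_workers at a time.
        futures = {pool.submit(download, name): name for name in names}
        for future in as_completed(futures):
            name = futures[future]
            data = future.result()  # re-raises any download error here
            results[name] = process(name, data)  # runs on the calling thread
    return results


if __name__ == "__main__":
    # Stand-in download function so the sketch runs without Azure access.
    fake_store = {"a.txt": b"hello", "b.txt": b"hi"}
    sizes = process_in_parallel(list(fake_store), fake_store.get, lambda n, d: len(d))
    print(sizes)
```

Since the downloads are I/O bound, threads are usually enough here; the thread count of 10 is just a starting value to tune.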

You can also look at Dask remote data reads. See https://github.com/dask/adlfs

To use the Gen1 filesystem:

import dask.dataframe as dd

storage_options={'tenant_id': TENANT_ID, 'client_id': CLIENT_ID, 'client_secret': CLIENT_SECRET}

dd.read_csv('adl://{STORE_NAME}/{FOLDER}/*.csv', storage_options=storage_options)

To use the Gen2 filesystem, you can use the protocol abfs or az:

import dask.dataframe as dd

storage_options={'account_name': ACCOUNT_NAME, 'account_key': ACCOUNT_KEY}

ddf = dd.read_csv('abfs://{CONTAINER}/{FOLDER}/*.csv', storage_options=storage_options)
ddf = dd.read_parquet('az://{CONTAINER}/folder.parquet', storage_options=storage_options)

To read from a public storage blob, you are required to specify the 'account_name'. For example, you can access the NYC Taxi & Limousine Commission data as:

storage_options = {'account_name': 'azureopendatastorage'}
ddf = dd.read_parquet('az://nyctlc/green/puYear=2019/puMonth=*/*.parquet', storage_options=storage_options)

Parallelism using Azure PaaS:

Well, you have multiple options on this path.

Azure Batch: Tutorial: Run a parallel workload with Azure Batch using the Python API
Azure Functions with a Blob Event Grid trigger: Azure Event Grid bindings for Azure Functions
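
If an event-driven trigger is more than you need, a plain polling loop is a simpler (and swapped-in, not Azure-specific) way to pick up files that are written while earlier ones are still being processed. The sketch below is an assumption-laden illustration: `poll_new_names`, `list_names`, and `handle` are hypothetical names, and with the v12 SDK `list_names` could wrap a `list_blobs()` call and yield each blob's name.

```python
import time


def poll_new_names(list_names, handle, interval_s=30.0, max_rounds=None):
    """Repeatedly list names and pass each unseen one to `handle` exactly once.

    `list_names() -> iterable of str` and `handle(name)` are hypothetical
    callables supplied by the caller.
    `max_rounds=None` polls forever; a number stops after that many listings.
    Returns the set of names that were handled.
    """
    seen = set()
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        for name in list_names():
            if name not in seen:
                seen.add(name)
                handle(name)
        rounds += 1
        # Sleep only if another listing round will follow.
        if max_rounds is None or rounds < max_rounds:
            time.sleep(interval_s)
    return seen
```

The `seen` set lives in memory only, so a restart would reprocess everything; persisting it somewhere is left out of this sketch.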

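Finally, I suggest you dig into the Performance and scalability checklist for Blob storage to make sure you stay within the data transfer limits of an Azure Storage account. See also Scalability and performance targets for standard storage accounts. Looking at your requirement of 1 TB per hour, it appears to be under the limit once you convert it to Gbps and compare it with the docs above.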
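
For the comparison against the documented targets, the conversion from TB per hour to Gbps is simple arithmetic. A small sketch (the function name `tb_per_hour_to_gbps` is just illustrative, and it assumes a sustained, evenly spread transfer):

```python
def tb_per_hour_to_gbps(tb_per_hour, bytes_per_tb=10**12):
    """Convert a sustained rate in TB/hour to gigabits per second (10**9 bits)."""
    bits_per_hour = tb_per_hour * bytes_per_tb * 8
    return bits_per_hour / 3600 / 10**9


print(f"{tb_per_hour_to_gbps(1):.2f} Gbps")  # decimal TB: prints "2.22 Gbps"
print(f"{tb_per_hour_to_gbps(1, bytes_per_tb=2**40):.2f} Gbps")  # TiB: prints "2.44 Gbps"
```

So 1 TB per hour is roughly 2.2 Gbps of sustained throughput, which is the figure to check against the account limits in the linked documents.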
