Spark RDD.foreach() fails to upload files to Azure Blob Storage with BlobClient - authentication error

Date: 2019-11-07 22:36:04

Tags: pyspark azure-blob-storage

I'm stuck on a very strange problem. I have a Spark RDD (key/value pairs) and want to store each entry of the RDD as a separate file on Azure Blob Storage. The code is as follows:

from azure.storage.blob import (
    BlobServiceClient,
    BlobClient
)

def save_blob(kv):
    # Everything the function needs is defined inside it, so the closure
    # Spark serializes and ships to the workers is fully self-contained.
    CONNECTION_STRING = 'DefaultEndpointsProtocol=https;AccountName=example-account;AccountKey=very-very-secret;EndpointSuffix=core.windows.net'
    CLIENT = 'client'
    INDEX_DIR = f'{CLIENT}/index/'
    CONTAINER_NAME = 'rawdata'

    # One client per record; the key becomes the blob name.
    blob_service_client = BlobServiceClient.from_connection_string(CONNECTION_STRING)
    blob_client = blob_service_client.get_blob_client(CONTAINER_NAME, f'{INDEX_DIR}{kv[0]}.json')
    return blob_client.upload_blob(kv[1], blob_type="BlockBlob")

I have moved all the variables into the save_blob() function as raw strings, to be absolutely sure the problem is not a variable that doesn't exist on the workers. If I understand correctly, every value should be set right there.
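For reference, the per-partition form of the same idea, which only moves client creation out of the per-record loop, would look like the sketch below (foreachPartition is the standard Spark idiom for this; the connection string is the same placeholder as above, and this is not a fix for the error, just the idiom):

from azure.storage.blob import BlobServiceClient

def save_partition(entries):
    # Same self-contained setup as save_blob(), but one client per
    # partition instead of one per record.
    CONNECTION_STRING = 'DefaultEndpointsProtocol=https;AccountName=example-account;AccountKey=very-very-secret;EndpointSuffix=core.windows.net'
    CONTAINER_NAME = 'rawdata'
    INDEX_DIR = 'client/index/'

    blob_service_client = BlobServiceClient.from_connection_string(CONNECTION_STRING)
    for key, value in entries:
        blob_client = blob_service_client.get_blob_client(CONTAINER_NAME, f'{INDEX_DIR}{key}.json')
        blob_client.upload_blob(value, blob_type="BlockBlob")

# my_rdd.foreachPartition(save_partition)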

Executing the following works perfectly fine and creates a blob file:

kv = (0, '{"id": "whatever"}')
save_blob(kv)

However, as soon as I run it on the RDD, whose entries have the shape Tuple[int, Dict[str, Dict[str, Any]]], with the following code:

my_rdd.foreach(lambda kv: save_blob(kv))

I get an authentication error of the following form:

azure.storage.blob._generated.models._models_py3.StorageErrorException: Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.
....
azure.core.exceptions.ClientAuthenticationError: Server failed to authenticate the request. 
Make sure the value of Authorization header is formed correctly including the signature.
RequestId:5a6d54c5-701e-00fa-46b8-953d31000000
Time:2019-11-07T22:11:57.3636552Z
ErrorCode:AuthenticationFailed
Error:None
AuthenticationErrorDetail:The MAC signature found in the HTTP request '0p+pmV5/FyugC0RJKnKwn+DwBOMEFXJ6BuGMiVNU38k=' is not the same as any computed signature. Server used following string to sign: 'PUT


971

application/octet-stream



*


x-ms-blob-type:BlockBlob
x-ms-client-request-id:9cffe814-01ab-11ea-aa40-00163e786b5f
x-ms-date:Thu, 07 Nov 2019 22:11:57 GMT
x-ms-version:2019-02-02
/example-container/path/client_20d%2Findex%2F0.json'.

Does anyone have an idea why I seem to be able to run this code fine as long as it is not executed inside the RDD? I'm at a loss...
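One difference between the working local call and the RDD run that might matter: in the local test kv[1] is already a str, while in the RDD it is a nested dict, and upload_blob() accepts str, bytes, or a stream, not a dict. A defensive sketch that normalizes the payload first (the json.dumps serialization is an assumption about what should end up in the blob):

import json

def save_blob_serialized(kv):
    # Wraps save_blob() from above; serializes dict values to a JSON
    # string so upload_blob() always receives str data.
    key, value = kv
    payload = value if isinstance(value, str) else json.dumps(value)
    return save_blob((key, payload))

# my_rdd.foreach(save_blob_serialized)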

Thanks! - Tom

PS: Converting the RDD to a DataFrame with rdd.toDF(["column1", "column2"]) and then writing it with df.write.partitionBy("column1").json(...) does not work, because the nested dictionaries do not seem to get converted correctly.
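A sketch of that DataFrame route with the nested dicts pre-serialized to JSON strings (json.dumps and the switch from .json() to .text() are assumptions, since each value is already a complete JSON document once serialized; the output path is hypothetical because the original elides it):

import json

# Pre-serialize the nested dicts so toDF() only sees (int, str) tuples.
string_rdd = my_rdd.map(lambda kv: (kv[0], json.dumps(kv[1])))
df = string_rdd.toDF(["column1", "column2"])

# partitionBy() removes column1 from the data files, leaving a single
# string column, which the text writer can handle.
output_path = 'wasbs://rawdata@example-account.blob.core.windows.net/client/index/'  # hypothetical
df.write.partitionBy("column1").text(output_path)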

0 Answers:

No answers yet.