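In another related question, I asked how to upload a file from a local machine to Microsoft Azure Data Lake Gen 2, and received an answer that uses the REST API. For completeness, the suggested code can be found below.
Because this sequential upload proved relatively slow for a large number of relatively small files (0.05 MB each), I would like to ask whether it is possible to bulk-upload all files at once, given that all file paths are known in advance?
Code for uploading a single file to ADLS Gen 2 using the REST API: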
import requests
import json


def auth(tenant_id, client_id, client_secret):
    print('auth')
    auth_headers = {
        "Content-Type": "application/x-www-form-urlencoded"
    }
    auth_body = {
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": "https://storage.azure.com/.default",
        "grant_type": "client_credentials"
    }
    resp = requests.post(f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token", headers=auth_headers, data=auth_body)
    return (resp.status_code, json.loads(resp.text))


def mkfs(account_name, fs_name, access_token):
    print('mkfs')
    fs_headers = {
        "Authorization": f"Bearer {access_token}"
    }
    resp = requests.put(f"https://{account_name}.dfs.core.windows.net/{fs_name}?resource=filesystem", headers=fs_headers)
    return (resp.status_code, resp.text)


def mkdir(account_name, fs_name, dir_name, access_token):
    print('mkdir')
    dir_headers = {
        "Authorization": f"Bearer {access_token}"
    }
    resp = requests.put(f"https://{account_name}.dfs.core.windows.net/{fs_name}/{dir_name}?resource=directory", headers=dir_headers)
    return (resp.status_code, resp.text)


def touch_file(account_name, fs_name, dir_name, file_name, access_token):
    print('touch_file')
    touch_file_headers = {
        "Authorization": f"Bearer {access_token}"
    }
    resp = requests.put(f"https://{account_name}.dfs.core.windows.net/{fs_name}/{dir_name}/{file_name}?resource=file", headers=touch_file_headers)
    return (resp.status_code, resp.text)


def append_file(account_name, fs_name, path, content, position, access_token):
    print('append_file')
    append_file_headers = {
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "text/plain",
        "Content-Length": f"{len(content)}"
    }
    resp = requests.patch(f"https://{account_name}.dfs.core.windows.net/{fs_name}/{path}?action=append&position={position}", headers=append_file_headers, data=content)
    return (resp.status_code, resp.text)


def flush_file(account_name, fs_name, path, position, access_token):
    print('flush_file')
    flush_file_headers = {
        "Authorization": f"Bearer {access_token}"
    }
    resp = requests.patch(f"https://{account_name}.dfs.core.windows.net/{fs_name}/{path}?action=flush&position={position}", headers=flush_file_headers)
    return (resp.status_code, resp.text)


def mkfile(account_name, fs_name, dir_name, file_name, local_file_name, access_token):
    print('mkfile')
    status_code, result = touch_file(account_name, fs_name, dir_name, file_name, access_token)
    if status_code == 201:
        with open(local_file_name, 'rb') as local_file:
            path = f"{dir_name}/{file_name}"
            content = local_file.read()
            position = 0
            append_file(account_name, fs_name, path, content, position, access_token)
            position = len(content)
            flush_file(account_name, fs_name, path, position, access_token)
    else:
        print(result)


if __name__ == '__main__':
    tenant_id = '<your tenant id>'
    client_id = '<your client id>'
    client_secret = '<your client secret>'

    account_name = '<your adls account name>'
    fs_name = '<your filesystem name>'
    dir_name = '<your directory name>'
    file_name = '<your file name>'
    local_file_name = '<your local file name>'

    # Acquire an Access token
    auth_status_code, auth_result = auth(tenant_id, client_id, client_secret)
    access_token = auth_status_code == 200 and auth_result['access_token'] or ''
    print(access_token)

    # Create a filesystem
    mkfs_status_code, mkfs_result = mkfs(account_name, fs_name, access_token)
    print(mkfs_status_code, mkfs_result)

    # Create a directory
    mkdir_status_code, mkdir_result = mkdir(account_name, fs_name, dir_name, access_token)
    print(mkdir_status_code, mkdir_result)

    # Create a file from local file
    mkfile(account_name, fs_name, dir_name, file_name, local_file_name, access_token)
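Since all paths are known in advance, one option short of a true bulk API is to overlap the per-file requests with a thread pool. The following is only a minimal sketch, not part of the original code: `upload_many`, `upload_one`, `jobs` and `max_workers` are hypothetical names, and it assumes the per-file upload is I/O bound and safe to call from several threads at once.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def upload_many(upload_one, jobs, max_workers=8):
    """Call upload_one(*job) for every job concurrently.

    jobs is a list of argument tuples (hypothetical shape, e.g.
    (file_name, local_file_name)). Returns a dict mapping each job
    to its return value, or to the exception it raised.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every upload up front; the pool limits how many run at once.
        futures = {pool.submit(upload_one, *job): job for job in jobs}
        for future in as_completed(futures):
            job = futures[future]
            try:
                results[job] = future.result()
            except Exception as exc:
                # Record the failure so one bad file does not stop the rest.
                results[job] = exc
    return results
```

With the functions above, `upload_one` could be a small wrapper such as `lambda file_name, local_file_name: mkfile(account_name, fs_name, dir_name, file_name, local_file_name, access_token)`. This still issues three requests per file, so it reduces waiting time rather than the number of calls.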
Answer 0 (score: 1)
By far the fastest way to upload a large number of files to ADLS Gen2 is AzCopy. You can write Python code that calls AzCopy.
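First, download AzCopy.exe via this link. After downloading, unzip the archive and copy azcopy.exe to a folder (no installation is needed, it is a single executable), for example F:\azcopy\v10\azcopy.exe.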
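Then generate a SAS token from the Azure portal, and copy and save it.
This assumes you have already created a file system for the ADLS Gen2 account. You do not need to create directories manually; AzCopy creates them automatically.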
One more thing to note: for the endpoint, you should change dfs to blob, i.e. change https://youraccount.dfs.core.windows.net/ to https://youraccount.blob.core.windows.net/
The sample code is as follows:
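The steps described above (executable path, SAS token, dfs-to-blob endpoint) can be sketched roughly as below. This is an illustration under stated assumptions, not the answerer's original sample: the helper names `to_blob_endpoint`, `build_azcopy_command` and `run_azcopy` are hypothetical, and the only AzCopy syntax relied on is `azcopy copy <source> <destination> --recursive`.

```python
import subprocess


def to_blob_endpoint(url):
    """Swap the dfs host suffix for the blob one, as described above."""
    return url.replace(".dfs.core.windows.net", ".blob.core.windows.net", 1)


def build_azcopy_command(exe_path, local_dir, endpoint, sas_token):
    """Build the argument list for: azcopy copy <source> <destination> --recursive."""
    # The SAS token is appended to the destination URL as its query string.
    sas = sas_token if sas_token.startswith("?") else "?" + sas_token
    destination = to_blob_endpoint(endpoint).rstrip("/") + sas
    return [exe_path, "copy", local_dir, destination, "--recursive"]


def run_azcopy(command):
    """Run AzCopy and return its exit code (0 means success)."""
    return subprocess.run(command).returncode
```

A call might then look like `run_azcopy(build_azcopy_command(r"F:\azcopy\v10\azcopy.exe", r"<your local directory>", "https://youraccount.dfs.core.windows.net/<your filesystem name>", "<your sas token>"))`. Passing the arguments as a list avoids quoting problems with paths and with the `&` characters inside the SAS token.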
In testing, all files and subfolders in the local directory were uploaded to ADLS Gen2.