Accessing ADLS data from a Python script in VS Code

Time: 2019-03-19 15:46:04

Tags: python azure visual-studio-code azure-data-lake

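I have installed the ADL extension for VS Code, and I am now writing a Python script in which I need to read a CSV file stored in Azure Data Lake Storage (ADLS Gen1). For local files, the following code works: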

from pathlib import Path
import pandas as pd

df = pd.read_csv(Path('C:\\Users\\Documents\\breslow.csv'))
print(df)

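How can I read the data from ADLS? Even though the ADL extension is installed and connected successfully (using my Azure account), do I still need to create a scope, secrets and everything else?

2 Answers:

Answer 0 (score: 0)

Here is sample code for reading a CSV file from ADLS.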

# -*- coding: utf-8 -*-
"""
Created on Wed Mar 20 11:37:19 2019

@author: Mohit Verma
"""

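from azure.datalake.store import core, lib, multithread

# Placeholders: replace these with your own values before running
tenant_id = '<your Azure AD tenant id>'
username = '<your username in AAD>'
password = '<your password>'
store_name = '<your ADL name>'

token = lib.auth(tenant_id, username, password)
adl = core.AzureDLFileSystem(token, store_name=store_name)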

# typical operations
adl.ls('')
adl.ls('tmp/', detail=True)
adl.ls('tmp/', detail=True, invalidate_cache=True)
adl.cat('littlefile')
adl.head('gdelt20150827.csv')

# file-like object
with adl.open('gdelt20150827.csv', blocksize=2**20) as f:
    print(f.readline())
    print(f.readline())
    print(f.readline())
    # could have passed f to any function requiring a file object:
    # pandas.read_csv(f)

with adl.open('anewfile', 'wb') as f:
    # data is written on flush/close, or when buffer is bigger than
    # blocksize
    f.write(b'important data')

adl.du('anewfile')

# recursively download the whole directory tree with 5 threads and
# 16MB chunks
multithread.ADLDownloader(adl, "", 'my_temp_dir', 5, 2**24)

Please try this code and see if it helps. For other Azure Data Lake examples, see the GitHub repository below.

https://github.com/Azure/azure-data-lake-store-python/tree/master/azure

If you want to learn about the other authentication types in ADLS, check the following code sample.

https://github.com/Azure-Samples/data-lake-analytics-python-auth-options/blob/master/sample.py
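The comment in the code above notes that the opened file can be passed to any function that takes a file object, such as pandas.read_csv. A minimal sketch of that idea follows. The helper name read_csv_from_store and the InMemoryStore class are made up for illustration; InMemoryStore only stands in for AzureDLFileSystem so the snippet runs without an Azure account, and in real use you would pass the adl object instead.

```python
import io

import pandas as pd


def read_csv_from_store(fs, path, **read_csv_kwargs):
    """Read a CSV into a DataFrame from any object with an open(path, mode) method."""
    # The with-block closes the handle even if parsing raises an error.
    with fs.open(path, "rb") as f:
        return pd.read_csv(f, **read_csv_kwargs)


class InMemoryStore:
    """Hypothetical stand-in for AzureDLFileSystem so the sketch runs offline."""

    def __init__(self, files):
        self._files = files

    def open(self, path, mode="rb"):
        # Return a binary file-like object, as adl.open(path, 'rb') would.
        return io.BytesIO(self._files[path])


store = InMemoryStore({"data/test.csv": b"id,name\n1,a\n2,b\n"})
df = read_csv_from_store(store, "data/test.csv")
print(df)
```

With a real connection, the call would look like read_csv_from_store(adl, 'tmp/myfile.csv'), where the path is whatever exists in your store.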

Answer 1 (score: 0)

I wrote some sample code to read data from a CSV file in Azure Data Lake into a pandas DataFrame.

Here is my sample code.

from azure.datalake.store import core, lib, multithread
import pandas as pd

tenant_id = '<your Azure AD tenant id>'
username = '<your username in AAD>'
password = '<your password>'
store_name = '<your ADL name>'
token = lib.auth(tenant_id, username, password)
# Alternatively, register an app in Azure AD and use its client_id and client_secret to get the token.
# If this code will run inside an application, I recommend authenticating as the app (client):
# client_id = '<client id of your app registered in Azure AD, like xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx>'
# client_secret = '<your client secret>'
# token = lib.auth(tenant_id, client_id=client_id, client_secret=client_secret)

adl = core.AzureDLFileSystem(token, store_name=store_name)
# Open the file in a with-block so the handle is closed after reading
with adl.open('<your csv file path, such as data/test.csv in my ADL>') as f:
    df = pd.read_csv(f)

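Note: If you authenticate with client_id and client_secret, you must grant the app the necessary access permissions in Azure AD, with at least the Reader role, as shown in the figure below. For more about access security, see the official documentation, Security in Azure Data Lake Storage Gen1. For how to register an app in Azure AD, see my answer to another SO thread, How to get an AzureRateCard with Java?.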
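For the client_id / client_secret route, one possible way to keep the secret out of the script is to read it from environment variables. The sketch below assumes that approach; the variable names (ADLS_TENANT_ID and so on) and the helper names are hypothetical, not something the SDK requires. Only lib.auth and core.AzureDLFileSystem come from the code above.

```python
import os

# Hypothetical environment variable names; pick whatever suits your setup.
REQUIRED_VARS = (
    "ADLS_TENANT_ID",
    "ADLS_CLIENT_ID",
    "ADLS_CLIENT_SECRET",
    "ADLS_STORE_NAME",
)


def load_adls_settings(env=None):
    """Collect the app's settings, failing early if any value is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ValueError("missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}


def connect(settings):
    """Authenticate as the registered app and return an AzureDLFileSystem."""
    # Imported lazily so load_adls_settings works without the SDK installed.
    from azure.datalake.store import core, lib

    token = lib.auth(
        tenant_id=settings["ADLS_TENANT_ID"],
        client_id=settings["ADLS_CLIENT_ID"],
        client_secret=settings["ADLS_CLIENT_SECRET"],
    )
    return core.AzureDLFileSystem(token, store_name=settings["ADLS_STORE_NAME"])
```

Usage would be adl = connect(load_adls_settings()), after which adl.open(...) works as in the sample code. The app still needs the access permissions described in the note above.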

[Screenshots not available]

If you have any concerns, please feel free to let me know.