Process for interacting with Blob storage files in a Databricks notebook

Time: 2020-10-12 08:09:14

Tags: python azure-storage-blobs databricks

In an Azure Databricks notebook, I am trying to run a transformation on some csvs in blob storage with the following commands:

    import os
    import glob
    import pandas as pd

    os.chdir(r'wasbs://dalefactorystorage.blob.core.windows.net/dale')
    allFiles = glob.glob("*.csv")  # match your csvs
    for file in allFiles:
        df = pd.read_csv(file)
        df = df.iloc[4:,]  # read from row 4 onwards.
        df.to_csv(file)
        print(f"{file} has removed rows 0-3")

Unfortunately, I get the following error:

    FileNotFoundError: [Errno 2] No such file or directory: 'wasbs://dalefactorystorage.blob.core.windows.net/dale'

Am I missing something? (I am completely new to this.)

Cheers

Dale

1 Answer:

Answer 0 (score: 0):

The error happens because `os.chdir` and pandas only understand local file paths, so a `wasbs://` URL cannot be resolved. If you want to use the pandas package in Azure Databricks to read a CSV file from Azure blob storage, process it, and write the CSV file back to Azure blob storage, I suggest you mount the Azure blob storage as a Databricks file system and then work through the mount point. For more details, please refer to here.

For example:

  1. Mount the Azure blob storage
dbutils.fs.mount(
  source = "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net",
  mount_point = "/mnt/<mount-name>",
  extra_configs = {"fs.azure.account.key.<storage-account-name>.blob.core.windows.net":"<account access key>"})
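
Once mounted, the container's contents appear under DBFS. As a quick sanity check (a sketch assuming the same `<mount-name>` placeholder as above), you can list the mounted files:

dbutils.fs.ls("/mnt/<mount-name>")  # returns FileInfo entries for the mounted container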


  2. Process the csvs
import os
import glob
import pandas as pd

os.chdir(r'/dbfs/mnt/<mount-name>/<>')
allFiles = glob.glob("*.csv")  # match your csvs
for file in allFiles:
    print(f"The old content of file {file}:")
    df = pd.read_csv(file, header=None)
    print(df)
    df = df.iloc[4:, :]  # keep rows from row 4 onwards
    df.to_csv(file, index=False, header=False)
    print(f"The new content of file {file}:")
    df = pd.read_csv(file, header=None)
    print(df)
    break  # stop after the first file, for demonstration
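
When you are finished, the container can be detached from DBFS again; a minimal sketch, assuming the same mount point as above:

dbutils.fs.unmount("/mnt/<mount-name>")  # detach the blob container from DBFS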
