Process for interacting with Blob storage files in a Databricks notebook

Time: 2020-10-12 08:09:14

Tags: python azure-storage-blobs databricks

In an Azure Databricks notebook, I am trying to run a transformation on some csvs in blob storage with the following commands:

    import os
    import glob
    import pandas as pd

    os.chdir(r'wasbs://dalefactorystorage.blob.core.windows.net/dale')
    allFiles = glob.glob("*.csv")  # match your csvs
    for file in allFiles:
        df = pd.read_csv(file)
        df = df.iloc[4:,]  # read from row 4 onwards.
        df.to_csv(file)
        print(f"{file} has removed rows 0-3")

Unfortunately, I get the following error:

    FileNotFoundError: [Errno 2] No such file or directory: 'wasbs://dalefactorystorage.blob.core.windows.net/dale'

Am I missing something? (I am completely new to this.)

Cheers

Dale

1 Answer:

Answer 0 (score: 0):

The error happens because `os.chdir` and pandas only understand local file paths, so a `wasbs://` URL cannot be resolved. If you want to use the pandas package in Azure Databricks to read a CSV file from Azure blob storage, process it, and write the CSV file back to Azure blob storage, I suggest you mount the Azure blob storage as a Databricks file system and then work through the mount point. For more details, please refer to here.

For example:

  1. Mount the Azure blob storage
dbutils.fs.mount(
  source = "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net",
  mount_point = "/mnt/<mount-name>",
  extra_configs = {"fs.azure.account.key.<storage-account-name>.blob.core.windows.net":"<account access key>"})
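
Once mounted, the container's contents appear under DBFS. As a quick sanity check (a sketch assuming the same `<mount-name>` placeholder as above), you can list the mounted files:

dbutils.fs.ls("/mnt/<mount-name>")  # returns FileInfo entries for the mounted container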


  2. Process the csvs
import os
import glob
import pandas as pd

os.chdir(r'/dbfs/mnt/<mount-name>/<>')
allFiles = glob.glob("*.csv")  # match your csvs
for file in allFiles:
    print(f"The old content of file {file}:")
    df = pd.read_csv(file, header=None)
    print(df)
    df = df.iloc[4:, :]  # keep rows from row 4 onwards
    df.to_csv(file, index=False, header=False)
    print(f"The new content of file {file}:")
    df = pd.read_csv(file, header=None)
    print(df)
    break  # stop after the first file, for demonstration
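
When you are finished, the container can be detached from DBFS again; a minimal sketch, assuming the same mount point as above:

dbutils.fs.unmount("/mnt/<mount-name>")  # detach the blob container from DBFS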
