Question

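I am trying to achieve the same thing as the post "Spark dataframe save in single file on hdfs location", except that my file is in Azure Data Lake Gen2 and I am using pyspark in a Databricks notebook.

Below is the code snippet I am using to rename the file: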

from py4j.java_gateway import java_import
java_import(spark._jvm, 'org.apache.hadoop.fs.Path')

destpath = "abfss://" + contianer + "@" + storageacct + ".dfs.core.windows.net/"
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
file = fs.globStatus(sc._jvm.Path(destpath+'part*'))[0].getPath().getName()
#Rename the file

I get IndexError: list index out of range on this line:

file = fs.globStatus(sc._jvm.Path(destpath+'part*'))[0].getPath().getName()

The part* files do exist in the folder.

1) Is this the right way to rename a file that Databricks (pyspark) writes to Azure Data Lake Gen2? If not, how else can I do it?

Answer 1

I was able to solve this by installing the azure.storage.filedatalake client library in my Databricks notebook. Using the FileSystemClient and DataLakeFileClient classes, I was able to rename the file in Data Lake Gen2.
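
Here is a minimal sketch of what that rename could look like, not the exact code I ran. The helper name rename_part_file and its directory and new_name parameters are made up for illustration. It assumes fs_client is a FileSystemClient for the container and that the output folder holds exactly one part* file (for example after a coalesce(1) write).

```python
import posixpath


def rename_part_file(fs_client, directory, new_name):
    """Rename the single part* file under `directory` to `new_name`.

    `fs_client` is assumed to be an azure.storage.filedatalake.FileSystemClient
    (hypothetical helper, for illustration only).
    """
    # get_paths yields entries whose .name is the path relative to the
    # file system (container) root.
    parts = [
        p.name
        for p in fs_client.get_paths(path=directory or None, recursive=False)
        if not p.is_directory and posixpath.basename(p.name).startswith("part")
    ]
    if len(parts) != 1:
        raise ValueError(
            f"expected exactly one part file in {directory!r}, found {len(parts)}"
        )

    # DataLakeFileClient.rename_file expects "<file system>/<path>".
    target = posixpath.join(directory, new_name)
    file_client = fs_client.get_file_client(parts[0])
    file_client.rename_file(f"{fs_client.file_system_name}/{target}")
    return target
```

The client itself can be obtained with DataLakeServiceClient(account_url=..., credential=...).get_file_system_client(contianer), using the account's https://<storageacct>.dfs.core.windows.net URL and whatever credential your workspace uses. The library is installed from PyPI as azure-storage-file-datalake.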
