Converting Scala code that renames and moves a CSV file to Python - Spark - PySpark

Date: 2019-12-16 16:28:39

Tags: python scala apache-spark pyspark azure-databricks

I am using the Scala code below to rename a CSV file to a TXT file and then move that TXT file. I need to convert this code to Python/PySpark, but I am running into problems (I am not proficient in Python). Any help would be greatly appreciated. Thanks in advance!

//Prepare to rename file
import org.apache.hadoop.fs._
import org.apache.hadoop.fs.{FileSystem, Path}
val fs = FileSystem.get(sc.hadoopConfiguration)

//Create variables
val table_name = dbutils.widgets.get("table_name") // getting table name
val filePath = "dbfs:/mnt/datalake/" + table_name + "/" // path where original csv file name is located
val fileName = fs.globStatus(new Path(filePath+"part*"))(0).getPath.getName // getting original csv file name
val newfilename = table_name + ".txt" // renaming and transforming csv into txt
val curatedfilePath = "dbfs:/mnt/datalake/" + newfilename // curated path + new file name

//Move to curated folder
dbutils.fs.mv(filePath + fileName, curatedfilePath)

Here is the Python code:

%python

#Create variables
table_name = dbutils.widgets.get("table_name") # getting table name
filePath = "dbfs:/mnt/datalake/" + table_name + "/" # path where original csv file name is located
newfilename = table_name + ".txt" # transforming csv into txt
curatedfilePath = "dbfs:/mnt/datalake/" + newfilename # curated path + new file name

#Save CSV file
df_curated.coalesce(1).replace("", None).write.mode("overwrite").save(filePath,format='csv', delimiter='|', header=True, nullValue=None)

# getting original csv file name
for f in filePath:
            if f[1].startswith("part-00000"): 
                 original_file_name = f[1]

#move to curated folder
dbutils.fs.mv(filePath + fileName, curatedfilePath)

The "getting original csv file name" part is where it goes wrong. It throws the following error:

IndexError: string index out of range
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<command-3442953727364942> in <module>()
     11 # getting original csv file name
     12 for f in filePath:
---> 13             if f[1].startswith("part-00000"):
     14                  original_file_name = f[1]
     15 

IndexError: string index out of range

1 Answer:

Answer 0: (score: 2)

Your loop fails because "for f in filePath" iterates over the characters of the string filePath, so each f is a single character and f[1] raises the IndexError. What you want is to list the part files in the folder, which is what hadoop.fs.globStatus does in your Scala code.

In Python you can do the same by accessing hadoop.fs through the JVM, like this:

# Hadoop configuration and Path class, accessed through the JVM gateway
conf = sc._jsc.hadoopConfiguration()
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path

# list the part files in the output folder and take the name of the first one
part_files = Path(filePath).getFileSystem(conf).globStatus(Path(filePath + "/part*"))
file_name = part_files[0].getPath().getName()
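
Putting it together, here is a minimal sketch of the whole cell (assuming, as in your Scala version, that the writer produced exactly one part file in the folder):

%python

# access the Hadoop FileSystem classes through the JVM gateway
conf = sc._jsc.hadoopConfiguration()
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path

# Create variables
table_name = dbutils.widgets.get("table_name")         # getting table name
filePath = "dbfs:/mnt/datalake/" + table_name + "/"    # folder the CSV was written to
newfilename = table_name + ".txt"                      # renaming and transforming csv into txt
curatedfilePath = "dbfs:/mnt/datalake/" + newfilename  # curated path + new file name

# getting original csv file name (the part file Spark wrote)
part_files = Path(filePath).getFileSystem(conf).globStatus(Path(filePath + "part*"))
original_file_name = part_files[0].getPath().getName()

# move to curated folder
dbutils.fs.mv(filePath + original_file_name, curatedfilePath)

On Databricks you could also avoid the JVM gateway entirely by listing the folder with dbutils.fs.ls(filePath) and keeping the entry whose name starts with "part-".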