I'm using the Scala code below to rename a CSV file to a TXT file and move the TXT file. I need to convert this code to Python/PySpark but I'm running into a problem (I'm not fluent in Python). Any help would be greatly appreciated. Thanks in advance!
//Prepare to rename file
import org.apache.hadoop.fs._
import org.apache.hadoop.fs.{FileSystem, Path}
val fs = FileSystem.get(sc.hadoopConfiguration)
//Create variables
val table_name = dbutils.widgets.get("table_name") // getting table name
val filePath = "dbfs:/mnt/datalake/" + table_name + "/" // path where original csv file name is located
val fileName = fs.globStatus(new Path(filePath+"part*"))(0).getPath.getName // getting original csv file name
val newfilename = table_name + ".txt" // renaming and transforming csv into txt
val curatedfilePath = "dbfs:/mnt/datalake/" + newfilename // curated path + new file name
//Move to curated folder
dbutils.fs.mv(filePath + fileName, curatedfilePath)
Here is the Python code:
%python
#Create variables
table_name = dbutils.widgets.get("table_name") # getting table name
filePath = "dbfs:/mnt/datalake/" + table_name + "/" # path where original csv file name is located
newfilename = table_name + ".txt" # transforming csv into txt
curatedfilePath = "dbfs:/mnt/datalake/" + newfilename # curated path + new file name
#Save CSV file
df_curated.coalesce(1).replace("", None).write.mode("overwrite").save(filePath,format='csv', delimiter='|', header=True, nullValue=None)
# getting original csv file name
for f in filePath:
    if f[1].startswith("part-00000"):
        original_file_name = f[1]
#move to curated folder
dbutils.fs.mv(filePath + fileName, curatedfilePath)
The problem is in the "getting original csv file name" part. It throws the following error:
IndexError: string index out of range
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<command-3442953727364942> in <module>()
     11 # getting original csv file name
     12 for f in filePath:
---> 13     if f[1].startswith("part-00000"):
     14         original_file_name = f[1]
     15
IndexError: string index out of range
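For context, the error occurs because the loop iterates over the string `filePath` character by character, so each `f` is a one-character string and `f[1]` is past its end. A minimal reproduction (the path value is illustrative):

```python
filePath = "dbfs:/mnt/datalake/my_table/"  # illustrative value
try:
    for f in filePath:
        # f is a single character such as "d"; f[1] is out of range
        if f[1].startswith("part-00000"):
            pass
except IndexError as e:
    error = e
print(type(error).__name__)  # IndexError
```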
Answer 0 (score: 2)
In the Scala code, you use `hadoop.fs.globStatus` to list the part files in the folder where the DataFrame was saved. In Python, you can do the same thing by accessing `hadoop.fs` through the JVM:
# Access the Hadoop FileSystem API through the JVM gateway
conf = sc._jsc.hadoopConfiguration()
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
# List the part files written by the DataFrame save
part_files = Path(filePath).getFileSystem(conf).globStatus(Path(filePath + "/part*"))
file_name = part_files[0].getPath().getName()
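Alternatively, the loop in the question can be fixed by listing the directory with `dbutils.fs.ls`, which returns `FileInfo` objects, instead of iterating over the path string. A minimal sketch of the selection logic (`pick_part_file` is a hypothetical helper; the listing is simulated here with plain names):

```python
def pick_part_file(names):
    """Return the first file name that starts with 'part-', or None."""
    for name in names:
        if name.startswith("part-"):
            return name
    return None

# On Databricks the names would come from the folder listing, e.g.:
#   names = [f.name for f in dbutils.fs.ls(filePath)]
names = ["_SUCCESS", "part-00000-tid-123.csv", "_committed_1"]  # simulated listing
original_file_name = pick_part_file(names)
print(original_file_name)  # part-00000-tid-123.csv
```

With the file name in hand, the move itself stays the same as in the question: `dbutils.fs.mv(filePath + original_file_name, curatedfilePath)`.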