I have a CSV file in Azure Blob storage containing the details below.
Based on the TRUE/FALSE flag, I have to take the Year/Month as parameters to find the source folder for a copy activity, as part of the string "Folder\Year\Month*.csv".
These parameters should be passed in a loop as the source string, to collect the files present in the folder and copy them to my destination folder.
I want to obtain the values in a loop, build the source string, and pass it as a variable. I do not want to update the CSV with a new column "Foldercolumn", or create a new DataFrame on the basis of all the records.
+-------------+--------------+--------------------+-----------------+--------+
|Calendar_year|Calendar_month|EDAP_Data_Load_Statu|lake_refined_date|isreload|
+-------------+--------------+--------------------+-----------------+--------+
| 2018| 12| HIST| 20190829| FALSE|
| 2019| 1| HIST| 20190829| FALSE|
| 2019| 2| HIST| 20190829| FALSE|
| 2019| 3| HIST| 20190829| TRUE|
| 2019| 4| HIST| 20190829| FALSE|
| 2019| 5| HIST| 20190829| TRUE|
| 2019| 11| HIST| 20190829| FALSE|
+-------------+--------------+--------------------+-----------------+--------+
Below is my starter code for the above requirement:
val destinationContainerPath= "Finance/Data"
val dfCSVLogs = readCSV(s"$destinationContainerPath/sourcecsv.csv")
val dfTRUEcsv = dfCSVLogs.select(dfCSVLogs.col("*")).filter("isreload =='TRUE'")
Get the concatenated string from these columns:
IF isreload == 'TRUE'
    strFoldercolumn = Calendar_year/Calendar_month
    strFoldercolumn = 2019/03
    strFoldercolumn = 2019/05
END IF
Otherwise, by default, take the max value and use it as the parameter:
var strFoldercolumn = max(Calendar_year)/max(Calendar_month)
strFoldercolumn = 2019/11
I have to loop over each strFoldercolumn value, collect the files from that folder, and copy them to another destination in the storage blob.
Answer 0 (score: 0)
//read input control CSV file
scala> val df = spark.read.format("csv").option("header", "true").load("file.csv")
scala> df.show(false)
+-------------+--------------+--------------------+-----------------+--------+
|Calendar_year|Calendar_month|EDAP_Data_Load_Statu|lake_refined_date|isreload|
+-------------+--------------+--------------------+-----------------+--------+
|2018 |12 |HIST |20190829 |FALSE |
|2019 |2 |HIST |20190829 |FALSE |
|2019 |3 |HIST |20190829 |TRUE |
|2019 |4 |HIST |20190829 |FALSE |
|2019 |11 |HIST |20190829 |FALSE |
|2019 |5 |HIST |20190829 |TRUE |
+-------------+--------------+--------------------+-----------------+--------+
//initialize variable for max year and month
//note: the execution below can be modified based on your requirement; simply use filter to get the max for a particular condition
scala> val maxYearMonth = df.select(struct(col("Calendar_year").cast("Int"), col("Calendar_month").cast("Int")) as "ym")
     |   .agg(max("ym") as "max")
     |   .selectExpr("stack(1, max.col1, max.col2) as (year, month)")
     |   .select(concat(col("year"), lit("/"), col("month")))
     |   .rdd.collect.map(r => r(0)).mkString
maxYearMonth: String = 2019/11
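As a side note, the same max year/month string can be obtained a bit more directly (a sketch under the same schema assumptions, not part of the original answer; an orderBy on the cast columns replaces the struct/stack trick):

```scala
// Alternative sketch: sort by year and month descending (as integers) and
// take the first row to get the latest "year/month" string.
val maxYearMonthAlt = df
  .orderBy(col("Calendar_year").cast("Int").desc, col("Calendar_month").cast("Int").desc)
  .select(concat(col("Calendar_year"), lit("/"), col("Calendar_month")))
  .first
  .getString(0)   // "2019/11" for the sample data
```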
//Add a temporary column to the input DataFrame
scala> val df2 = df.withColumn("strFoldercolumn", when(col("isreload") === "TRUE", concat(col("Calendar_year"), lit("/"),col("Calendar_month"))).otherwise(lit(maxYearMonth)))
scala> df2.show(false)
+-------------+--------------+--------------------+-----------------+--------+---------------+
|Calendar_year|Calendar_month|EDAP_Data_Load_Statu|lake_refined_date|isreload|strFoldercolumn|
+-------------+--------------+--------------------+-----------------+--------+---------------+
|2018         |12            |HIST                |20190829         |FALSE   |2019/11        |
|2019         |2             |HIST                |20190829         |FALSE   |2019/11        |
|2019         |3             |HIST                |20190829         |TRUE    |2019/3         |
|2019         |4             |HIST                |20190829         |FALSE   |2019/11        |
|2019         |11            |HIST                |20190829         |FALSE   |2019/11        |
|2019         |5             |HIST                |20190829         |TRUE    |2019/5         |
+-------------+--------------+--------------------+-----------------+--------+---------------+
//collect the distinct values of column strFoldercolumn into a list variable
scala> val strFoldercolumn = df2.select("strFoldercolumn").distinct.rdd.collect.toList
strFoldercolumn: List[org.apache.spark.sql.Row] = List([2019/5], [2019/11], [2019/3])
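Since the question explicitly asks to avoid adding a new column, the same list can also be built without `withColumn` (a sketch under the same assumptions; `maxYearMonth` is the value computed above):

```scala
// Sketch: take the year/month of every TRUE row, then append the max
// year/month for the default case, deduplicating the result.
val trueFolders = df
  .filter(col("isreload") === "TRUE")
  .select(concat(col("Calendar_year"), lit("/"), col("Calendar_month")) as "ym")
  .collect
  .map(_.getString(0))
  .toList
val strFolderList = (trueFolders :+ maxYearMonth).distinct
// e.g. List("2019/3", "2019/5", "2019/11") for the sample data
```

This yields plain `String`s rather than `Row`s, which avoids the bracketed `Row.toString` output when building paths in the loop.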
//looping over each value
//note: each element is a Row, so use getString(0) — Row.toString would yield "[2019/5]" with brackets
scala> strFoldercolumn.foreach { x =>
     |   val csvPath = "folder/" + x.getString(0) + "/*.csv"
     |   val srcdf = spark.read.format("csv").option("header", "true").load(csvPath)
     |   // Write logic to copy or write srcdf to your destination folder
     | }
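The copy step left as a comment above could be completed, for example, by writing each source DataFrame back out as CSV. This is a sketch, not part of the original answer; the destination root `destinationFolder/` is an assumption:

```scala
// Hypothetical completion of the loop body: read the CSVs for one
// year/month folder and write them to a mirrored destination folder.
strFoldercolumn.foreach { x =>
  val ym = x.getString(0)                       // e.g. "2019/3"
  val csvPath = "folder/" + ym + "/*.csv"
  val srcdf = spark.read.format("csv").option("header", "true").load(csvPath)
  srcdf.write
    .mode("overwrite")                          // replace any previous output
    .option("header", "true")
    .csv("destinationFolder/" + ym)             // assumed destination root
}
```

Note that `DataFrameWriter.csv` writes a folder of part files rather than a single CSV; if a byte-for-byte file copy is required instead, the Hadoop `FileSystem` API (or the Azure copy activity itself) would be the better fit.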