I have an input dataframe in which I want to collapse records of a similar type into a single record. For example, the input dataframe contains many procdata_* entries, and I want only one entry for them in the output dataframe, as shown below:
Input dataframe:
+-----------------------+----------+------+--------------------+--------------------+------------+------------+---------------+
|              File_name|Cycle_date|Status|         Source_time|         Target_time|Source_count|Target_count|Missing_Records|
+-----------------------+----------+------+--------------------+--------------------+------------+------------+---------------+
|procdata_20171223_f.csv|  20180911|  PASS|2018-12-05 10:37:10 |2018-12-05 10:37:12 |           5|           5|              0|
|procdata_20180421_f.csv|  20180911|  PASS|2018-12-05 10:37:10 |2018-12-05 10:37:12 |           5|           4|              1|
|procdata_20171007_f.csv|  20180911|  PASS|2018-12-05 10:37:12 |2018-12-05 10:37:12 |           6|           4|              2|
|procdata_20160423_f.csv|  20180911|  PASS|2018-12-05 10:37:14 |2018-12-05 10:37:15 |           4|           4|              0|
|procdata_20180106_f.csv|  20180911|  PASS|2018-12-05 10:37:15 |2018-12-05 10:37:15 |          10|           9|              1|
| rawdata_20180120_f.csv|  20180911|  PASS|2018-12-05 10:37:16 |2018-12-05 10:37:17 |          10|          10|              0|
| rawdata_20171202_f.csv|  20180911|  PASS|2018-12-05 10:37:17 |2018-12-05 10:37:18 |           2|           2|              0|
| rawdata_20151219_f.csv|  20180911|  PASS|2018-12-05 10:37:17 |2018-12-05 10:37:18 |          10|          10|              0|
| rawdata_20151031_f.csv|  20180911|  PASS|2018-12-05 10:37:17 |2018-12-05 10:37:18 |           8|           8|              0|
| rawdata_20170204_f.csv|  20180911|  PASS|2018-12-05 10:37:18 |2018-12-05 10:37:18 |          12|          10|              2|
|         itemweight.csv|  20180911|  FAIL|2018-12-05 10:37:18 |2018-12-05 10:37:19 |          10|          10|             10|
+-----------------------+----------+------+--------------------+--------------------+------------+------------+---------------+
Output dataframe:
+-----------------------+----------+------+--------------------+--------------------+------------+------------+---------------+
|              File_name|Cycle_date|Status|         Source_time|         Target_time|Source_count|Target_count|Missing_Records|
+-----------------------+----------+------+--------------------+--------------------+------------+------------+---------------+
|           procdata.csv|  20180911|  PASS|2018-12-05 10:37:10 |2018-12-05 10:37:15 |          30|          26|              4|
|            rawdata.csv|  20180911|  PASS|2018-12-05 10:37:16 |2018-12-05 10:37:18 |          42|          40|              2|
|         itemweight.csv|  20180911|  FAIL|2018-12-05 10:37:18 |2018-12-05 10:37:19 |          10|          10|             10|
+-----------------------+----------+------+--------------------+--------------------+------------+------------+---------------+
How can this be achieved in Spark?
Answer 0 (score: 2)
One way to solve this is to split the string in the File_name column on the underscore (_) and keep only the first part, then do a groupBy and aggregate the columns as needed.
The following does that; the aggregations can be changed to fit specific needs:
import org.apache.spark.sql.functions._
import spark.implicits._ // for the $"colName" syntax

// Collapse e.g. procdata_20171223_f.csv to procdata.csv, then aggregate per group.
df.withColumn("File_name", concat(split($"File_name", "_|\\.").getItem(0), lit(".csv")))
  .groupBy($"File_name")
  .agg(
    first($"Cycle_date") as "Cycle_date",
    first($"Status") as "Status",
    first($"Source_time") as "Source_time",
    last($"Target_time") as "Target_time",
    sum($"Source_count") as "Source_count",
    sum($"Target_count") as "Target_count",
    sum($"Missing_Records") as "Missing_Records"
  )
The code above also splits on the dot (.) and appends the .csv part back afterwards; this conveniently handles File_name values that do not contain an underscore at all (such as itemweight.csv).
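One caveat worth noting (an addition, not part of the original answer): first and last are not deterministic in Spark unless the rows are explicitly ordered. If the intent is the earliest Source_time and the latest Target_time per group, min and max express that directly; for string timestamps in yyyy-MM-dd HH:mm:ss format the lexicographic comparison matches chronological order. A minimal variant, assuming the same df:

import org.apache.spark.sql.functions._
import spark.implicits._

df.withColumn("File_name", concat(split($"File_name", "_|\\.").getItem(0), lit(".csv")))
  .groupBy($"File_name")
  .agg(
    first($"Cycle_date") as "Cycle_date",
    first($"Status") as "Status",
    min($"Source_time") as "Source_time",    // earliest time in the group, order-independent
    max($"Target_time") as "Target_time",    // latest time in the group, order-independent
    sum($"Source_count") as "Source_count",
    sum($"Target_count") as "Target_count",
    sum($"Missing_Records") as "Missing_Records"
  )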
Answer 1 (score: 0)
You can transform the File_name column with a regular expression to get procdata/rawdata, then use the row_number window function to select just one row per group. Check this out:
scala> import org.apache.spark.sql.expressions._
import org.apache.spark.sql.expressions._
scala> :paste
// Entering paste mode (ctrl-D to finish)
val df = Seq(
  ("procdata_20171223_f.csv","20180911","PASS","2018-12-05 10:37:10","2018-12-05 10:37:12","5","5","0"),
  ("procdata_20180421_f.csv","20180911","PASS","2018-12-05 10:37:10","2018-12-05 10:37:12","5","4","1"),
  ("procdata_20171007_f.csv","20180911","PASS","2018-12-05 10:37:12","2018-12-05 10:37:12","6","4","2"),
  ("procdata_20160423_f.csv","20180911","PASS","2018-12-05 10:37:14","2018-12-05 10:37:15","4","4","0"),
  ("procdata_20180106_f.csv","20180911","PASS","2018-12-05 10:37:15","2018-12-05 10:37:15","10","9","1"),
  ("rawdata_20180120_f.csv","20180911","PASS","2018-12-05 10:37:16","2018-12-05 10:37:17","10","10","0"),
  ("rawdata_20171202_f.csv","20180911","PASS","2018-12-05 10:37:17","2018-12-05 10:37:18","2","2","0"),
  ("rawdata_20151219_f.csv","20180911","PASS","2018-12-05 10:37:17","2018-12-05 10:37:18","10","10","0"),
  ("rawdata_20151031_f.csv","20180911","PASS","2018-12-05 10:37:17","2018-12-05 10:37:18","8","8","0"),
  ("rawdata_20170204_f.csv","20180911","PASS","2018-12-05 10:37:18","2018-12-05 10:37:18","12","10","2"),
  ("itemweight.csv","20180911","FAIL","2018-12-05 10:37:18","2018-12-05 10:37:19","10","10","10")
).toDF("File_name","Cycle_date","Status","Source_time","Target_time","Source_count","Target_count","Missing_Records")
// Exiting paste mode, now interpreting.
df: org.apache.spark.sql.DataFrame = [File_name: string, Cycle_date: string ... 6 more fields]
scala> df.withColumn("File_name",regexp_replace('File_name,"""_.*""",".csv")).withColumn("row1",row_number().over(Window.partitionBy('File_name).orderBy('File_name))).filter(" row1=1").drop("row1").show(false)
+--------------+----------+------+-------------------+-------------------+------------+------------+---------------+
|File_name |Cycle_date|Status|Source_time |Target_time |Source_count|Target_count|Missing_Records|
+--------------+----------+------+-------------------+-------------------+------------+------------+---------------+
|rawdata.csv |20180911 |PASS |2018-12-05 10:37:17|2018-12-05 10:37:18|10 |10 |0 |
|procdata.csv |20180911 |PASS |2018-12-05 10:37:14|2018-12-05 10:37:15|4 |4 |0 |
|itemweight.csv|20180911 |FAIL |2018-12-05 10:37:18|2018-12-05 10:37:19|10 |10 |10 |
+--------------+----------+------+-------------------+-------------------+------------+------------+---------------+
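A note on this answer (an addition, not from the original): Window.partitionBy('File_name).orderBy('File_name) orders each partition by the very column it is partitioned on, so every row in the partition ties and the row that row_number labels 1 is arbitrary. Ordering the window by another column, for example Source_time, makes the selection deterministic. A sketch under that assumption, reusing the same df (the symbol syntax relies on the REPL's pre-imported implicits):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Order each File_name group by Source_time so row_number deterministically
// picks the earliest row rather than an arbitrary one.
val byGroup = Window.partitionBy('File_name).orderBy('Source_time)

df.withColumn("File_name", regexp_replace('File_name, """_.*""", ".csv"))
  .withColumn("row1", row_number().over(byGroup))
  .filter('row1 === 1)
  .drop("row1")
  .show(false)

Keep in mind that, unlike Answer 0, this approach keeps a single representative row per group and does not sum the count columns.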