In the Scala Spark code below I need to find, for several columns, the count of each value and the percentage it represents. To do that I currently repeat the same withColumn chain for date, usage, and payment (the dateFinal, usageFinal, and paymentFinal DataFrames), computing the sum and the aggregation each time. Is there a way to avoid writing the same withColumn chain for every column, for example the one shown in the code below?
My current code is below. Can you help me apply the count and the other conditions to the different columns (date, usage, and so on) the way the code already does for the column containing the date, but dynamically? All the column names should be kept in a YML file and read from there. How can I achieve this, and how should I modify my code after reading the YML file?
.withColumn("SUM", sum("count").over() ).withColumn("fraction", col("count") / sum("count").over()).withColumn("Percent", col("fraction") * 100 ).drop("fraction")
Answer 0 (score: 1)
You can wrap all of the .withColumn() operations in a single function that applies them and returns the resulting DataFrame.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, sum}

// Adds the total count, the fraction, and the percentage to a counted DataFrame.
def getCountPercent(df: DataFrame): DataFrame = {
  df.withColumn("SUM", sum("count").over())
    .withColumn("fraction", col("count") / sum("count").over())
    .withColumn("Percent", col("fraction") * 100)
    .drop("fraction")
}
Usage: apply the function with .transform():
var dateFinalDF = dateFinal.toDF(DateColumn).groupBy(DateColumn).count.transform(getCountPercent)
var usageFinalDF = usageFinal.toDF(UsageColumn).groupBy(UsageColumn).count.transform(getCountPercent)
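The question also asks for the column names to come from a YML file instead of being hard-coded, which this answer does not cover. As a rough sketch only (assuming the SnakeYAML library is on the classpath, a single source DataFrame df, and a hypothetical columns.yml containing a columns: list; none of these names are from the original answer), the getCountPercent function above can be reused for every configured column:
import java.io.FileInputStream
import scala.collection.JavaConverters._
import org.yaml.snakeyaml.Yaml
import org.apache.spark.sql.DataFrame

// Hypothetical columns.yml:
//   columns:
//     - date
//     - usage
//     - payment
val yamlData: Object = new Yaml().load(new FileInputStream("columns.yml"))
val columnNames: Seq[String] = yamlData
  .asInstanceOf[java.util.Map[String, java.util.List[String]]]
  .get("columns").asScala.toSeq

// Reuse getCountPercent for every column listed in the YML file.
val countsByColumn: Map[String, DataFrame] =
  columnNames.map(name => name -> df.groupBy(name).count.transform(getCountPercent)).toMap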
Answer 1 (score: 0)
Another way: Scala-collections zip/map style :-)
scala> val df = Seq((10,20,30),(15,25,35)).toDF("date", "usage", "payment")
df: org.apache.spark.sql.DataFrame = [date: int, usage: int ... 1 more field]
scala> df.show(false)
+----+-----+-------+
|date|usage|payment|
+----+-----+-------+
|10 |20 |30 |
|15 |25 |35 |
+----+-----+-------+
scala> df.columns
res75: Array[String] = Array(date, usage, payment)
scala> var df2,df3,df4 = df
df2: org.apache.spark.sql.DataFrame = [date: int, usage: int ... 1 more field]
df3: org.apache.spark.sql.DataFrame = [date: int, usage: int ... 1 more field]
df4: org.apache.spark.sql.DataFrame = [date: int, usage: int ... 1 more field]
scala> val arr_all = Array(df2,df3,df4).zip(df.columns).map( d => d._1.groupBy(d._2).count.withColumn("sum",sum('count).over()).withColumn("fraction", col("count") / sum("count").over()).withColumn("Percent", col("fraction") * 100 ).drop("fraction") )
arr_all: Array[org.apache.spark.sql.DataFrame] = Array([date: int, count: bigint ... 2 more fields], [usage: int, count: bigint ... 2 more fields], [payment: int, count: bigint ... 2 more fields])
scala> val Array(dateFinalDF,usageFinalDF,paymentFinalDF) = arr_all
dateFinalDF: org.apache.spark.sql.DataFrame = [date: int, count: bigint ... 2 more fields]
usageFinalDF: org.apache.spark.sql.DataFrame = [usage: int, count: bigint ... 2 more fields]
paymentFinalDF: org.apache.spark.sql.DataFrame = [payment: int, count: bigint ... 2 more fields]
scala> dateFinalDF.show(false)
2019-01-25 04:10:10 WARN WindowExec:66 - No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+----+-----+---+-------+
|date|count|sum|Percent|
+----+-----+---+-------+
|15 |1 |2 |50.0 |
|10 |1 |2 |50.0 |
+----+-----+---+-------+
scala> usageFinalDF.show(false)
2019-01-25 04:10:20 WARN WindowExec:66 - No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+-----+-----+---+-------+
|usage|count|sum|Percent|
+-----+-----+---+-------+
|20 |1 |2 |50.0 |
|25 |1 |2 |50.0 |
+-----+-----+---+-------+
scala> paymentFinalDF.show(false)
2019-01-25 04:10:50 WARN WindowExec:66 - No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+-------+-----+---+-------+
|payment|count|sum|Percent|
+-------+-----+---+-------+
|35 |1 |2 |50.0 |
|30 |1 |2 |50.0 |
+-------+-----+---+-------+
scala>
Note that I broke this out with var df2,df3,df4 = df so the steps are easier to follow. It can all be combined like this:
scala> val Array(dateFinalDF,usageFinalDF,paymentFinalDF) = Array(df,df,df).zip(df.columns).map( d => d._1.groupBy(d._2).count.withColumn("sum",sum('count).over()).withColumn("fraction", col("count") / sum("count").over()).withColumn("Percent", col("fraction") * 100 ).drop("fraction") )
dateFinalDF: org.apache.spark.sql.DataFrame = [date: int, count: bigint ... 2 more fields]
usageFinalDF: org.apache.spark.sql.DataFrame = [usage: int, count: bigint ... 2 more fields]
paymentFinalDF: org.apache.spark.sql.DataFrame = [payment: int, count: bigint ... 2 more fields]
scala>
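Duplicating the DataFrame (Array(df,df,df)) only serves to pair one copy with each column name; the same result can be obtained by mapping directly over the column names, which also makes it easy to plug in names read from a config file. A sketch of that variant (same logic as the answer above; the column list here is only illustrative):
import org.apache.spark.sql.functions.{col, sum}

// The column list could be df.columns or a list read from a YML/config file.
val columnsToCount = Array("date", "usage", "payment")
val Array(dateFinalDF, usageFinalDF, paymentFinalDF) = columnsToCount.map { c =>
  df.groupBy(c).count
    .withColumn("sum", sum("count").over())
    .withColumn("fraction", col("count") / sum("count").over())
    .withColumn("Percent", col("fraction") * 100)
    .drop("fraction")
}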