Spark count and percentage for every column value, with exception handling and loading to a Hive DB

Asked: 2019-01-24 15:37:07

Tags: scala apache-spark hadoop hive apache-spark-sql

In the Scala Spark code below I need to find the count of the distinct values in different columns and the percentage each value represents. For that I have to apply the same withColumn chain to date, usage and payment on the DataFrames dateFinal, usageFinal and paymentFinal.

For every calculation I need the sum and the total. Is there any way I do not need to write the withColumn chain every time? For example, as shown in the code below.

My code is currently as shown below. Could you help add the conditions for the different columns (for example date, usage, and so on)? In the code we pick up the column containing the date and add the count and the other columns; now we would like this to be dynamic: all the column names should be kept in a YML file and must be read from that file. How can I achieve this, and how would I need to modify my code after reading the YML file? Any help is appreciated.

.withColumn("SUM", sum("count").over() ).withColumn("fraction", col("count") / sum("count").over()).withColumn("Percent", col("fraction") * 100 ).drop("fraction")

2 Answers:

Answer 0: (score: 1)

You can wrap all of the .withColumn() operations in a single function that returns the DataFrame after applying them all:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, sum}

// Adds the overall total and the percentage each row's count represents to an already-counted DataFrame
def getCountPercent(df: DataFrame): DataFrame = {
  df.withColumn("SUM", sum("count").over())
    .withColumn("fraction", col("count") / sum("count").over())
    .withColumn("Percent", col("fraction") * 100)
    .drop("fraction")
}

Usage:

Apply the function with .transform():

var dateFinalDF = dateFinal.toDF(DateColumn).groupBy(DateColumn).count.transform(getCountPercent)
var usageFinalDF = usageFinal.toDF(UsageColumn).groupBy(UsageColumn).count.transform(getCountPercent)
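Neither answer covers the YML part of the question, so here is a minimal sketch of driving the column list from a file. It assumes SnakeYAML is on the classpath and a hypothetical columns.yml with the names listed under a columns key; the file name, the key and the library choice are illustrative, and df stands for the source DataFrame that holds all of the configured columns:

import java.io.FileInputStream
import scala.collection.JavaConverters._
import org.apache.spark.sql.DataFrame
import org.yaml.snakeyaml.Yaml

// columns.yml (hypothetical):
//   columns:
//     - date
//     - usage
//     - payment
val loaded: java.util.Map[String, java.util.List[String]] =
  new Yaml().load(new FileInputStream("columns.yml"))
val columnNames = loaded.get("columns").asScala.toList

// Build one count/percentage DataFrame per configured column, reusing getCountPercent
val results: Map[String, DataFrame] = columnNames.map { c =>
  c -> df.groupBy(c).count.transform(getCountPercent)
}.toMap

// e.g. results("date").show(false)
// If the results need to land in Hive, something along these lines (table names are illustrative):
// results.foreach { case (c, r) => r.write.mode("overwrite").saveAsTable(s"mydb.${c}_counts") }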

Answer 1: (score: 0)

Another way... a Scala collections zip/map style :-)

scala> val df = Seq((10,20,30),(15,25,35)).toDF("date", "usage", "payment")
df: org.apache.spark.sql.DataFrame = [date: int, usage: int ... 1 more field]

scala> df.show(false)
+----+-----+-------+
|date|usage|payment|
+----+-----+-------+
|10  |20   |30     |
|15  |25   |35     |
+----+-----+-------+


scala> df.columns
res75: Array[String] = Array(date, usage, payment)

scala> var df2,df3,df4 = df
df2: org.apache.spark.sql.DataFrame = [date: int, usage: int ... 1 more field]
df3: org.apache.spark.sql.DataFrame = [date: int, usage: int ... 1 more field]
df4: org.apache.spark.sql.DataFrame = [date: int, usage: int ... 1 more field]

scala> val arr_all = Array(df2,df3,df4).zip(df.columns).map( d => d._1.groupBy(d._2).count.withColumn("sum",sum('count).over()).withColumn("fraction", col("count") / sum("count").over()).withColumn("Percent", col("fraction") * 100 ).drop("fraction") )
arr_all: Array[org.apache.spark.sql.DataFrame] = Array([date: int, count: bigint ... 2 more fields], [usage: int, count: bigint ... 2 more fields], [payment: int, count: bigint ... 2 more fields])

scala> val Array(dateFinalDF,usageFinalDF,paymentFinalDF) = arr_all
dateFinalDF: org.apache.spark.sql.DataFrame = [date: int, count: bigint ... 2 more fields]
usageFinalDF: org.apache.spark.sql.DataFrame = [usage: int, count: bigint ... 2 more fields]
paymentFinalDF: org.apache.spark.sql.DataFrame = [payment: int, count: bigint ... 2 more fields]

scala> dateFinalDF.show(false)
2019-01-25 04:10:10 WARN  WindowExec:66 - No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+----+-----+---+-------+
|date|count|sum|Percent|
+----+-----+---+-------+
|15  |1    |2  |50.0   |
|10  |1    |2  |50.0   |
+----+-----+---+-------+


scala> usageFinalDF.show(false)
2019-01-25 04:10:20 WARN  WindowExec:66 - No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+-----+-----+---+-------+
|usage|count|sum|Percent|
+-----+-----+---+-------+
|20   |1    |2  |50.0   |
|25   |1    |2  |50.0   |
+-----+-----+---+-------+


scala> paymentFinalDF.show(false)
2019-01-25 04:10:50 WARN  WindowExec:66 - No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+-------+-----+---+-------+
|payment|count|sum|Percent|
+-------+-----+---+-------+
|35     |1    |2  |50.0   |
|30     |1    |2  |50.0   |
+-------+-----+---+-------+


scala>

Note that I have broken it out and declared var df2,df3,df4 = df so that the steps are easier to follow.

They can all be combined like this:

scala> val Array(dateFinalDF,usageFinalDF,paymentFinalDF) = Array(df,df,df).zip(df.columns).map( d => d._1.groupBy(d._2).count.withColumn("sum",sum('count).over()).withColumn("fraction", col("count") / sum("count").over()).withColumn("Percent", col("fraction") * 100 ).drop("fraction") )
dateFinalDF: org.apache.spark.sql.DataFrame = [date: int, count: bigint ... 2 more fields]
usageFinalDF: org.apache.spark.sql.DataFrame = [usage: int, count: bigint ... 2 more fields]
paymentFinalDF: org.apache.spark.sql.DataFrame = [payment: int, count: bigint ... 2 more fields]

scala>
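For completeness, here is a sketch of the same logic that skips the df2/df3/df4 copies entirely and just maps over df.columns (or over a list of names read from a config file, as in the sketch under the first answer); the transformation is the one above, only the iteration style differs:

import org.apache.spark.sql.functions.{col, sum}

// One count/sum/Percent DataFrame per column of df, without zipping against copies of df
val allCounts = df.columns.map { c =>
  df.groupBy(c).count
    .withColumn("sum", sum(col("count")).over())
    .withColumn("fraction", col("count") / sum(col("count")).over())
    .withColumn("Percent", col("fraction") * 100)
    .drop("fraction")
}

val Array(dateFinalDF, usageFinalDF, paymentFinalDF) = allCounts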