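I want to use three columns for a calculation and produce a single column showing all three values

Time: 2018-11-21 15:54:30

Tags: scala apache-spark apache-spark-sql

I am loading data into a DataFrame in Spark on Databricks: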

spark.sql("""select A,X,Y,Z from fruits""")



   A    X       Y       Z
   1E5  1.000   0.000   0.000
   1U2  2.000   5.000   0.000
   5G6  3.000   0.000   10.000

I need the output to be as follows, where D lists each non-zero column name together with its value:

   A      D
   1E5    X 1
   1U2    X 2, Y 5
   5G6    X 3, Z 10

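I was able to find a solution.

2 answers:

Answer 0 (score: 0)

Each column name can be concatenated with its value (skipping zero values), and the resulting strings can then be joined into a single column, separated by commas: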

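// imports (in spark-shell or a notebook, spark.implicits._ is usually already in scope)
import org.apache.spark.sql.functions.{col, concat, length, lit, when}
import org.apache.spark.sql.types.IntegerType
import spark.implicits._

// data
val df = Seq(
  ("1E5", 1.000, 0.000, 0.000),
  ("1U2", 2.000, 5.000, 0.000),
  ("5G6", 3.000, 0.000, 10.000))
  .toDF("A", "X", "Y", "Z")

// action: build "<name> <value>" for each non-zero column, an empty string otherwise
val columnsToConcat = List("X", "Y", "Z")
val columnNameValueList = columnsToConcat.map(c =>
  when(col(c) =!= 0, concat(lit(c), lit(" "), col(c).cast(IntegerType)))
    .otherwise("")
)
// join the pieces pairwise, inserting ", " only when both sides are non-empty
val valuesJoinedByCommaColumn = columnNameValueList.reduce((a, b) =>
  when(length(a) =!= 0 && length(b) =!= 0, concat(a, lit(", "), b))
    .otherwise(concat(a, b))
)
val result = df.withColumn("D", valuesJoinedByCommaColumn)
  .drop(columnsToConcat: _*)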

Output:

+---+---------+
|A  |D        |
+---+---------+
|1E5|X 1      |
|1U2|X 2, Y 5 |
|5G6|X 3, Z 10|
+---+---------+

This solution is similar to the one proposed by stack0114106, but looks more explicit.
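A possible variant of the same idea, offered as a sketch rather than as part of the original answer: `when` without `otherwise` yields null for zero values, and `concat_ws` skips null inputs, so the pairwise `reduce` may not be needed. The local `SparkSession` setup and the app name are assumptions for running outside a notebook; on Databricks, `getOrCreate()` should simply return the existing session.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat, concat_ws, lit, when}

// Assumption: a local session for illustration only.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("concat-ws-sketch")
  .getOrCreate()
import spark.implicits._

// Same sample data as in the question.
val df = Seq(
  ("1E5", 1.000, 0.000, 0.000),
  ("1U2", 2.000, 5.000, 0.000),
  ("5G6", 3.000, 0.000, 10.000)
).toDF("A", "X", "Y", "Z")

val columnsToConcat = List("X", "Y", "Z")

// No otherwise(...): a zero value produces null instead of an empty string.
val labelled = columnsToConcat.map { c =>
  when(col(c) =!= 0, concat(lit(c), lit(" "), col(c).cast("int").cast("string")))
}

// concat_ws joins only the non-null pieces with ", ".
val result = df
  .withColumn("D", concat_ws(", ", labelled: _*))
  .drop(columnsToConcat: _*)

result.show(false)
```

With this approach a row where every value is zero would get an empty string in `D`, since `concat_ws` has nothing to join.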

Answer 1 (score: -1)

Check this out:

scala>  val df =  Seq(("1E5",1.000,0.000,0.000),("1U2",2.000,5.000,0.000),("5G6",3.000,0.000,10.000)).toDF("A","X","Y","Z")
df: org.apache.spark.sql.DataFrame = [A: string, X: double ... 2 more fields]

scala> df.show()
+---+---+---+----+
|  A|  X|  Y|   Z|
+---+---+---+----+
|1E5|1.0|0.0| 0.0|
|1U2|2.0|5.0| 0.0|
|5G6|3.0|0.0|10.0|
+---+---+---+----+

scala> val newcol = df.columns.drop(1).map( x=> when(col(x)===0,lit("")).otherwise(concat(lit(x),lit(" "),col(x).cast("int").cast("string"))) ).reduce( (x,y) => concat(x,lit(", "),y) )
newcol: org.apache.spark.sql.Column = concat(concat(CASE WHEN (X = 0) THEN  ELSE concat(X,  , CAST(CAST(X AS INT) AS STRING)) END, , , CASE WHEN (Y = 0) THEN  ELSE concat(Y,  , CAST(CAST(Y AS INT) AS STRING)) END), , , CASE WHEN (Z = 0) THEN  ELSE concat(Z,  , CAST(CAST(Z AS INT) AS STRING)) END)

scala> df.withColumn("D",newcol).withColumn("D",regexp_replace(regexp_replace('D,", ,",","),", $", "")).drop("X","Y","Z").show(false)
+---+---------+
|A  |D        |
+---+---------+
|1E5|X 1      |
|1U2|X 2, Y 5 |
|5G6|X 3, Z 10|
+---+---------+
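One thing to watch with the regex cleanup: it covers the three sample rows, but a row whose first value is zero (for example X = 0, Y = 5, Z = 0) appears to keep a leading ", ". The plain-Scala sketch below (no Spark) illustrates this, using `String.replaceAll` as a stand-in for `regexp_replace` on the assumption that both follow Java regex semantics. The helper `describeNonZero` is hypothetical and only shows the filter-then-join alternative, which needs no cleanup.

```scala
// Hypothetical helper: keep only non-zero values, label them, then join with ", ".
def describeNonZero(values: Seq[(String, Double)]): String =
  values
    .collect { case (name, v) if v != 0.0 => s"$name ${v.toInt}" }
    .mkString(", ")

// What the concat-based column would contain for X = 0, Y = 5, Z = 0.
val raw = ", Y 5, "

// The same two replacements as in the answer, applied to a plain string.
val cleaned = raw.replaceAll(", ,", ",").replaceAll(", $", "")
println(cleaned) // prints ", Y 5" (leading separator survives)

// Filtering before joining avoids the stray separator.
val joined = describeNonZero(Seq("X" -> 0.0, "Y" -> 5.0, "Z" -> 0.0))
println(joined) // prints "Y 5"
```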

