Pivoting a DataFrame - Spark SQL

Date: 2017-06-13 12:21:57

Tags: java scala apache-spark apache-spark-sql pivot

I have a DataFrame with the following contents:

TradeId|Source
ABC|"USD,333.123,20170605|USD,-789.444,20170605|GBP,1234.567,20150602"

I want to pivot this data so that it turns into the following:

TradeId|CCY|PV
ABC|USD|333.123
ABC|USD|-789.444
ABC|GBP|1234.567

The number of CCY|PV|Date triplets in the "Source" column is not fixed. I could do this with an ArrayList, but that would require loading the data into the JVM and would defeat the whole point of Spark.

Let's say my DataFrame looks like this:

// Load the trades snapshot and register it as a temp table so it can be queried with SQL
DataFrame tradesSnap = this.loadTradesSnap(reportRequest);
String tempTable = getTempTableName();
tradesSnap.registerTempTable(tempTable);
tradesSnap = tradesSnap.sqlContext().sql("SELECT TradeId, Source FROM " + tempTable);

2 Answers:

Answer 0 (score: 2):

What you are trying to achieve looks more like a flatMap than a pivot.

Simply put, by using flatMap on a Dataset you apply to each row a function (map) which itself produces a sequence of rows; each resulting sequence of rows is then concatenated into a single one (flat).
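
As a minimal sketch of these semantics (a toy example, assuming a SparkSession named spark and import spark.implicits._ in scope):

// Each input row yields several output rows; flatMap concatenates them all
val ds = Seq("a,b", "c,d,e").toDS()  // Dataset[String] with 2 rows
val flat = ds.flatMap(_.split(","))  // Dataset[String] with 5 rows: a, b, c, d, e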

The following program shows the idea:

import org.apache.spark.sql.SparkSession

case class Input(TradeId: String, Source: String)

case class Output(TradeId: String, CCY: String, PV: String, Date: String)

object FlatMapExample {

  // This function produces several rows of output for each line of input
  def splitSource(in: Input): Seq[Output] =
    in.Source.split("\\|", -1).map { source =>
      val Array(ccy, pv, date) = source.split(",", -1)
      Output(in.TradeId, ccy, pv, date)
    }

  def main(args: Array[String]): Unit = {

    // Initialization and loading
    val spark = SparkSession.builder().master("local").appName("pivoting-example").getOrCreate()
    import spark.implicits._
    val input = spark.read.options(Map("sep" -> "|", "header" -> "true")).csv(args(0)).as[Input]

    // For each line in the input, split the source and then
    // concatenate each "sub-sequence" into a single `Dataset`
    input.flatMap(splitSource).show
  }

}
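
Note that for spark.read to parse this correctly, the file passed as args(0) needs the Source field quoted, as in the question's sample, so that the embedded | characters are not taken as column separators:

TradeId|Source
ABC|"USD,333.123,20170605|USD,-789.444,20170605|GBP,1234.567,20150602"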

With your input, this would be the output:

+-------+---+--------+--------+
|TradeId|CCY|      PV|    Date|
+-------+---+--------+--------+
|    ABC|USD| 333.123|20170605|
|    ABC|USD|-789.444|20170605|
|    ABC|GBP|1234.567|20150602|
+-------+---+--------+--------+

If you want, you can now take the result and save it as CSV.
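
For example, a short sketch (the output path here is just a placeholder):

// Write the flattened rows out as a CSV with a header line
input.flatMap(splitSource)
  .write.option("header", "true")
  .csv("/tmp/trades-flat")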

Answer 1 (score: 2):

If you read the Databricks post on pivot, it says "A pivot is an aggregation where one (or more in the general case) of the grouping columns has its distinct values transposed into individual columns." That's not what you want, I guess.
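
For contrast, an actual pivot over the already-flattened data would look something like this sketch (assuming a hypothetical flatDF with TradeId/CCY/PV columns, PV cast to a numeric type, and import org.apache.spark.sql.functions.sum in scope):

// A true pivot: distinct CCY values become columns, aggregating PV per trade
val pivoted = flatDF.groupBy("TradeId").pivot("CCY").agg(sum("PV"))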

I suggest you use withColumn together with Spark's built-in functions to get the final output you want. Considering that the dataframe you have is
+-------+----------------------------------------------------------------+
|TradeId|Source                                                          |
+-------+----------------------------------------------------------------+
|ABC    |USD,333.123,20170605|USD,-789.444,20170605|GBP,1234.567,20150602|
+-------+----------------------------------------------------------------+

you can use explode, split, and withColumn as below to get the desired output:

import org.apache.spark.sql.functions.{col, explode, split}
import spark.implicits._  // for the $"..." column syntax

// One row per CCY,PV,Date triplet, then split each triplet into its own columns
val explodedDF = dataframe.withColumn("Source", explode(split(col("Source"), "\\|")))
val finalDF = explodedDF.withColumn("CCY", split($"Source", ",")(0))
  .withColumn("PV", split($"Source", ",")(1))
  .withColumn("Date", split($"Source", ",")(2))
  .drop("Source")

finalDF.show(false)

The final output is:

+-------+---+--------+--------+
|TradeId|CCY|PV      |Date    |
+-------+---+--------+--------+
|ABC    |USD|333.123 |20170605|
|ABC    |USD|-789.444|20170605|
|ABC    |GBP|1234.567|20150602|
+-------+---+--------+--------+
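
As a small variant (a sketch, not part of the original answer), you could split the string only once into a temporary array column instead of calling split three times:

// Split once into an array column, then pick its elements by index
val withParts = explodedDF.withColumn("parts", split($"Source", ","))
val finalDF2 = withParts
  .withColumn("CCY", $"parts"(0))
  .withColumn("PV", $"parts"(1))
  .withColumn("Date", $"parts"(2))
  .drop("Source", "parts")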

I hope this solves your problem.