How to convert rows into columns using Spark-SQL?

Date: 2015-06-22 14:18:48

Tags: scala apache-spark apache-spark-sql

My table t1 contains the following data:

<script type="text/javascript">

    function printDiv(divName) {
         var printContents = document.getElementById(divName).innerHTML;
         var originalContents = document.body.innerHTML;
         document.body.innerHTML = printContents;
         window.print();
         document.body.innerHTML = originalContents;
    }

</script>


<div id="printableArea">CONTENT TO PRINT</div>



<input type="button" onclick="printDiv('printableArea')" value="Print Report" />

Expected output:

col1   | col2
-------+-------
sess-1 | read
sess-1 | meet
sess-1 | walk
sess-2 | watch
sess-2 | sleep
sess-2 | run
sess-2 | drive

I am using Spark 1.4.0.

2 answers:

Answer 0 (score: 0)

Check Spark's aggregateByKey:

scala> val babyNamesCSV = sc.parallelize(List(("David", 6), ("Abby", 4), ("David", 5), ("Abby", 5)))
babyNamesCSV: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:12

scala> babyNamesCSV.aggregateByKey(0)((k,v) => v.toInt+k, (v,k) => k+v).collect
res1: Array[(String, Int)] = Array((Abby,9), (David,11))

The example above should help you understand how aggregateByKey works.

Or see the Aggregator API: https://spark.apache.org/docs/0.6.0/api/core/spark/Aggregator.html
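
To connect that hint back to the question's data, here is a minimal sketch (mine, not the answerer's) that uses aggregateByKey to collect the values of each session key into a list. Unlike groupByKey, aggregateByKey merges values map-side before the shuffle; note that the order of values inside each list is not guaranteed.

// the question's (session, activity) pairs
val sessions = sc.parallelize(List(("sess-1","read"), ("sess-1","meet"),
    ("sess-1","walk"), ("sess-2","watch"), ("sess-2","sleep"),
    ("sess-2","run"), ("sess-2","drive")))

// zeroValue: an empty list per key
// seqOp:  prepend one value to the partial list within a partition
// combOp: concatenate partial lists coming from different partitions
val grouped = sessions.aggregateByKey(List.empty[String])(
    (acc, v) => v :: acc,
    (l, r) => l ++ r)

grouped.collect
// e.g. Array((sess-1,List(walk, meet, read)), (sess-2,List(drive, run, sleep, watch)))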

Answer 1 (score: 0)

// create the RDD
scala> val data = sc.parallelize(List(("sess-1","read"), ("sess-1","meet"),
    ("sess-1","walk"), ("sess-2","watch"), ("sess-2","sleep"),
    ("sess-2","run"), ("sess-2","drive")))

// groupByKey returns an Iterable[String] (a CompactBuffer)
scala> val dataCB = data.groupByKey()

// map each CompactBuffer to a List
scala> val tx = dataCB.map{case (col1, col2) => (col1, col2.toList)}.collect

data: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[211] at parallelize at <console>:26

dataCB: org.apache.spark.rdd.RDD[(String, Iterable[String])] = ShuffledRDD[212] at groupByKey at <console>:30

tx: Array[(String, List[String])] = Array((sess-1,List(read, meet, walk)), (sess-2,List(watch, sleep, run, drive)))

// groupByKey and the map to List can also be done in one statement
scala> val dataCB = data.groupByKey().map{case (col1, col2)
    => (col1, col2.toList)}.collect