I have this input DataFrame:

input_df:

| C1 | C2 | C3         |
|----|----|------------|
| A  | 1  | 12/06/2012 |
| A  | 2  | 13/06/2012 |
| B  | 3  | 12/06/2012 |
| B  | 4  | 17/06/2012 |
| C  | 5  | 14/06/2012 |
After a transformation, I want a DataFrame grouped by C1, with a new column C4 holding the list of (C2, C3) pairs:

output_df:

| C1 | C4                            |
|----|-------------------------------|
| A  | (1,12/06/2012),(2,13/06/2012) |
| B  | (3,12/06/2012),(4,17/06/2012) |
| C  | (5,14/06/2012)                |
When I try this:

```scala
// Note: in Spark 1.x, DataFrame.map yields an RDD[Row], so this is an
// RDD of (key, (C2, C3)) pairs, not a DataFrame.
val output_df = input_df.map(x => (x(0), (x(1), x(2)))).groupByKey()
```
I get this result:
```text
(A,CompactBuffer((1, 12/06/2012), (2, 13/06/2012)))
(B,CompactBuffer((3, 12/06/2012), (4, 17/06/2012)))
(C,CompactBuffer((5, 14/06/2012)))
```
But I don't know how to convert this back into a DataFrame, or whether this is even a good way to do it. Any suggestions are welcome, including entirely different approaches.
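A minimal sketch of the missing conversion step, assuming Spark 1.x (where `DataFrame.map` yields an `RDD[Row]`), an in-scope `sqlContext`, and that C1/C2/C3 are string/int/string:

```scala
import sqlContext.implicits._

// Extract typed values from each Row, group by C1, and turn each
// CompactBuffer into a Seq so toDF can encode it as an array column.
val output_df = input_df
  .map(x => (x.getString(0), (x.getInt(1), x.getString(2))))
  .groupByKey()
  .mapValues(_.toSeq)
  .toDF("C1", "C4")
```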
Answer 0 (score: 1)
```scala
// Please try this:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.Row

val conf = new SparkConf().setAppName("groupBy").setMaster("local[*]")
val sc = new SparkContext(conf)
sc.setLogLevel("WARN")

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

val rdd = sc.parallelize(Seq(
  ("A", 1, "12/06/2012"),
  ("A", 2, "13/06/2012"),
  ("B", 3, "12/06/2012"),
  ("B", 4, "17/06/2012"),
  ("C", 5, "14/06/2012")))

// Key each record by its first field, then group; mapValues turns each
// CompactBuffer into an Array so toDF can encode it as an array column.
val v1 = rdd.map(x => (x._1, x))
val v2 = v1.groupByKey()
val v3 = v2.mapValues(v => v.toArray)

val df2 = v3.toDF("aKey", "theValues")
df2.printSchema()

// Nested tuples come back as Rows: getSeq[Row] returns the array column,
// and each element's fields are accessed positionally.
val first = df2.first
println(first)
println(first.getString(0))

val values = first.getSeq[Row](1)
val firstArray = values(0)
println(firstArray.getString(0)) // e.g. B (row order is not deterministic)
println(firstArray.getInt(1))    // e.g. 3
println(firstArray.getString(2)) // e.g. 12/06/2012
```
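On newer versions this can also be done without dropping to RDDs. A minimal sketch using the DataFrame API, assuming Spark 2.x with a `SparkSession` (`collect_list` and `struct` are standard functions in `org.apache.spark.sql.functions`):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, struct}

val spark = SparkSession.builder()
  .appName("groupBy")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val input_df = Seq(
  ("A", 1, "12/06/2012"),
  ("A", 2, "13/06/2012"),
  ("B", 3, "12/06/2012"),
  ("B", 4, "17/06/2012"),
  ("C", 5, "14/06/2012")
).toDF("C1", "C2", "C3")

// Group by C1 and collect the (C2, C3) pairs of each group
// into a single array column C4.
val output_df = input_df
  .groupBy("C1")
  .agg(collect_list(struct($"C2", $"C3")).as("C4"))

output_df.show(false)
```

Here `collect_list(struct(...))` keeps each (C2, C3) pair together as one array element, so C4 has type `array<struct<C2:int,C3:string>>`.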