Spark group by - Pig conversion

Time: 2016-09-03 14:27:58

Tags: apache-spark apache-pig

I want to achieve something like this in Spark. The following snippet is Pig Latin. Is there any way I can do the same thing with Spark?

    A = load 'student' AS (name:chararray,age:int,gpa:float);
    DESCRIBE A; 

    A: {name: chararray,age: int,gpa: float}
    DUMP A;
    (John,18,4.0F)
    (Mary,19,3.8F)
    (Bill,20,3.9F)
    (Joe,18,3.8F)

    B = GROUP A BY age;

    Result:
    (18,{(John,18,4.0F),(Joe,18,3.8F)})
    (19,{(Mary,19,3.8F)})
    (20,{(Bill,20,3.9F)})

Thanks.

1 answer:

Answer 0 (score: 0)

Grouping the names by age is easy. I believe the Spark API does not let you collect the complete rows and get a list of full rows in the same way.

    // Input data: build a DataFrame equivalent to Pig's relation A
    import org.apache.spark.sql.functions.{col, collect_list}

    val df = {
        import org.apache.spark.sql._
        import org.apache.spark.sql.types._
        import scala.collection.JavaConverters._

        // Schema matching (name:chararray, age:int, gpa:float)
        val simpleSchema = StructType(
            StructField("name", StringType) ::
            StructField("age", IntegerType) ::
            StructField("gpa", FloatType) :: Nil)

        val data = List(
            Row("John", 18, 4.0f),
            Row("Mary", 19, 3.8f),
            Row("Bill", 20, 3.9f),
            Row("Joe", 18, 3.8f)
        )

        spark.createDataFrame(data.asJava, simpleSchema)
    }
    df.show()

    // Group by age and collect the names in each group
    val df2 = df.groupBy(col("age")).agg(collect_list(col("name")))
    df2.show()
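
If you also want the complete rows per age group, closer to Pig's `(18,{(John,18,4.0F),(Joe,18,3.8F)})` output, one possible sketch (assuming Spark 2.x or later, where `struct` can be combined with `collect_list`) is to collect a list of structs:

    // Sketch, assuming Spark 2.x+: collect whole rows per age group,
    // approximating Pig's (age, {bag of tuples}) result
    import org.apache.spark.sql.functions.{col, collect_list, struct}

    val grouped = df
        .groupBy(col("age"))
        .agg(collect_list(struct(col("name"), col("age"), col("gpa"))).as("students"))
    grouped.show(truncate = false)

Each age group then appears as a single row whose `students` column holds the collected tuples.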