How to partition my field values in a DataFrame in Scala

Date: 2016-08-19 11:42:46

Tags: scala apache-spark spark-dataframe

I have a DataFrame with the following schema:

root
|-- school: string (nullable = true)
|-- questionName: string (nullable = true)
|-- difficultyValue: double (nullable = true)

The data looks like this:

school   | questionName | difficultyValue
school1  | q1           | 0.32
school1  | q2           | 0.13
school1  | q3           | 0.58
school1  | q4           | 0.67
school1  | q5           | 0.59
school1  | q6           | 0.43
school1  | q7           | 0.31
school1  | q8           | 0.15
school1  | q9           | 0.21
school1  | q10          | 0.92

Now I want to bucket the field "difficultyValue" by its value and transform this DataFrame into a new one with the following schema:

root
|-- school: string (nullable = true)
|-- difficulty1: double (nullable = true)
|-- difficulty2: double (nullable = true)
|-- difficulty3: double (nullable = true)
|-- difficulty4: double (nullable = true)
|-- difficulty5: double (nullable = true)

The new table looks like this:

school   | difficulty1 | difficulty2 | difficulty3 | difficulty4 | difficulty5
school1  | 2           | 3           | 3           | 1           | 1

The value of field "difficulty1" is the count of rows with "difficultyValue" < 0.2;

the value of field "difficulty2" is the count of rows with "difficultyValue" >= 0.2 and < 0.4;

the value of field "difficulty3" is the count of rows with "difficultyValue" >= 0.4 and < 0.6;

the value of field "difficulty4" is the count of rows with "difficultyValue" >= 0.6 and < 0.8;

the value of field "difficulty5" is the count of rows with "difficultyValue" >= 0.8 and < 1.0.

I don't know how to do this transformation. How can I achieve it?

2 Answers:

Answer 0 (score: 1)

// First create a test data frame with the schema of your given source.
val df = {
    import org.apache.spark.sql._
    import org.apache.spark.sql.types._
    import scala.collection.JavaConverters._

    val simpleSchema = StructType(
        StructField("school", StringType, false) ::
        StructField("questionName", StringType, false) ::
        StructField("difficultyValue", DoubleType) :: Nil)

    val data = List(
        Row("school1", "q1", 0.32),
        Row("school1", "q2", 0.45),
        Row("school1", "q3", 0.22),
        Row("school1", "q4", 0.12),
        Row("school2", "q1", 0.32),
        Row("school2", "q2", 0.42),
        Row("school2", "q3", 0.52),
        Row("school2", "q4", 0.62)
    )    

    spark.createDataFrame(data.asJava, simpleSchema)
}
// Add a new column that is the 1-5 category.
val df2 = df.withColumn("difficultyCat", floor(col("difficultyValue").multiply(5.0)) + 1)
// groupBy and pivot to get the final view that you want.
// We know the category values 1-5 beforehand; if you don't, you can omit
// the Seq from pivot, at a performance cost.
val df3 = df2.groupBy("school").pivot("difficultyCat", Seq(1, 2, 3, 4, 5)).count()

df3.show()
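One boundary case worth noting: a difficultyValue of exactly 1.0 yields floor(5.0) + 1 = 6, which is not in the pivot Seq(1, 2, 3, 4, 5), so such a row would silently disappear from the counts. A minimal sketch of a clamped variant, shown first as a plain function with the Spark wiring as a hypothetical comment:

```scala
// The category formula from above, clamped to 5 so that a
// difficultyValue of exactly 1.0 lands in the top bucket instead of
// an unlisted category 6.
def difficultyCat(v: Double): Long =
  math.min(math.floor(v * 5.0).toLong + 1, 5L)

// In the DataFrame itself the clamp could look roughly like this
// (sketch; `least`, `lit`, `floor`, and `col` come from
// org.apache.spark.sql.functions):
//   val df2 = df.withColumn("difficultyCat",
//     least(floor(col("difficultyValue").multiply(5.0)) + 1, lit(5)))
```

Values strictly below 1.0 are unaffected by the clamp, so the counts for the sample data stay the same.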

Answer 1 (score: 0)

The following function:

def valueToIndex(v: Double): Int = scala.math.ceil(v*5).toInt

will determine the desired index from the difficulty value, since you want 5 evenly sized bins. You can use this function to create a new derived column with withColumn and a udf, and then use pivot with count to get the number of rows per index.
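A quick spot-check of this function, with the Spark wiring sketched as a hypothetical comment (it assumes the `df` from answer 0). Note two boundary cases: valueToIndex(0.0) returns 0, and ceil puts exact boundaries such as 0.2 into the lower bucket, whereas the question's ">= 0.2" rule assigns them to the upper one; the floor-based formula in answer 0 matches the stated rule.

```scala
// The bucketing function from this answer: ceil(v * 5) for 5 even bins.
def valueToIndex(v: Double): Int = scala.math.ceil(v * 5).toInt

// Hypothetical wiring into the DataFrame (assumes a SparkSession
// `spark` and the `df` built in answer 0):
//   import org.apache.spark.sql.functions.{udf, col}
//   val toIndex = udf(valueToIndex _)
//   val counts = df
//     .withColumn("difficultyCat", toIndex(col("difficultyValue")))
//     .groupBy("school")
//     .pivot("difficultyCat", Seq(1, 2, 3, 4, 5))
//     .count()
```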