Spark SQL:将工作日名称转换为整数

时间:2016-04-29 01:09:08

标签: scala apache-spark apache-spark-sql

我正在尝试将星期几的字符串名称(例如TUESDAY)转换为整数(例如3)。我在下面写了一张地图,但我不确定如何应用它。

import org.apache.spark.sql.functions.lower

val dayNameToInteger = Map(
    "sunday"    -> 1,
    "monday"    -> 2,
    "tuesday"   -> 3,
    "wednesday" -> 4,
    "thursday"  -> 5,
    "friday"    -> 6,
    "saturday"  -> 7)


var input = sqlContext.createDataFrame(Seq(
    (0L, "SUNDAY", 34),
    (1L, "Monday", 31),
    (2L, "tuesday", 25)
    )).toDF("id", "day_of_week", "value")


scala> input.show
+---+-----------+-----+
| id|day_of_week|value|
+---+-----------+-----+
|  0|     SUNDAY|   34|
|  1|     Monday|   31|
|  2|    tuesday|   25|
+---+-----------+-----+


var output = input.select($"id", dayNameToInteger(lower(input("day_of_week"))))
<console>:27: error: type mismatch;
 found   : org.apache.spark.sql.Column
 required: String
       var output = input.select($"id", dayNameToInteger(lower(input("day_of_week"))))

2 个答案:

答案 0 :(得分:0)

转换应该通过UDF完成,因为它可以对所有行上的列进行操作。

val dayToInt = udf((dayOfWeek:String) => {
    dayOfWeek match {
        case "sunday"    => 1
        case "monday"    => 2
        case "tuesday"   => 3 
        case "wednesday" => 4
        case "thursday"  => 5
        case "friday"    => 6
        case "saturday"  => 7
    }
})  


var output = input.select($"id", dayToInt(lower(input("day_of_week"))).as("day_int"))


scala> output.show
+---+-------+
| id|day_int|
+---+-------+
|  0|      1|
|  1|      2|
|  2|      3|
+---+-------+

答案 1 :(得分:0)

作为上述答案的改进,这里是我的Spark 2.x兼容的udf,将星期字符串索引为整数:

byte[] getImage() {
    ByteArrayOutputStream baos = new ByteArrayOutputStream()
    ImageIO.write(ImageIO.read(new File('/path/to/file')), "jpg", baos)
    baos.toByteArray()
}

还用于机器学习模型:

<img ... src="${createLink(controller: 'imageController', action: 'getImage')}" />