I have this dataframe:
val df = Seq(
("thin", "Cell phone", 6000, 150, "01/01/2018"),
("Normal", "Tablet", 1500, 200, "01/01/2018"),
("Mini", "Tablet", 2000, 250, "02/01/2018"),
("Ultra thin", "Cell phone", 5000, 300, "02/01/2018"),
("Very thin", "Cell phone", 6000, 400, "03/01/2018"),
("Big", "Tablet", 4500, 250, "03/01/2018"),
("Bendable", "Cell phone", 3000, 200, "04/01/2018"),
("Fordable", "Cell phone", 3000, 150, "05/01/2018"),
("Pro", "Cell phone", 4500, 300, "06/01/2018"),
("Pro2", "Tablet", 6500, 350, "04/01/2018")).toDF("product", "category",
"revenue", "extra", "date")
I am trying to add a Column to this dataframe that is computed from the columns revenue and extra. Say there is a min operation, so that I get a Column like this:
df.withColumn("output", min("revenue", "extra"))
The problem I have with the Spark functions is that the min and max aggregations are applied vertically, down a Column. My goal, however, is to apply them horizontally, across columns.
Thanks
Answer 0 (score: 0)
You need to use a UDF for this. Check this out.
scala> val df = Seq(
| ("thin", "Cell phone", 6000, 150, "01/01/2018"),
| ("Normal", "Tablet", 1500, 200, "01/01/2018"),
| ("Mini", "Tablet", 2000, 250, "02/01/2018"),
| ("Ultra thin", "Cell phone", 5000, 300, "02/01/2018"),
| ("Very thin", "Cell phone", 6000, 400, "03/01/2018"),
| ("Big", "Tablet", 4500, 250, "03/01/2018"),
| ("Bendable", "Cell phone", 3000, 200, "04/01/2018"),
| ("Fordable", "Cell phone", 3000, 150, "05/01/2018"),
| ("Pro", "Cell phone", 4500, 300, "06/01/2018"),
| ("Pro2", "Tablet", 6500, 350, "04/01/2018")).toDF("product", "category",
| "revenue", "extra", "date")
df: org.apache.spark.sql.DataFrame = [product: string, category: string ... 3 more fields]
scala> df.printSchema
root
|-- product: string (nullable = true)
|-- category: string (nullable = true)
|-- revenue: integer (nullable = false)
|-- extra: integer (nullable = false)
|-- date: string (nullable = true)
scala> def min2col(x: Int, y: Int): Int =
| if (x < y) x else y
min2col: (x: Int, y: Int)Int
scala> val myudfmin2col = udf( min2col(_:Int,_:Int):Int )
myudfmin2col: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,IntegerType,Some(List(IntegerType, IntegerType)))
scala> df.withColumn("output",myudfmin2col('extra,'revenue)).show(false)
+----------+----------+-------+-----+----------+------+
|product |category |revenue|extra|date |output|
+----------+----------+-------+-----+----------+------+
|thin |Cell phone|6000 |150 |01/01/2018|150 |
|Normal |Tablet |1500 |200 |01/01/2018|200 |
|Mini |Tablet |2000 |250 |02/01/2018|250 |
|Ultra thin|Cell phone|5000 |300 |02/01/2018|300 |
|Very thin |Cell phone|6000 |400 |03/01/2018|400 |
|Big |Tablet |4500 |250 |03/01/2018|250 |
|Bendable |Cell phone|3000 |200 |04/01/2018|200 |
|Fordable |Cell phone|3000 |150 |05/01/2018|150 |
|Pro |Cell phone|4500 |300 |06/01/2018|300 |
|Pro2 |Tablet |6500 |350 |04/01/2018|350 |
+----------+----------+-------+-----+----------+------+
scala>
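If you would rather avoid a UDF, Spark also ships the row-wise least and greatest functions in org.apache.spark.sql.functions; unlike the aggregate min/max, they compare values across columns within each row. A minimal sketch, assuming the same df as above:

import org.apache.spark.sql.functions.{least, greatest, col}

// least/greatest work horizontally: per row, across the given columns
val withMin = df.withColumn("output", least(col("revenue"), col("extra")))
val withMax = df.withColumn("maxout", greatest(col("revenue"), col("extra")))
withMin.show(false)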
EDIT1:
scala> df.createOrReplaceTempView("product")
scala> spark.sql("select product,category,revenue,extra,date, case when revenue<extra then revenue else extra end as minextra from product ").show(false)
+----------+----------+-------+-----+----------+--------+
|product |category |revenue|extra|date |minextra|
+----------+----------+-------+-----+----------+--------+
|thin |Cell phone|6000 |150 |01/01/2018|150 |
|Normal |Tablet |1500 |200 |01/01/2018|200 |
|Mini |Tablet |2000 |250 |02/01/2018|250 |
|Ultra thin|Cell phone|5000 |300 |02/01/2018|300 |
|Very thin |Cell phone|6000 |400 |03/01/2018|400 |
|Big |Tablet |4500 |250 |03/01/2018|250 |
|Bendable |Cell phone|3000 |200 |04/01/2018|200 |
|Fordable |Cell phone|3000 |150 |05/01/2018|150 |
|Pro |Cell phone|4500 |300 |06/01/2018|300 |
|Pro2 |Tablet |6500 |350 |04/01/2018|350 |
+----------+----------+-------+-----+----------+--------+
scala>
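The same CASE WHEN logic can also be expressed with when/otherwise in the DataFrame API; a sketch under the same assumptions:

import org.apache.spark.sql.functions.{when, col}

// per-row minimum of revenue and extra, equivalent to the CASE WHEN above
df.withColumn("minextra",
  when(col("revenue") < col("extra"), col("revenue")).otherwise(col("extra"))
).show(false)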