我有以下df:
+---+----+-----+
|sno|dept|color|
+---+----+-----+
| 1| fn| red|
| 2| fn| blue|
| 3| fn|green|
+---+----+-----+
如果任何颜色列值为red
,那么我应将颜色列的所有值更新为red
,如下所示:
+---+----+-----+
|sno|dept|color|
+---+----+-----+
| 1| fn| red|
| 2| fn| red|
| 3| fn| red|
+---+----+-----+
我无法理解。请帮忙;我试过以下代码:
val gp=jdbcDF.filter($"dept".contains("fn"))
//.withColumn("newone",when($"dept"==="fn","RED").otherwise("NULL"))
gp.show()
gp.map(
row=>{
val row1=row.getAs[String](1)
var row2=row.getAs[String](2)
val make=if(row1 =="fn") row2="red"
Row(row(0),row(1),make)
}
).collect().foreach(println)
答案 0 :(得分:10)
假设:
val df = Seq(
(1, "fn", "red"),
(2, "fn", "blue"),
(3, "fn", "green"),
(4, "aa", "blue"),
(5, "aa", "green"),
(6, "bb", "red"),
(7, "bb", "red"),
(8, "aa", "blue")
).toDF("id", "fn", "color")
进行计算:
val redOrNot = df.groupBy("fn")
.agg(collect_set('color) as "values")
.withColumn("hasRed", array_contains('values, "red"))
// gives null for no option
val colorPicker = when('hasRed, "red")
val result = df.join(redOrNot, "fn")
.withColumn("resultColor", colorPicker)
.withColumn("color", coalesce('resultColor, 'color)) // skips nulls that leads to the answer
.select('id, 'fn, 'color)
result
看起来如下(这似乎是一个答案):
scala> result.show
+---+---+-----+
| id| fn|color|
+---+---+-----+
| 1| fn| red|
| 2| fn| red|
| 3| fn| red|
| 4| aa| blue|
| 5| aa|green|
| 6| bb| red|
| 7| bb| red|
| 8| aa| blue|
+---+---+-----+
您可以链接when
运算符,并使用otherwise
的默认值。咨询scaladoc of when
operator。
我认为你可以使用窗口操作符或用户定义的聚合函数(UDAF)做一些非常相似(也许更有效)的事情,但是......好吧......目前还不知道怎么做。将评论留在这里以激励他人; - )
P.S。学到了很多!谢谢你的想法!
答案 1 :(得分:8)
高效的解决方案,不需要昂贵的分组:
// All groups with `red`
df.where($"color" === "red").select($"fn".alias("fn_")).distinct
// Join with input
.join(df.as("df"), $"fn" === $"fn_", "rightouter")
// Replace `color`
.withColumn("color", when($"fn_"isNull, $"color").otherwise(lit("red")))
.drop("fn_")
答案 2 :(得分:4)
如果DataFrame满足某个属性,则会有条件地更新它。在这种情况下,属性为"颜色列包含红色'"。表达这一点的惯用方法是使用所需的谓词进行过滤,然后确定是否有任何行满足它。没有必要加入。
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.DataFrame
def makeAllRedIfAnyAreRed(df: DataFrame) = {
val containsRed = df.filter(df("color") === "red").count() > 0
if (containsRed) df.withColumn("color", lit("red")) else df
}
答案 3 :(得分:2)
由于过滤后的数据框中可能只有很少的行,因此我添加了isin()
和.withColumn()
组合的解决方案。
示例DataFrame
val df = Seq(
(1, "fn", "red"),
(2, "fn", "blue"),
(3, "fn", "green"),
(4, "aa", "blue"),
(5, "aa", "green"),
(6, "bb", "red"),
(7, "bb", "red"),
(8, "aa", "blue")
).toDF("id", "dept", "color")
现在,我们只选择至少有一个红色 dept
行的color
,并将其放在broadcast
变量中,如下所示。< / p>
val depts = sc.broadcast(df.filter($"color" === "red").select(collect_set("dept")).first.getSeq[String](0)))
更新已过滤的depts
条记录的红色颜色。
isin()
采用vararg转换列表为vararg(depts.value:_*
)
//creating new column by giving diff name (clr) to see the diff
val result = df.withColumn("clr", when($"dept".isin(depts.value:_*),lit("red"))
.otherwise($"color"))
result.show()
+---+----+-----+-----+
| id|dept|color| clr|
+---+----+-----+-----+
| 1| fn| red| red|
| 2| fn| blue| red|
| 3| fn|green| red|
| 4| aa| blue| blue|
| 5| aa|green|green|
| 6| bb| red| red|
| 7| bb| red| red|
| 8| aa| blue| blue|
+---+----+-----+-----+
答案 4 :(得分:1)
Spark 2.2.0: 样本数据帧(取自上面的例子)
val df = Seq(
(1, "fn", "red"),
(2, "fn", "blue"),
(3, "fn", "green"),
(4, "aa", "blue"),
(5, "aa", "green"),
(6, "bb", "red"),
(7, "bb", "red"),
(8, "aa", "blue")
).toDF("id", "dept", "color")
通过检查条件创建了一个UDF来执行更新。
val replace_val = udf((x: String,y:String) => if (Option(x).getOrElse("").equalsIgnoreCase("fn")&&(!y.equalsIgnoreCase("red"))) "red" else y)
val final_df = df.withColumn("color", replace_val($"dept",$"color"))
final_df.show()
输出:
火花1.6:val conf = new SparkConf().setMaster("local").setAppName("My app")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
// For implicit conversions like converting RDDs to DataFrames
val df = sc.parallelize(Seq(
(1, "fn", "red"),
(2, "fn", "blue"),
(3, "fn", "green"),
(4, "aa", "blue"),
(5, "aa", "green"),
(6, "bb", "red"),
(7, "bb", "red"),
(8, "aa", "blue")
) ).toDF("id","dept","color")
val replace_val = udf((x: String,y:String) => if (Option(x).getOrElse("").equalsIgnoreCase("fn")&&(!y.equalsIgnoreCase("red"))) "red" else y)
val final_df = df.withColumn("color", replace_val($"dept",$"color"))
final_df.show()